Persian Lexicon

This repo uses the Uppsala Persian Corpus (UPC) to construct a lexicon of 70664 unique words. With all the excitement around the game Wordle, we also extracted words of different lengths (2, 3, 4, ..., 10) and stored them in separate files for easier access. Please note that these files might contain offensive words; I have not checked them manually.

GetWords.py can read these files and return words as a list of strings.
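As a rough illustration of what reading one of these files involves, here is a minimal sketch. It is not the code in GetWords.py: the function name read_words is hypothetical, and it assumes each file is UTF-8 text with one word per line.

```python
from pathlib import Path


def read_words(path):
    """Return the non-empty lines of a UTF-8 word file as a list of strings.

    Assumption: the file holds one word per line. This helper is a
    hypothetical stand-in, not the actual interface of GetWords.py.
    """
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and skip blank lines.
    return [line.strip() for line in text.splitlines() if line.strip()]


# Example (path taken from this README):
# words = read_words("data/persian-words.txt")
```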

Cleanup details

Main Lexicon

The main lexicon (data/persian-words.txt) is built very liberally; we only filter out words that contain ASCII characters or Arabic numerals.
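The liberal filter described above could look roughly like the sketch below. The name is_clean is hypothetical, and the handling of "Arabic numerals" is an assumption: here every Unicode decimal digit is rejected, which covers ASCII 0-9 as well as the Arabic-Indic and Persian digit forms. The repo's own rule may be narrower.

```python
def is_clean(word):
    """Return True if the word has no ASCII characters and no decimal digits.

    Assumption: "Arabic numerals" is read broadly as any Unicode decimal
    digit (str.isdecimal), not only ASCII 0-9.
    """
    if not word:
        return False
    return not any(ch.isascii() or ch.isdecimal() for ch in word)


def filter_lexicon(words):
    """Keep only the unique words that pass is_clean, in sorted order."""
    return sorted({w for w in words if is_clean(w)})
```

Using a set before sorting removes duplicates, which matches the "unique words" wording used throughout this README.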

Fixed length Lexicons

More conservative filtering has been applied to files with fixed word length. We drop all words that contain any of the following characters:

After applying these filters, we ended up with the following number of words per file:

2 letter words: 310 unique words
3 letter words: 2378 unique words
4 letter words: 7059 unique words
5 letter words: 10043 unique words
6 letter words: 9541 unique words
7 letter words: 7350 unique words
8 letter words: 4681 unique words
9 letter words: 2529 unique words
10 letter words: 1250 unique words
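For readers who want to reproduce per-length counts like the ones above from their own word list, a minimal sketch follows. The function name group_by_length is hypothetical, and it assumes word length is measured in Unicode code points via len(); the repo may count letters differently, for example around zero-width joiners or diacritics.

```python
from collections import defaultdict


def group_by_length(words, min_len=2, max_len=10):
    """Group unique words by length, keeping lengths in [min_len, max_len].

    Assumption: length is the number of Unicode code points in the word.
    """
    groups = defaultdict(set)
    for word in words:
        if min_len <= len(word) <= max_len:
            groups[len(word)].add(word)
    # Return plain sorted lists, ordered by word length.
    return {n: sorted(ws) for n, ws in sorted(groups.items())}


# Example: print a summary in the same shape as the list above.
# for n, ws in group_by_length(words).items():
#     print(f"{n} letter words: {len(ws)} unique words")
```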

Owner

Saman Vaisipour