Yaspeller Dictionary (Auto)builder

Usage

# this sample command generates `./yaspeller_report.json`
# yaspeller --report json --ignore-digits --ignore-text "'.*" --ignore-latin --only-errors --file-extensions ".md" --lang ru

python -m venv env
source env/bin/activate
pip install 
python src/dictionary.py yaspeller_report.json

Why

Yaspeller is nice, but typical documentation contains a lot of anglicisms. Normally you just want to ignore them, but the only way to do that is to supply an array of regular expressions for the words to ignore.

This generates an array of dictionary entries, with each lexeme expanded to all of its case forms, like

[
    "[бБ]аг(а|ам|ами|ах|е|и|ов|ом|у)?",
    "[дД]ифф(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "[кК]оммит(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "[пП]атчинг(а|ам|ами|ах|е|и|ов|ом|у)?",
    "[рР]убист(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "[сС]амоорганизованн(ого|ом|ому|ую|ые|ый|ым|ыми|ых)",
    "[тТ]икет(а|ам|ами|ах|е|ов|ом|у|ы)?",
    "коммитить"
]

It builds them from yaspeller errors, which look like this in text format:

Spelling check:
✗ www.ruby-lang.org/ru/community/ruby-core/index.md 130 ms
-----
Typos: 9
1. патчингом (36:27)
2. коммитить (68:32, suggest: комитет)
3. багах (75:15, suggest: богах, баках, бегах)
4. баги (89:24, suggest: багги)
5. баг (96:25)
6. тикет (107:14, suggest: этикет)
7. дифф (115:18)
8. коммиту (147:24, suggest: комету, комнату)
9. коммита (148:58, suggest: комета)
-----
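
The two steps above (reading flagged words from a report, then collapsing word forms into one pattern) can be sketched roughly as follows. This is a minimal illustration and not the code in `src/dictionary.py`: the names `parse_typos` and `forms_to_pattern` are hypothetical, the parser reads the text format shown above rather than the JSON report, and it only collapses the forms it is given. Producing every case form, as in the sample array, would additionally need a morphological analyzer and grouping by lemma, both left out here.

```python
import os
import re

# Matches report lines such as "3. багах (75:15, suggest: богах, баках, бегах)".
TYPO_LINE = re.compile(r"^\s*\d+\.\s+(\S+)\s+\(\d+:\d+")


def parse_typos(report_text):
    """Return the flagged words from a yaspeller text report, in order."""
    words = []
    for line in report_text.splitlines():
        match = TYPO_LINE.match(line)
        if match:
            words.append(match.group(1))
    return words


def forms_to_pattern(forms):
    """Collapse forms of ONE word into a single regex string.

    Assumption: the caller has already grouped the forms by lemma.
    """
    forms = sorted({form.lower() for form in forms})
    if len(forms) == 1:
        return forms[0]
    stem = os.path.commonprefix(forms)
    if not stem:
        # No shared stem: fall back to a plain alternation.
        return "(" + "|".join(forms) + ")"
    endings = sorted({form[len(stem):] for form in forms} - {""})
    # The ending group is optional only when the bare stem is itself a form.
    optional = "?" if stem in forms else ""
    head = f"[{stem[0]}{stem[0].upper()}]{stem[1:]}"
    return f"{head}({'|'.join(endings)}){optional}"


report = """Typos: 3
1. багах (75:15, suggest: богах, баках, бегах)
2. баги (89:24, suggest: багги)
3. баг (96:25)
"""
print(forms_to_pattern(parse_typos(report)))  # [бБ]аг(ах|и)?
```

Sorting the endings keeps the generated array stable between runs, which makes diffs of the dictionary file easier to review.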

Live example

Initially created for spellchecking the www.ruby-lang.org translations.

🤕 spelling exceptions builder for lazy people

Owner

Vlad Bokov
