A simple implementation of an N-gram language model.

Last update: Nov 24, 2021

Related tags

Text Data & NLP n-gram

Overview

About

A simple implementation of an N-gram language model.

Requirements

numpy

Data preparation

Corpus

The training data for the N-gram model is a text file with one text per line, like this:

曼联加油
懂球直播
有也免费高清的额
直播挺全的
曼联这局肯定胜利

During training, each line is split into tokens by a delimiter. If no delimiter is given (the default), each line is split into individual characters.
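The splitting rule above can be sketched as follows. This is a minimal illustration, not the code in train_n_gram.py: the function name `split_line` and its `delimiter` parameter are hypothetical, and dropping empty pieces is an assumption.

```python
def split_line(line, delimiter=None):
    """Split one corpus line into tokens.

    With no delimiter, every character becomes a token. Otherwise the
    line is split on the delimiter and empty pieces are dropped
    (dropping empties is an assumption made for this sketch).
    """
    line = line.strip()
    if not delimiter:
        return list(line)
    return [piece for piece in line.split(delimiter) if piece]


print(split_line("曼联加油"))        # ['曼', '联', '加', '油']
print(split_line("曼联 加油", " "))  # ['曼联', '加油']
```

Character-level splitting suits Chinese text, which has no spaces between words. A delimiter is useful when the corpus has already been segmented into words.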

Tokens

The dictionary (vocabulary) for the model is a text file with one token per line. Each token must appear only once in the file.

光
衰
戒
颅
阖
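A token file in this format could be loaded roughly as shown below. The helper name `load_tokens` is hypothetical, and mapping each token to its line index is an assumption. The repository's scripts may store the dictionary differently.

```python
def load_tokens(path):
    """Read one token per line and map each token to an integer index.

    Raises ValueError on a repeated token, since the file is expected
    to list every token exactly once.
    """
    token_to_index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token = line.rstrip("\r\n")
            if not token:
                continue  # skip blank lines (an assumption of this sketch)
            if token in token_to_index:
                raise ValueError(f"duplicate token: {token!r}")
            token_to_index[token] = len(token_to_index)
    return token_to_index
```

With the paths used elsewhere in this README, a call would look like load_tokens("data/charset.txt").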

Training

Run the script train_n_gram.py to train an N-gram model.

python train_n_gram.py --corpus_path data/tieba.dialogues --token_path data/charset.txt --model_path data/2-gram.model --n 2
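For readers unfamiliar with count-based N-gram models, the sketch below shows one common way to train one: count how often each token follows each (n-1)-token prefix. It is not taken from train_n_gram.py. The name `head2tail` is borrowed from the test log in this README, but the data layout used here (a dictionary from prefix tuples to tail counters) is a guess, and the sketch uses only the standard library rather than numpy.

```python
from collections import Counter, defaultdict


def count_ngrams(lines, n=2):
    """Count, for every (n-1)-token head, how often each tail token follows it.

    Lines are split into characters, matching the default described in
    the Corpus section.
    """
    head2tail = defaultdict(Counter)
    for line in lines:
        tokens = list(line.strip())
        for i in range(len(tokens) - n + 1):
            head = tuple(tokens[i:i + n - 1])  # the n-1 context tokens
            tail = tokens[i + n - 1]           # the token that follows them
            head2tail[head][tail] += 1
    return head2tail


corpus = ["曼联加油", "曼联这局肯定胜利"]
counts = count_ngrams(corpus, n=2)
print(counts[("曼",)])  # Counter({'联': 2})
```

Dividing each count by the total for its head gives the conditional probability of the tail given the head.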

Testing

Run the script test_n_gram.py to test the trained N-gram model.

python test_n_gram.py --token_path data/charset.txt --model_path data/2-gram.model --text 哈哈

The test output will look like this:

INFO - Loaded model from data/2-gram.model
INFO - Model info:
	n: 2
	head2tail length: 5947
	tokens: 5952
The most probable next token of the '哈哈' is '哈'.
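The prediction in the last log line can be understood as a lookup: take the final n-1 tokens of the input text and return the tail seen most often after them. The sketch below assumes a count table shaped as a dictionary from prefix tuples to counters. The function name `most_probable_next` and the toy counts are hypothetical and are not read from the trained model file.

```python
from collections import Counter


def most_probable_next(head2tail, text, n=2):
    """Return (token, probability) for the likeliest next token, or None.

    The text is split into characters, and the last n-1 of them form
    the head that is looked up in the count table.
    """
    tokens = list(text)
    head = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    tails = head2tail.get(head)
    if not tails:
        return None  # head never seen during training
    token, count = tails.most_common(1)[0]
    return token, count / sum(tails.values())


# Toy counts made up for illustration only.
toy_counts = {("哈",): Counter({"哈": 3, "。": 1})}
print(most_probable_next(toy_counts, "哈哈"))  # ('哈', 0.75)
```

Returning None for an unseen head keeps the sketch short. A fuller model would usually back off to a shorter context or apply smoothing.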
