A simple implementation of an N-gram language model.

Last update: Nov 24, 2021

Overview

About

A simple implementation of an N-gram language model.

Requirements

numpy

Data preparation

Corpus

The training data for the N-gram model is a plain text file, for example:

曼联加油
懂球直播
有也免费高清的额
直播挺全的
曼联这局肯定胜利

During training, each text line is split into tokens by a delimiter. If no delimiter is given (the default), each line is split into individual characters.
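The splitting rule can be illustrated with a minimal sketch. The function name split_line and its delimiter argument are hypothetical and are not taken from train_n_gram.py; the sketch only mirrors the behaviour described above.

```python
def split_line(line, delimiter=None):
    """Split one corpus line into tokens (illustrative helper, not the repo's API).

    Without a delimiter, every character becomes a token; otherwise the line
    is split on the delimiter and empty pieces are dropped.
    """
    line = line.strip()
    if not delimiter:
        return list(line)
    return [tok for tok in line.split(delimiter) if tok]


print(split_line("曼联加油"))        # ['曼', '联', '加', '油']
print(split_line("曼联 加油", " "))  # ['曼联', '加油']
```

Character-level splitting suits Chinese text like the sample corpus, where words are not separated by spaces.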

Tokens

The dictionary (vocabulary) for the model is a text file with one token per line. Each token appears only once in the file.

光
衰
戒
颅
阖
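A token file in this format could be read into a token-to-index mapping roughly as follows. This is a sketch under assumptions: load_tokens is a hypothetical helper, and rejecting duplicates is one possible way to enforce the uniqueness rule, not necessarily what the repository does.

```python
import io


def load_tokens(lines):
    """Map each token to its position in the file (hypothetical helper).

    `lines` is any iterable of text lines, for example an open file.
    Raises ValueError on a duplicate, since tokens must be unique.
    """
    token_to_id = {}
    for raw in lines:
        token = raw.rstrip("\n")
        if not token:
            continue  # skip blank lines
        if token in token_to_id:
            raise ValueError(f"duplicate token: {token!r}")
        token_to_id[token] = len(token_to_id)
    return token_to_id


# An in-memory file stands in for a token file on disk.
sample = io.StringIO("光\n衰\n戒\n颅\n阖\n")
print(load_tokens(sample))  # {'光': 0, '衰': 1, '戒': 2, '颅': 3, '阖': 4}
```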

Training

Run the script train_n_gram.py to train an N-gram model.

python train_n_gram.py --corpus_path data/tieba.dialogues --token_path data/charset.txt --model_path data/2-gram.model --n 2
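For readers unfamiliar with N-gram training, the core idea is counting how often each token follows each context of n-1 tokens. The sketch below is not the code in train_n_gram.py. The name head2tail is borrowed from the model info printed by the test script, but its structure here (a dict of Counters) is an assumption, as is the absence of padding and smoothing.

```python
from collections import Counter, defaultdict


def count_ngrams(token_lines, n=2):
    """Count tails following each (n-1)-token head (illustrative only)."""
    head2tail = defaultdict(Counter)
    for tokens in token_lines:
        # Slide a window of n tokens over the line.
        for i in range(len(tokens) - n + 1):
            head = tuple(tokens[i:i + n - 1])
            tail = tokens[i + n - 1]
            head2tail[head][tail] += 1
    return head2tail


corpus = ["曼联加油", "曼联这局肯定胜利"]
counts = count_ngrams([list(line) for line in corpus], n=2)
print(counts[("曼",)])  # Counter({'联': 2})
```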

Testing

Run the script test_n_gram.py to test the trained N-gram model.

python test_n_gram.py --token_path data/charset.txt --model_path data/2-gram.model --text 哈哈

The test output will look like this:

INFO - Loaded model from data/2-gram.model
INFO - Model info:
	n: 2
	head2tail length: 5947
	tokens: 5952
The most probable next token of the '哈哈' is '哈'.
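Picking the most probable next token amounts to taking the tail with the highest count for the last n-1 tokens of the input. The sketch below assumes a dict-of-Counters table; most_probable_next is a hypothetical name and the toy counts are invented for illustration, so this is not the logic of test_n_gram.py.

```python
from collections import Counter


def most_probable_next(head2tail, context, n=2):
    """Return (token, probability) for the likeliest next token, or None."""
    if n < 2:
        raise ValueError("this sketch needs n >= 2")
    head = tuple(context[-(n - 1):])  # last n-1 tokens of the context
    tails = head2tail.get(head)
    if not tails:
        return None  # context never seen in training
    token, count = tails.most_common(1)[0]
    return token, count / sum(tails.values())


# Toy counts, invented for this example.
toy = {("哈",): Counter({"哈": 3, "笑": 1})}
print(most_probable_next(toy, list("哈哈")))  # ('哈', 0.75)
```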
