Part of Speech Tagging using Hidden Markov Model (HMM) POS Tagger and Brill Tagger

Last update: Dec 25, 2021

Overview

In this project, we tune and compare two part-of-speech taggers: the Hidden Markov Model (HMM) POS tagger and the Brill POS tagger. We train both taggers on data from one domain, then measure how accurately each predicts tag sequences on test data from the same domain (in-domain) and on test data from a different domain (out-of-domain).
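The comparison described above can be sketched with NLTK as follows. This is a minimal illustration, not the code in src/main.py: the toy sentences, the helper names (train_hmm, train_brill, token_accuracy), the Lidstone smoothing value, the unigram baseline, and the brill24 template set are all assumptions made for the example. It needs nltk and numpy installed.

```python
from nltk.probability import LidstoneProbDist
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer
from nltk.tag.hmm import HiddenMarkovModelTrainer
from nltk.tbl import Template

# Toy tagged sentences standing in for the project's corpora (made up for illustration).
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
TEST = [[("a", "DT"), ("cat", "NN"), ("barks", "VBZ")]]


def train_hmm(train_sents):
    """Train a supervised HMM tagger (hypothetical helper)."""
    # Assumed choice: Lidstone smoothing (gamma=0.1) so unseen events keep a non-zero probability.
    def estimator(fdist, bins):
        return LidstoneProbDist(fdist, 0.1, bins)

    return HiddenMarkovModelTrainer().train_supervised(train_sents, estimator=estimator)


def train_brill(train_sents, max_rules=10):
    """Train a Brill tagger on top of a unigram baseline (hypothetical helper)."""
    # A Brill tagger starts from an initial tagger and learns correction rules for its errors.
    baseline = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))
    Template._cleartemplates()  # reset the global template registry before building templates
    trainer = BrillTaggerTrainer(baseline, brill24(), trace=0)
    return trainer.train(train_sents, max_rules=max_rules)


def token_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tagger.tag(words)
        correct += sum(p == g for (_, p), (_, g) in zip(predicted, sent))
        total += len(sent)
    return correct / total if total else 0.0


if __name__ == "__main__":
    for name, tagger in (("hmm", train_hmm(TRAIN)), ("brill", train_brill(TRAIN))):
        print(name, token_accuracy(tagger, TEST))
```

Calling token_accuracy once on an in-domain test set and once on an out-of-domain test set, for each tagger, gives the four numbers the comparison is built on.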

How to Execute?

To run this project:

Download the repository as a zip file.
Extract the zip to get the project folder.
Open Terminal in the directory you extracted the project folder to.
Change directory to the project folder using:

cd part-of-speech-taggers-main
Install the required libraries, NLTK and scikit-learn, using the following commands:

pip3 install nltk

pip3 install -U scikit-learn
To execute the code, run either of the following commands from the project folder:

HMM Tagger Predictions:

python3 src/main.py --tagger hmm --train data/train.txt --test data/test.txt --output output/test_hmm.txt

Brill Tagger Predictions:

python3 src/main.py --tagger brill --train data/train.txt --test data/test.txt --output output/test_brill.txt

Description of the execution command

Our program, src/main.py, takes four command-line options: --tagger selects the tagger type, --train gives the path to the training corpus, --test gives the path to the test corpus, and --output gives the path of the output file.

The two possible values for the --tagger option are:

hmm for the Hidden Markov Model POS Tagger
brill for the Brill POS Tagger
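An interface like the one described above could be declared with Python's argparse module. This is a sketch only: the actual parsing code in src/main.py is not shown here, and build_parser, the help strings, and the choice to make every option required are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the four options described above (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Train a POS tagger and write its predictions to a file."
    )
    # Restricting choices makes argparse reject any tagger name other than hmm or brill.
    parser.add_argument("--tagger", choices=["hmm", "brill"], required=True,
                        help="tagger type: hmm or brill")
    parser.add_argument("--train", required=True, help="path to the training corpus")
    parser.add_argument("--test", required=True, help="path to the test corpus")
    parser.add_argument("--output", required=True, help="path of the output file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.tagger, args.train, args.test, args.output)
```

With choices set this way, a value such as --tagger crf exits with a usage error before any training starts.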

The training data can be found in data/train.txt, the in-domain test data can be found in data/test.txt, and the out-of-domain test data can be found in data/test_ood.txt.

The output file must be generated in the output/ directory.

With these paths, one possible execution command is:

python3 src/main.py --tagger hmm --train data/train.txt --test data/test.txt --output output/test_hmm.txt
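To cover both taggers and both test sets, the same command can be generated for each combination. The sketch below only builds the argument lists and does not run anything. The build_commands helper is hypothetical, and the output names for the out-of-domain runs (output/test_ood_hmm.txt, output/test_ood_brill.txt) are an assumed naming scheme that extends the output/test_hmm.txt pattern.

```python
import shlex

TRAIN_PATH = "data/train.txt"
# Test sets named in this README, keyed by the stem used for the output file name.
TEST_SETS = {"test": "data/test.txt", "test_ood": "data/test_ood.txt"}
TAGGERS = ("hmm", "brill")


def build_commands(taggers=TAGGERS, test_sets=TEST_SETS, train_path=TRAIN_PATH):
    """Return one argv list per (tagger, test set) pair."""
    commands = []
    for tagger in taggers:
        for stem, test_path in test_sets.items():
            commands.append([
                "python3", "src/main.py",
                "--tagger", tagger,
                "--train", train_path,
                "--test", test_path,
                # Assumed naming scheme: output/<test stem>_<tagger>.txt
                "--output", f"output/{stem}_{tagger}.txt",
            ])
    return commands


if __name__ == "__main__":
    for argv in build_commands():
        print(shlex.join(argv))
```

Each list can be passed to subprocess.run from the project folder if the runs should be launched from Python rather than typed by hand.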

References

https://docs.huihoo.com/nltk/0.9.5/api/nltk.tag.hmm.HiddenMarkovModelTrainer-class.html

https://tedboy.github.io/nlps/generated/generated/nltk.tag.HiddenMarkovModelTagger.html

https://www.kite.com/python/docs/nltk.HiddenMarkovModelTagger.train

https://gist.github.com/blumonkey/007955ec2f67119e0909

https://docs.huihoo.com/nltk/0.9.5/api/nltk.tag.brill-module.html

https://www.nltk.org/api/nltk.tag.brill_trainer.html

https://www.nltk.org/_modules/nltk/tag/brill.html

https://www.geeksforgeeks.org/nlp-brill-tagger/

https://www.nltk.org/howto/probability.html

Owner

Chirag Daryani
