Part of Speech Tagging using Hidden Markov Model (HMM) POS Tagger and Brill Tagger

Overview

This project tunes and compares two part-of-speech taggers: the Hidden Markov Model (HMM) POS tagger and the Brill POS tagger. Both taggers are trained on data from one domain, and their accuracy in predicting tag sequences is then measured on test data from the same domain and on test data from a different domain.
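
The accuracy comparison described above can be sketched as a token-level score: the fraction of tokens whose predicted tag matches the gold tag. This is a minimal illustration, not the project's own evaluation code. The function name `token_accuracy`, the list-of-(word, tag) sentence representation, and the example tags are assumptions made for the sketch.

```python
from typing import List, Tuple

# Assumed representation: a sentence is a list of (word, tag) pairs.
TaggedSentence = List[Tuple[str, str]]


def token_accuracy(gold: List[TaggedSentence],
                   predicted: List[TaggedSentence]) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of sentences")
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("sentence lengths differ")
        for (_, gold_tag), (_, pred_tag) in zip(gold_sent, pred_sent):
            total += 1
            if gold_tag == pred_tag:
                correct += 1
    # An empty corpus has no tokens to score; report 0.0 rather than divide by zero.
    return correct / total if total else 0.0


# Hypothetical toy data: one of five tags is predicted incorrectly.
gold = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
        [("cats", "NNS"), ("sleep", "VBP")]]
predicted = [[("the", "DT"), ("dog", "NN"), ("barks", "NNS")],
             [("cats", "NNS"), ("sleep", "VBP")]]
print(token_accuracy(gold, predicted))
```

Running the same function on the in-domain and out-of-domain predictions of each tagger would give four numbers that can be compared directly.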

How to Execute?

To run this project:

  1. Download the repository as a zip file.

  2. Extract the zip to get the project folder.

  3. Open Terminal in the directory you extracted the project folder to.

  4. Change directory to the project folder using:

    cd part-of-speech-taggers-main

  5. Install the required libraries, NLTK and scikit-learn, using the following commands:

    pip3 install nltk

    pip3 install -U scikit-learn

  6. To execute the code, run either of the following commands from the project folder:

  HMM Tagger Predictions:

    python3 src/main.py --tagger hmm --train data/train.txt --test data/test.txt --output output/test_hmm.txt

  Brill Tagger Predictions:

    python3 src/main.py --tagger brill --train data/train.txt --test data/test.txt --output output/test_brill.txt

Description of the execution command

Our program, src/main.py, takes four command-line options: --tagger selects the tagger type, --train gives the path to the training corpus, --test gives the path to the test corpus, and --output gives the path of the output file.

The two possible values for the --tagger option are:

  • hmm for the Hidden Markov Model POS Tagger

  • brill for the Brill POS Tagger
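
As an illustration of how these four options could be parsed, here is a minimal sketch using Python's standard `argparse` module. It is not the actual contents of src/main.py. The function name `build_parser`, the help strings, and the choice to make every option required are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four options described above (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Train and run a POS tagger.")
    # Restrict --tagger to the two documented values.
    parser.add_argument("--tagger", choices=["hmm", "brill"], required=True,
                        help="tagger type: hmm or brill")
    parser.add_argument("--train", required=True,
                        help="path to the training corpus")
    parser.add_argument("--test", required=True,
                        help="path to the test corpus")
    parser.add_argument("--output", required=True,
                        help="path of the output file")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args([
    "--tagger", "hmm",
    "--train", "data/train.txt",
    "--test", "data/test.txt",
    "--output", "output/test_hmm.txt",
])
print(args.tagger, args.train, args.test, args.output)
```

With `choices` set this way, any value other than hmm or brill is rejected with a usage error before any training starts.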

The training data can be found in data/train.txt, the in-domain test data can be found in data/test.txt, and the out-of-domain test data can be found in data/test_ood.txt.

The output file must be generated in the output/ directory.

With these paths, one possible execution command is:

    python3 src/main.py --tagger hmm --train data/train.txt --test data/test.txt --output output/test_hmm.txt

References

https://docs.huihoo.com/nltk/0.9.5/api/nltk.tag.hmm.HiddenMarkovModelTrainer-class.html

https://tedboy.github.io/nlps/generated/generated/nltk.tag.HiddenMarkovModelTagger.html

https://www.kite.com/python/docs/nltk.HiddenMarkovModelTagger.train

https://gist.github.com/blumonkey/007955ec2f67119e0909

https://docs.huihoo.com/nltk/0.9.5/api/nltk.tag.brill-module.html

https://www.nltk.org/api/nltk.tag.brill_trainer.html

https://www.nltk.org/_modules/nltk/tag/brill.html

https://www.geeksforgeeks.org/nlp-brill-tagger/

https://www.nltk.org/howto/probability.html

Owner
Chirag Daryani
Software Engineer | Data Science | Machine Learning | Python | Blog: https://chiragdaryani.medium.com/