This repository details the steps in creating a part-of-speech (POS) tagger using trigram hidden Markov models and the Viterbi algorithm, without using external libraries.

Last update: Dec 09, 2021

Overview

POS-Tagger

This repository details the creation of a Part-of-Speech tagger using Trigram Hidden Markov Models to predict word tags in a word sequence.

What is Part-of-Speech Tagging?

In corpus linguistics, part-of-speech tagging (POS tagging, PoS tagging, or POST), also known as "grammatical tagging," is the process of marking up each word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. Once performed by hand, POS tagging is now done in computational linguistics by algorithms that associate discrete terms, as well as "hidden" parts of speech, with a set of descriptive tags. POS-tagging algorithms fall into two broad categories: rule-based and stochastic. A rule-based tagger is cumbersome to build and is limited by what a computational linguist can anticipate about allowable sentence constructions, given how productive language is. I therefore take a stochastic approach and assign POS tags to the words in a sequence using trigram hidden Markov models.

What are Trigram Hidden Markov Models (HMMs)?

The hidden Markov model, or HMM for short, is a probabilistic sequence model that assigns a label to each unit in a sequence of observations (e.g., the words of an input sentence). The model computes a probability distribution over possible sequences of POS labels, with parameters estimated from a training corpus, and then chooses the label sequence that maximizes the probability of generating the observed sequence. HMMs are widely used in natural language processing because language consists of sequences at many levels: sentences, phrases, words, and even characters. A trigram HMM extends the basic model by conditioning each tag on the two tags that precede it, using a probability distribution over the tag trigrams in the given corpus, so the model takes the order of the sequence into account as well as the unobservable parts of speech. This allows the model to distinguish between homographs, words that share the same spelling but differ in meaning and part of speech (e.g., "rose" (NN) as in "rose bush" and "rose" (VBD) as the past tense of "rise").
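To make the description above concrete, here is a minimal sketch of a trigram HMM tagger decoded with the Viterbi algorithm, written with the Python standard library only. It is an illustration under stated assumptions, not this repository's actual code: the names `make_model` and `viterbi`, the four-sentence toy corpus, and the add-one smoothing are all hypothetical choices made for the example. Transition scores are log P(tag | two previous tags) and emission scores are log P(word | tag), both estimated from counts.

```python
import math
from collections import Counter

START, STOP = "*", "STOP"  # padding symbols (an assumption of this sketch)

def make_model(tagged_sentences):
    """Estimate log-probabilities from counts, with add-one smoothing (assumed)."""
    tri, bi, emit, tag_count = Counter(), Counter(), Counter(), Counter()
    vocab = set()
    for sent in tagged_sentences:
        seq = [START, START] + [t for _, t in sent] + [STOP]
        for i in range(2, len(seq)):
            tri[(seq[i - 2], seq[i - 1], seq[i])] += 1
            bi[(seq[i - 2], seq[i - 1])] += 1
        for word, tag in sent:
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
    tags = sorted(tag_count)
    n_next = len(tags) + 1  # every real tag plus STOP can follow a tag pair

    def q(u, v, s):
        """log P(s | u, v): transition given the two previous tags."""
        return math.log((tri[(u, v, s)] + 1) / (bi[(u, v)] + n_next))

    def e(word, s):
        """log P(word | s): emission, with one extra slot for unseen words."""
        return math.log((emit[(s, word)] + 1) / (tag_count[s] + len(vocab) + 1))

    return tags, q, e

def viterbi(words, tags, q, e):
    """Return the most probable tag sequence for `words`."""
    n = len(words)

    def allowed(k):  # tags permitted at position k (positions <= 0 are padding)
        return [START] if k <= 0 else tags

    # pi[(k, u, v)]: best log-probability of any tag sequence for the first k
    # words that ends with tag u at position k-1 and tag v at position k.
    pi = {(0, START, START): 0.0}
    back = {}
    for k in range(1, n + 1):
        for u in allowed(k - 1):
            for v in allowed(k):
                score, prev = max(
                    (pi[(k - 1, w, u)] + q(w, u, v) + e(words[k - 1], v), w)
                    for w in allowed(k - 2)
                )
                pi[(k, u, v)] = score
                back[(k, u, v)] = prev
    # Pick the best final tag pair, including the transition to STOP.
    _, u, v = max(
        (pi[(n, u, v)] + q(u, v, STOP), u, v)
        for u in allowed(n - 1)
        for v in allowed(n)
    )
    path = [None] * (n + 1)  # path[k] is the tag at position k; index 0 is padding
    path[n - 1], path[n] = u, v
    for k in range(n - 2, 0, -1):  # follow back-pointers to recover earlier tags
        path[k] = back[(k + 2, path[k + 1], path[k + 2])]
    return path[1:]

# Toy training data, invented for this example.
corpus = [
    [("the", "DT"), ("rose", "NN"), ("wilted", "VBD")],
    [("the", "DT"), ("sun", "NN"), ("rose", "VBD")],
    [("a", "DT"), ("rose", "NN"), ("bloomed", "VBD")],
    [("the", "DT"), ("price", "NN"), ("rose", "VBD")],
]
tags, q, e = make_model(corpus)
print(viterbi(["the", "rose", "bloomed"], tags, q, e))  # ['DT', 'NN', 'VBD']
print(viterbi(["the", "sun", "rose"], tags, q, e))      # ['DT', 'NN', 'VBD']
```

In this toy run "rose" is tagged NN in the first sentence and VBD in the second, even though its emission scores under the two tags are equal, because the tag-trigram transitions favor DT NN VBD. Working in log space avoids numerical underflow on longer sentences; a real tagger would also need a more careful treatment of unknown words than add-one smoothing.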

Owner

Raihan Ahmed
