CoNLL-English NER Task

en | ch

Motivation

Course Project
Review the PyTorch framework and the sequence-labeling task
Practice using the Hugging Face transformers library

Dataset Introduction

The data directory contains a training set, a validation set, and a test set, all in the CoNLL column format:

-DOCSTART- -X- O O
-token- -pos- -chunk- -entity-
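
A minimal sketch of how files in this layout could be read, assuming whitespace-separated columns, blank lines between sentences, and `-DOCSTART-` lines marking document boundaries. The function name `read_conll` and the sample lines are hypothetical illustrations, not the interface of `dataTool.py`.

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, entity tag) pairs


def read_conll(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences from CoNLL-style lines (token, POS, chunk, entity)."""
    sentence: Sentence = []
    for raw in lines:
        line = raw.strip()
        # A blank line or a -DOCSTART- marker closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        # First column is the token, last column is the entity tag.
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


# Illustrative sample only; the tags follow the BIO scheme.
sample = """-DOCSTART- -X- O O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = list(read_conll(sample))
```

The POS and chunk columns are dropped here because only the token and the entity tag are needed for NER training.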

Project Structure

-data  # source data
-emb # BERT model files

-util
    -dataTool.py  # data interface
    -model.py
    -trainer.py  # train and evaluate

config.py  # parameters in the project
run.py
requirement.txt

EDA.ipynb # exploratory data analysis,
          # used to choose the hyperparameters for the trials

Coding Pattern

To keep experiments simple and convenient,
the model is decoupled into two units: an encoder and a tagger.

model ==> encoder + tagger

The encoder extracts contextual and linguistic features,
which the tagger receives and maps to BIO tags.
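
The decoupling can be sketched in plain Python as two interchangeable parts behind small interfaces. Everything below is a toy illustration under assumptions: the names `Encoder`, `Tagger`, `Model`, `CapitalizationEncoder`, and `ThresholdTagger` are hypothetical and are not taken from `model.py`, where the encoder would presumably be BERT (optionally followed by a BiLSTM) and the tagger a CRF or softmax layer.

```python
from typing import List, Protocol, Sequence

Features = List[List[float]]  # one feature vector per token


class Encoder(Protocol):
    def encode(self, tokens: Sequence[str]) -> Features: ...


class Tagger(Protocol):
    def tag(self, features: Features) -> List[str]: ...


class CapitalizationEncoder:
    """Toy encoder: one feature per token, 1.0 if it starts uppercase."""

    def encode(self, tokens: Sequence[str]) -> Features:
        return [[1.0 if tok[:1].isupper() else 0.0] for tok in tokens]


class ThresholdTagger:
    """Toy tagger: capitalized runs become B-ENT / I-ENT, the rest O."""

    def tag(self, features: Features) -> List[str]:
        tags: List[str] = []
        inside = False
        for vec in features:
            if vec[0] > 0.5:
                tags.append("I-ENT" if inside else "B-ENT")
                inside = True
            else:
                tags.append("O")
                inside = False
        return tags


class Model:
    """model ==> encoder + tagger; either part can be swapped alone."""

    def __init__(self, encoder: Encoder, tagger: Tagger) -> None:
        self.encoder = encoder
        self.tagger = tagger

    def predict(self, tokens: Sequence[str]) -> List[str]:
        return self.tagger.tag(self.encoder.encode(tokens))


model = Model(CapitalizationEncoder(), ThresholdTagger())
tags = model.predict(["Peter", "Blackburn", "visited", "Brussels"])
```

Because `Model` only depends on the two interfaces, comparing a CRF tagger with a softmax tagger means replacing one constructor argument and leaving the encoder untouched.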

Usage

chmod 755 deploy
./deploy

./gpu n  # monitor the GPU (refresh every n seconds)
./run  # start

Baseline Performance (1 epoch, macro-averaged)

Model	Precision	Recall	F1
Bert-CRF	0.71	0.68	0.69
Bert-softmax	-	-	-
Bert-BiLSTM-CRF	-	-	-
Bert-BiLSTM-softmax	-	-	-
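
For reference, macro averaging computes precision, recall, and F1 per class and then takes the unweighted mean, so rare entity classes count as much as frequent ones. The sketch below assumes token-level scoring and averages the per-class F1 values; whether `trainer.py` scores tokens or whole entity spans is not stated here, and `macro_prf` is a hypothetical helper name.

```python
from collections import Counter
from typing import Sequence, Tuple


def macro_prf(gold: Sequence[str], pred: Sequence[str]) -> Tuple[float, float, float]:
    """Token-level macro precision, recall, and F1 over all seen labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0, 0.0, 0.0
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


gold = ["B-PER", "I-PER", "O", "O"]
pred = ["B-PER", "O", "O", "O"]
precision, recall, f1 = macro_prf(gold, pred)
```

In this small example the missed `I-PER` token pulls all three macro scores down sharply even though three of the four tokens are correct, which is the behaviour that makes macro scores sensitive to rare classes.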

Optimization

Cost-sensitive learning, or dropping the rare classes
Dropout to improve generalization
Different backbone structures
DDP training --> more total GPU memory for a larger batch_size
More epochs --> schedule the learning rate dynamically during training
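
As one possible starting point for the cost-sensitive learning item, the sketch below derives per-class weights from inverse tag frequency. The helper name `inverse_frequency_weights` and the toy tag counts are assumptions for illustration; the project does not prescribe this particular weighting.

```python
from collections import Counter
from typing import Dict, Iterable


def inverse_frequency_weights(tags: Iterable[str]) -> Dict[str, float]:
    """Weight each class by total / (n_classes * count).

    A perfectly balanced tag set gives every class a weight of 1.0;
    rare classes get weights above 1.0 and frequent ones below.
    """
    counts = Counter(tags)
    total = sum(counts.values())
    n_classes = len(counts)
    return {tag: total / (n_classes * count) for tag, count in counts.items()}


# Toy distribution: the O tag dominates, as is typical for NER data.
train_tags = ["O"] * 8 + ["B-PER"] * 1 + ["I-PER"] * 1
weights = inverse_frequency_weights(train_tags)
```

For a softmax tagger, weights like these could be passed, in label-index order, as the `weight` tensor of `torch.nn.CrossEntropyLoss`; a CRF tagger would need a different mechanism.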

Owner

Kevin
