CoNLL-English NER Task

Motivation

Course project
Review the PyTorch framework and the sequence-labeling task
Practice using Hugging Face Transformers

Dataset Introduction

The data consists of a training set, a validation set and a test set.
Each line holds one token followed by its annotations:

-DOCSTART- -X- O O
-token- -POS- -chunk- -entity-
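
To make the column layout concrete, here is a minimal sketch of a reader for this format. It assumes whitespace-separated columns with the token first and the entity tag last, and blank lines between sentences. The function name `read_conll` and the sample lines are illustrative and are not taken from `dataTool.py`.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, entity tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences of (token, entity tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # A blank line or a -DOCSTART- marker closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        # Assumption: first column is the token, last column is the entity tag.
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample in the four-column layout described above.
sample = """-DOCSTART- -X- O O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = read_conll(sample)
print(sentences[1])
```

The POS and chunk columns are dropped here because only the entity tag is needed for NER; a real loader may want to keep them.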

Project Structure

-data  # source data
-emb # BERT model files

-util
    -dataTool.py  # data interface
    -model.py
    -trainer.py  # train and evaluate

config.py  # parameters in the project
run.py
requirement.txt

EDA.ipynb # exploratory data analysis,
          # used to settle the hyperparameters for the experiments

Coding Pattern

To keep experiments simple and convenient,
the model is decoupled into two units: an encoder and a tagger.

model ==> encoder + tagger

This way, the encoder extracts contextual and linguistic features,
which the tagger receives and turns into BIO tags.
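
The split can be sketched in plain Python as two small interfaces and a wrapper that composes them. This is only an illustration of the pattern, not the code in `model.py`: `ToyEncoder` and `ToyTagger` are hypothetical stand-ins for a BERT encoder and a softmax or CRF head, and all class and method names are assumptions.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence

Features = List[List[float]]  # one feature vector per token


class Encoder(Protocol):
    def encode(self, tokens: Sequence[str]) -> Features: ...


class Tagger(Protocol):
    def tag(self, features: Features) -> List[str]: ...


class ToyEncoder:
    """Stand-in for BERT: a single feature per token (1.0 if capitalized)."""

    def encode(self, tokens: Sequence[str]) -> Features:
        return [[1.0 if tok[:1].isupper() else 0.0] for tok in tokens]


class ToyTagger:
    """Stand-in for a softmax/CRF head: maps features to BIO tags."""

    def tag(self, features: Features) -> List[str]:
        tags: List[str] = []
        prev_entity = False
        for vec in features:
            is_entity = vec[0] > 0.5
            if not is_entity:
                tags.append("O")
            elif prev_entity:
                tags.append("I-ENT")  # continues the previous entity
            else:
                tags.append("B-ENT")  # begins a new entity
            prev_entity = is_entity
        return tags


@dataclass
class SequenceLabeler:
    """model ==> encoder + tagger"""

    encoder: Encoder
    tagger: Tagger

    def predict(self, tokens: Sequence[str]) -> List[str]:
        return self.tagger.tag(self.encoder.encode(tokens))


model = SequenceLabeler(ToyEncoder(), ToyTagger())
print(model.predict(["Peter", "Blackburn", "visited", "Brussels"]))
# → ['B-ENT', 'I-ENT', 'O', 'B-ENT']
```

Because the wrapper only depends on the two interfaces, swapping a softmax tagger for a CRF tagger, or inserting a BiLSTM between them, leaves the encoder untouched.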

Usage

chmod 755 deploy
./deploy

./gpu n  # monitor the GPU (refresh every n seconds)
./run  # start

Baseline Performance (1 epoch, macro-averaged)

Model	Precision	Recall	F1
Bert-CRF	0.71	0.68	0.69
Bert-softmax	-	-	-
Bert-BiLSTM-CRF	-	-	-
Bert-BiLSTM-softmax	-	-	-
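
For reference, macro averaging computes precision, recall and F1 per class and then takes the unweighted mean over classes. The sketch below does this at the token level over every label, including `O`. That is an assumption: the project may score at the entity level or exclude `O`, and `macro_prf` is an illustrative name, not a function from `trainer.py`.

```python
from collections import Counter
from typing import List, Tuple


def macro_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Token-level macro-averaged precision, recall and F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed the gold label g

    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    # Each class counts equally, however rare it is.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


gold = ["B-PER", "I-PER", "O", "O", "B-ORG"]
pred = ["B-PER", "O", "O", "O", "B-ORG"]
precision, recall, f1 = macro_prf(gold, pred)
print(f"{precision:.2f} {recall:.2f} {f1:.2f}")  # → 0.67 0.75 0.70
```

Since every class has the same weight, a rare class that is never predicted correctly (here `I-PER`) pulls the macro scores down sharply.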

Optimization

Cost-sensitive learning, or dropping the rare classes
Dropout to improve generalization
Different backbone architectures
DDP training --> more total GPU memory, allowing a larger batch_size
More epochs --> schedule the learning rate dynamically during training
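
As one possible starting point for the cost-sensitive learning idea, the sketch below derives per-class weights from inverse tag frequency, so that rare tags count more in the loss. The weighting formula is a common heuristic chosen for illustration, not something the project prescribes, and `inverse_frequency_weights` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable


def inverse_frequency_weights(tags: Iterable[str]) -> Dict[str, float]:
    """Weight each tag by total / (n_classes * count), so rare tags weigh more."""
    counts = Counter(tags)
    total = sum(counts.values())
    n_classes = len(counts)
    return {tag: total / (n_classes * count) for tag, count in counts.items()}


# Hypothetical tag distribution: "O" dominates, entity tags are rare.
train_tags = ["O"] * 8 + ["B-PER"] + ["I-PER"]
weights = inverse_frequency_weights(train_tags)
for tag, weight in sorted(weights.items()):
    print(f"{tag}: {weight:.2f}")
```

With a softmax tagger, such weights could be passed to a weighted loss, for example the `weight` argument of `torch.nn.CrossEntropyLoss`, ordered by label index. A CRF tagger would need a different mechanism.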

Owner

Kevin
