TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

Last update: Dec 20, 2022

Overview

TweebankNLP

This repo contains the new Tweebank-NER dataset and Twitter-Stanza pipeline for state-of-the-art Tweet NLP. Tweebank-NER V1.0 is the annotated NER dataset based on Tweebank V2, the main UD treebank for English Twitter NLP tasks. The Twitter-Stanza pipeline provides pre-trained Tweet NLP models (NER, tokenization, lemmatization, POS tagging, dependency parsing) with state-of-the-art or competitive performance. The models are fully compatible with Stanza and provide both Python and command-line interfaces for users.

Installation

# please install from the source
pip install -e .

# download glove and pre-trained models
sh download_twitter_resources.sh

Python Interface for Twitter-Stanza

import stanza

# config for the `en_tweet` pipeline (trained only on Tweebank)
config = {
          'processors': 'tokenize,lemma,pos,depparse,ner',
          'lang': 'en',
          'tokenize_pretokenized': True, # disable tokenization
          'tokenize_model_path': './saved_models/tokenize/en_tweet_tokenizer.pt',
          'lemma_model_path': './saved_models/lemma/en_tweet_lemmatizer.pt',
          "pos_model_path": './saved_models/pos/en_tweet_tagger.pt',
          "depparse_model_path": './saved_models/depparse/en_tweet_parser.pt',
          "ner_model_path": './saved_models/ner/en_tweet_nertagger.pt'
}

# Initialize the pipeline using a configuration dict
nlp = stanza.Pipeline(**config)
doc = nlp("Oh ikr like Messi better than Ronaldo but we all like Ronaldo more")
print(doc) # Look at the result

Running Twitter-Stanza (Command Line Interface)

NER

We provide two pre-trained Stanza NER models:

en_tweenut17: trained on TB2+WNUT17
en_tweet: trained on TB2

source twitter-stanza/scripts/config.sh

python stanza/utils/training/run_ner.py en_tweenut17 \
--mode predict \
--score_test \
--wordvec_file ../data/wordvec/English/en.twitter100d.xz \
--eval_file data/ner/en_tweet.test.json

Syntactic NLP Models

We provide two pre-trained models for the following NLP tasks:

tweet_ewt: trained on TB2+UD-English-EWT
en_tweet: trained on TB2

1. Tokenization

python stanza/utils/training/run_tokenizer.py tweet_ewt \
--mode predict \
--score_test \
--txt_file data/tokenize/en_tweet.test.txt \
--label_file  data/tokenize/en_tweet-ud-test.toklabels \
--no_use_mwt

2. Lemmatization

python stanza/utils/training/run_lemma.py tweet_ewt \
--mode predict \
--score_test \
--gold_file data/depparse/en_tweet.test.gold.conllu \
--eval_file data/depparse/en_tweet.test.in.conllu

3. POS Tagging

python stanza/utils/training/run_pos.py tweet_ewt \
--mode predict \
--score_test \
--eval_file data/pos/en_tweet.test.in.conllu \
--gold_file data/depparse/en_tweet.test.gold.conllu

4. Dependency Parsing

python stanza/utils/training/run_depparse.py tweet_ewt \
--mode predict \
--score_test \
--wordvec_file ../data/wordvec/English/en.twitter100d.txt \
--eval_file data/depparse/en_tweet.test.in.conllu \
--gold_file data/depparse/en_tweet.test.gold.conllu

Training Twitter-Stanza

Please refer to the TRAIN_README.md for training the Twitter-Stanza neural pipeline.

References

If you use this repository in your research, please kindly cite our paper as well as the Stanza papers.

@article{jiang2022tweebank,
    title={Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis},
    author={Jiang, Hang and Hua, Yining and Beeferman, Doug and Roy, Deb},
    publisher={arXiv},
    year={2022}
}

Acknowledgement

The Twitter-Stanza pipeline is a friendly fork from the Stanza libaray with a few modifications to adapt to tweets. The repository is fully compatible with Stanza. This research project is funded by MIT Center for Constructive Communication (CCC).

TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

Related tags

Overview

TweebankNLP

Installation

Python Interface for Twitter-Stanza

Running Twitter-Stanza (Command Line Interface)

NER

Syntactic NLP Models

1. Tokenization

2. Lemmatization

3. POS Tagging

4. Dependency Parsing

Training Twitter-Stanza

References

Acknowledgement

Owner

Laboratory for Social Machines

SAVI2I: Continuous and Diverse Image-to-Image Translation via Signed Attribute Vectors

BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese

TaCL: Improve BERT Pre-training with Token-aware Contrastive Learning

A PyTorch implementation of VIOLET

Code for the Findings of NAACL 2022(Long Paper): AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks

2021海华AI挑战赛·中文阅读理解·技术组·第三名

Code for evaluating Japanese pretrained models provided by NTT Ltd.

Transformers-regression - Regression Bugs Are In Your Model! Measuring, Reducing and Analyzing Regressions In NLP Model Updates

Implementation of some unbalanced loss like focal_loss, dice_loss, DSC Loss, GHM Loss et.al

GPT-3 command line interaction

wxPython app for converting encodings, modifying and fixing SRT files

BERT Attention Analysis

Facilitating the design, comparison and sharing of deep text matching models.

Continuously update some NLP practice based on different tasks.

Long text token classification using LongFormer

Extract city and country mentions from Text like GeoText without regex, but FlashText, a Aho-Corasick implementation.

SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

Natural language Understanding Toolkit

Generate text line images for training deep learning OCR model (e.g. CRNN)

تولید اسم های رندوم فینگیلیش