TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

Last update: Dec 20, 2022

Overview

TweebankNLP

This repo contains the new Tweebank-NER dataset and Twitter-Stanza pipeline for state-of-the-art Tweet NLP. Tweebank-NER V1.0 is the annotated NER dataset based on Tweebank V2, the main UD treebank for English Twitter NLP tasks. The Twitter-Stanza pipeline provides pre-trained Tweet NLP models (NER, tokenization, lemmatization, POS tagging, dependency parsing) with state-of-the-art or competitive performance. The models are fully compatible with Stanza and provide both Python and command-line interfaces for users.

Installation

# please install from the source
pip install -e .

# download glove and pre-trained models
sh download_twitter_resources.sh

Python Interface for Twitter-Stanza

import stanza

# config for the `en_tweet` pipeline (trained only on Tweebank)
config = {
          'processors': 'tokenize,lemma,pos,depparse,ner',
          'lang': 'en',
          'tokenize_pretokenized': True, # disable tokenization
          'tokenize_model_path': './saved_models/tokenize/en_tweet_tokenizer.pt',
          'lemma_model_path': './saved_models/lemma/en_tweet_lemmatizer.pt',
          "pos_model_path": './saved_models/pos/en_tweet_tagger.pt',
          "depparse_model_path": './saved_models/depparse/en_tweet_parser.pt',
          "ner_model_path": './saved_models/ner/en_tweet_nertagger.pt'
}

# Initialize the pipeline using a configuration dict
nlp = stanza.Pipeline(**config)
doc = nlp("Oh ikr like Messi better than Ronaldo but we all like Ronaldo more")
print(doc) # Look at the result

Running Twitter-Stanza (Command Line Interface)

NER

We provide two pre-trained Stanza NER models:

en_tweenut17: trained on TB2+WNUT17
en_tweet: trained on TB2

source twitter-stanza/scripts/config.sh

python stanza/utils/training/run_ner.py en_tweenut17 \
--mode predict \
--score_test \
--wordvec_file ../data/wordvec/English/en.twitter100d.xz \
--eval_file data/ner/en_tweet.test.json

Syntactic NLP Models

We provide two pre-trained models for the following NLP tasks:

tweet_ewt: trained on TB2+UD-English-EWT
en_tweet: trained on TB2

1. Tokenization

python stanza/utils/training/run_tokenizer.py tweet_ewt \
--mode predict \
--score_test \
--txt_file data/tokenize/en_tweet.test.txt \
--label_file  data/tokenize/en_tweet-ud-test.toklabels \
--no_use_mwt

2. Lemmatization

python stanza/utils/training/run_lemma.py tweet_ewt \
--mode predict \
--score_test \
--gold_file data/depparse/en_tweet.test.gold.conllu \
--eval_file data/depparse/en_tweet.test.in.conllu

3. POS Tagging

python stanza/utils/training/run_pos.py tweet_ewt \
--mode predict \
--score_test \
--eval_file data/pos/en_tweet.test.in.conllu \
--gold_file data/depparse/en_tweet.test.gold.conllu

4. Dependency Parsing

python stanza/utils/training/run_depparse.py tweet_ewt \
--mode predict \
--score_test \
--wordvec_file ../data/wordvec/English/en.twitter100d.txt \
--eval_file data/depparse/en_tweet.test.in.conllu \
--gold_file data/depparse/en_tweet.test.gold.conllu

Training Twitter-Stanza

Please refer to the TRAIN_README.md for training the Twitter-Stanza neural pipeline.

References

If you use this repository in your research, please kindly cite our paper as well as the Stanza papers.

@article{jiang2022tweebank,
    title={Annotating the Tweebank Corpus on Named Entity Recognition and Building NLP Models for Social Media Analysis},
    author={Jiang, Hang and Hua, Yining and Beeferman, Doug and Roy, Deb},
    publisher={arXiv},
    year={2022}
}

Acknowledgement

The Twitter-Stanza pipeline is a friendly fork from the Stanza libaray with a few modifications to adapt to tweets. The repository is fully compatible with Stanza. This research project is funded by MIT Center for Constructive Communication (CCC).

TweebankNLP - Pre-trained Tweet NLP Pipeline (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Models + Tweebank-NER

Related tags

Overview

TweebankNLP

Installation

Python Interface for Twitter-Stanza

Running Twitter-Stanza (Command Line Interface)

NER

Syntactic NLP Models

1. Tokenization

2. Lemmatization

3. POS Tagging

4. Dependency Parsing

Training Twitter-Stanza

References

Acknowledgement

Owner

Laboratory for Social Machines

Code for our paper "Mask-Align: Self-Supervised Neural Word Alignment" in ACL 2021

Experiments in converting wikidata to ftm

Research Code for NeurIPS 2020 Spotlight paper "Large-Scale Adversarial Training for Vision-and-Language Representation Learning": UNITER adversarial training part

Sentello is python script that simulates the anti-evasion and anti-analysis techniques used by malware.

Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization (ACL 2021)

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

Pretrained language model and its related optimization techniques developed by Huawei Noah's Ark Lab.

⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x using fastT5.

Code for the paper PermuteFormer

End-to-end text to speech system using gruut and onnx. There are 40 voices available across 8 languages.

Python library for parsing resumes using natural language processing and machine learning

👑 spaCy building blocks and visualizers for Streamlit apps

NLP techniques such as named entity recognition, sentiment analysis, topic modeling, text classification with Python to predict sentiment and rating of drug from user reviews.

Practical Machine Learning with Python

Code Implementation of "Learning Span-Level Interactions for Aspect Sentiment Triplet Extraction".

Tool to check whether a GCP bucket is public or not.

TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP

CrossNER: Evaluating Cross-Domain Named Entity Recognition (AAAI-2021)

Exploration of BERT-based models on twitter sentiment classifications

Implementaion of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation