pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks

Last update: Dec 29, 2022

Overview

pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks

A Transformer-based library for SocialNLP classification tasks.

Currently supports:

Sentiment Analysis (Spanish, English)
Emotion Analysis (Spanish, English)

Just do pip install pysentimiento and start using it:

from pysentimiento import SentimentAnalyzer
analyzer = SentimentAnalyzer(lang="es")

analyzer.predict("Qué gran jugador es Messi")
# returns SentimentOutput(output=POS, probas={POS: 0.998, NEG: 0.002, NEU: 0.000})
analyzer.predict("Esto es pésimo")
# returns SentimentOutput(output=NEG, probas={NEG: 0.999, POS: 0.001, NEU: 0.000})
analyzer.predict("Qué es esto?")
# returns SentimentOutput(output=NEU, probas={NEU: 0.993, NEG: 0.005, POS: 0.002})

analyzer.predict("jejeje no te creo mucho")
# SentimentOutput(output=NEG, probas={NEG: 0.587, NEU: 0.408, POS: 0.005})
"""
Emotion Analysis in English
"""

emotion_analyzer = EmotionAnalyzer(lang="en")

emotion_analyzer.predict("yayyy")
# returns EmotionOutput(output=joy, probas={joy: 0.723, others: 0.198, surprise: 0.038, disgust: 0.011, sadness: 0.011, fear: 0.010, anger: 0.009})
emotion_analyzer.predict("fuck off")
# returns EmotionOutput(output=anger, probas={anger: 0.798, surprise: 0.055, fear: 0.040, disgust: 0.036, joy: 0.028, others: 0.023, sadness: 0.019})

Also, you might use pretrained models directly with transformers library.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("finiteautomata/beto-sentiment-analysis")

model = AutoModelForSequenceClassification.from_pretrained("finiteautomata/beto-sentiment-analysis")

Preprocessing

pysentimiento features a tweet preprocessor specially suited for tweet classification with transformer-based models.

from pysentimiento.preprocessing import preprocess_tweet

# Replaces user handles and URLs by special tokens
preprocess_tweet("@perezjotaeme debería cambiar esto http://bit.ly/sarasa") # "@usuario debería cambiar esto url"

# Shortens repeated characters
preprocess_tweet("no entiendo naaaaaaaadaaaaaaaa", shorten=2) # "no entiendo naadaa"

# Normalizes laughters
preprocess_tweet("jajajajaajjajaajajaja no lo puedo creer ajajaj") # "jaja no lo puedo creer jaja"

# Handles hashtags
preprocess_tweet("esto es #UnaGenialidad")
# "esto es una genialidad"

# Handles emojis
preprocess_tweet("🎉🎉", lang="en")
# 'emoji party popper emoji emoji party popper emoji'

Trained models so far

Check CLASSIFIERS.md for details on the reported performances of each model.

Spanish models

English models

Instructions for developers

First, download TASS 2020 data to data/tass2020 (you have to register here to download the dataset)

Labels must be placed under data/tass2020/test1.1/labels

Run script to train models

Check TRAIN_EVALUATE.md

Upload models to Huggingface's Model Hub

Check "Model sharing and upload" instructions in huggingface docs.

License

pysentimiento is an open-source library. However, please be aware that models are trained with third-party datasets and are subject to their respective licenses, many of which are for non-commercial use

TASS Dataset license (License for Sentiment Analysis in Spanish, Emotion Analysis in Spanish & English)
SEMEval 2017 Dataset license (Sentiment Analysis in English)

Citation

If you use pysentimiento in your work, please cite this paper

@misc{perez2021pysentimiento,
      title={pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks},
      author={Juan Manuel Pérez and Juan Carlos Giudici and Franco Luque},
      year={2021},
      eprint={2106.09462},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

TODO:

Upload some other models
Train in other languages

Suggestions and bugfixes

Please use the repository issue tracker to point out bugs and make suggestions (new models, use another datasets, some other languages, etc)

pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks

Related tags

Overview

pysentimiento: A Python toolkit for Sentiment Analysis and Social NLP tasks

Preprocessing

Trained models so far

Spanish models

English models

Instructions for developers

License

Citation

TODO:

Suggestions and bugfixes

Owner

Optimal Transport Tools (OTT), A toolbox for all things Wasserstein.

Code for the paper in Findings of EMNLP 2021: "EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation".

Machine learning models from Singapore's NLP research community

Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration

Anuvada: Interpretable Models for NLP using PyTorch

Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS)

Meta learning algorithms to train cross-lingual NLI (multi-task) models

Ray-based parallel data preprocessing for NLP and ML.

Awesome-NLP-Research (ANLP)

Speech Recognition for Uyghur using Speech transformer

Python implementation of TextRank for phrase extraction and summarization of text documents

Blender addon - Scrub timeline from viewport with a shortcut

2021海华AI挑战赛·中文阅读理解·技术组·第三名

CPC-big and k-means clustering for zero-resource speech processing

Pipeline for training LSA models using Scikit-Learn.

Predict an emoji that is associated with a text

Repository for fine-tuning Transformers 🤗 based seq2seq speech models in JAX/Flax.

A versatile token stream for handwritten parsers.

Conversational-AI-ChatBot - Intelligent ChatBot built with Microsoft's DialoGPT transformer to make conversations with human users!

The SVO-Probes Dataset for Verb Understanding