A pre-trained language model for social media text in Spanish

Last update: Dec 29, 2022

RoBERTuito

RoBERTuito is a pre-trained language model for user-generated content in Spanish, trained following RoBERTa guidelines on 500 million tweets. RoBERTuito comes in 3 flavors: cased, uncased, and uncased+deaccented.
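To make the difference between the three flavors concrete, here is a minimal sketch of what "uncased" and "uncased+deaccented" text normalization typically mean. The helper names `to_uncased` and `to_deaccented` are hypothetical, and this is an illustration of the general idea rather than the normalization actually used to train the models.

```python
import unicodedata


def to_uncased(text: str) -> str:
    """Lowercase the text (what an 'uncased' model typically expects)."""
    return text.lower()


def to_deaccented(text: str) -> str:
    """Lowercase the text and strip combining accent marks.

    Assumption: accents are removed via Unicode decomposition. Note that
    this approach also turns 'ñ' into 'n'; the real models may treat such
    characters differently.
    """
    lowered = text.lower()
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop combining marks (Unicode category "Mn"), keep everything else.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


sample = "¡Qué RÁPIDO pasó todo!"
print(to_uncased(sample))     # cased input, lowercased
print(to_deaccented(sample))  # lowercased and without accent marks
```

A cased model would receive `sample` unchanged.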

We tested RoBERTuito on a benchmark of tasks involving user-generated text in Spanish. On this benchmark it outperforms other pre-trained language models for Spanish, such as BETO, BERTin and RoBERTa-BNE. The 4 tasks selected for evaluation were: Hate Speech Detection (using the SemEval 2019 Task 5 HatEval dataset), Sentiment and Emotion Analysis (using the TASS 2020 datasets), and Irony Detection (using the IroSvA 2019 dataset).

model	hate speech	sentiment analysis	emotion analysis	irony detection	score
robertuito-uncased	0.801 ± 0.010	0.707 ± 0.004	0.551 ± 0.011	0.736 ± 0.008	0.699
robertuito-deacc	0.798 ± 0.008	0.702 ± 0.004	0.543 ± 0.015	0.740 ± 0.006	0.696
robertuito-cased	0.790 ± 0.012	0.701 ± 0.012	0.519 ± 0.032	0.719 ± 0.023	0.682
roberta-bne	0.766 ± 0.015	0.669 ± 0.006	0.533 ± 0.011	0.723 ± 0.017	0.673
bertin	0.767 ± 0.005	0.665 ± 0.003	0.518 ± 0.012	0.716 ± 0.008	0.667
beto-cased	0.768 ± 0.012	0.665 ± 0.004	0.521 ± 0.012	0.706 ± 0.007	0.665
beto-uncased	0.757 ± 0.012	0.649 ± 0.005	0.521 ± 0.006	0.702 ± 0.008	0.657
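The score column is not defined in the text, but the numbers are consistent with an unweighted mean of the four per-task values. The sketch below checks that reading against the table; treating the score as a plain average is an inference from the numbers, not something stated here.

```python
# Per-task means copied from the table above, in the order:
# hate speech, sentiment analysis, emotion analysis, irony detection.
# The second element of each tuple is the reported "score".
results = {
    "robertuito-uncased": ([0.801, 0.707, 0.551, 0.736], 0.699),
    "robertuito-deacc": ([0.798, 0.702, 0.543, 0.740], 0.696),
    "robertuito-cased": ([0.790, 0.701, 0.519, 0.719], 0.682),
    "roberta-bne": ([0.766, 0.669, 0.533, 0.723], 0.673),
    "bertin": ([0.767, 0.665, 0.518, 0.716], 0.667),
    "beto-cased": ([0.768, 0.665, 0.521, 0.706], 0.665),
    "beto-uncased": ([0.757, 0.649, 0.521, 0.702], 0.657),
}


def mean_score(task_scores):
    """Unweighted average over the task scores."""
    return sum(task_scores) / len(task_scores)


for model, (task_scores, reported) in results.items():
    print(f"{model}: mean={mean_score(task_scores):.4f} reported={reported:.3f}")
```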

We release the pre-trained models on the Hugging Face model hub.

Usage

IMPORTANT -- READ THIS FIRST

RoBERTuito is not yet fully integrated into huggingface/transformers. To use it, first install pysentimiento:

pip install pysentimiento

and preprocess text using pysentimiento.preprocessing.preprocess_tweet before feeding it into the tokenizer:

from transformers import AutoTokenizer
from pysentimiento.preprocessing import preprocess_tweet

tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-base-cased')

text = "Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣"
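# Normalize user handles, hashtags and emojis before tokenizing
preprocessed_text = preprocess_tweet(text)

tokenizer.tokenize(preprocessed_text)
# Example output (may vary with the installed pysentimiento version):
# ['','▁Esto','▁es','▁un','▁tweet','▁estoy','▁usando','▁','▁hashtag','▁','▁ro','bert','uito','▁@usuario','▁','▁emoji','▁cara','▁revolviéndose','▁de','▁la','▁risa','▁emoji','']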

We are working on integrating this preprocessing step into a Tokenizer within the transformers library.
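To give a feel for the kind of normalization this preprocessing step performs, here is a simplified, dependency-free sketch. It is not the pysentimiento implementation: the function `simple_preprocess`, the `url` placeholder and the repeat limit of 3 are assumptions made for illustration, and hashtag and emoji handling are left out. Only the `@usuario` token is taken from the example output above.

```python
import re

USER_TOKEN = "@usuario"  # token visible in the example output above
URL_TOKEN = "url"        # assumption: placeholder chosen for this sketch


def simple_preprocess(text: str, max_repeats: int = 3) -> str:
    """Rough approximation of tweet normalization (illustrative only)."""
    # Replace links with a generic placeholder.
    text = re.sub(r"https?://\S+", URL_TOKEN, text)
    # Replace user handles with a single generic token.
    text = re.sub(r"@\w+", USER_TOKEN, text)
    # Collapse runs of one character longer than max_repeats.
    text = re.sub(
        r"(.)\1{%d,}" % max_repeats,
        lambda m: m.group(1) * max_repeats,
        text,
    )
    return text


print(simple_preprocess("@pysentimiento mirá https://example.com holaaaaaa"))
```

For real use, rely on `preprocess_tweet` so that the text matches what the model saw during pre-training.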

Development

Installing

We use python==3.7 and poetry to manage dependencies.

pip install poetry
poetry install

Benchmarking

To run the benchmarks:

python bin/run_benchmark.py <model_name> --times 5 --output_path <output_path>

Check RUN_BENCHMARKS for the full list of experiments.
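The `--times 5` flag repeats the evaluation several times, and the results table reports each value as a number followed by `±`. A minimal sketch of how repeated runs could be summarized that way is shown below. It assumes the spread is the sample standard deviation over runs, which is not stated here; the function `summarize` and the per-run scores are hypothetical.

```python
from statistics import mean, stdev


def summarize(run_scores):
    """Format a list of per-run scores as 'mean ± std' with 3 decimals."""
    return f"{mean(run_scores):.3f} ± {stdev(run_scores):.3f}"


# Hypothetical scores from five runs of one model on one task.
hypothetical_runs = [0.80, 0.79, 0.81, 0.80, 0.80]
print(summarize(hypothetical_runs))
```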

Smoke test

To check that the benchmark pipeline runs end to end:

./smoke_test.sh

Citation

If you use RoBERTuito, please cite our paper:

@misc{perez2021robertuito,
      title={RoBERTuito: a pre-trained language model for social media text in Spanish},
      author={Juan Manuel Pérez and Damián A. Furman and Laura Alonso Alemany and Franco Luque},
      year={2021},
      eprint={2111.09453},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
