Official source for Spanish language models and resources developed at BSC-TeMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Last update: Dec 20, 2022

Overview

Spanish Language Models 💃🏻

Corpora 📃

Corpus	Number of documents	Size (GB)
BNE	201,080,084	570

Models 🤖

RoBERTa-base BNE: https://huggingface.co/BSC-TeMU/roberta-base-bne
RoBERTa-large BNE: https://huggingface.co/BSC-TeMU/roberta-large-bne
Other models: (WIP)

Word embeddings 🔤

300-dimensional word embeddings trained with FastText:

CBOW Word embeddings: https://zenodo.org/record/5044988
Skip-gram Word embeddings: https://zenodo.org/record/5046525
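
The following is a minimal sketch of how such embeddings could be read and compared, assuming the downloaded archive contains a file in FastText's textual `.vec` format (a header line with the vocabulary size and dimension, then one word and its vector per line). The actual file names and formats in the Zenodo records are not stated here, so `load_vec` and `cosine` are hypothetical helpers rather than part of this project.

```python
import math


def load_vec(lines, limit=None):
    """Parse FastText .vec text: a 'count dim' header, then 'word v1 ... vdim' per line.

    `lines` is any iterable of strings, for example an open text file.
    """
    iterator = iter(lines)
    _count, dim = map(int, next(iterator).split())
    vectors = {}
    for line in iterator:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines instead of failing
        vectors[parts[0]] = [float(value) for value in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return vectors


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

With a real file, this could be called as `load_vec(open(path, encoding="utf-8"), limit=100000)`, where `path` is wherever the archive was extracted; `limit` keeps memory use manageable for a large vocabulary.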

Evaluation ✅

Dataset	Metric	RoBERTa-base	RoBERTa-large	BETO	mBERT	BERTIN
UD-POS	F1	0.9907	0.9901	0.9900	0.9886	0.9904
CoNLL-NER	F1	0.8851	0.8772	0.8759	0.8691	0.8627
Capitel-POS	F1	0.9846	0.9851	0.9836	0.9839	0.9826
Capitel-NER	F1	0.8959	0.8998	0.8771	0.8810	0.8741
STS	Combined	0.8423	0.8420	0.8216	0.8249	0.7822
MLDoc	Accuracy	0.9595	0.9600	0.9650	0.9560	0.9673
PAWS-X	F1	0.9035	0.9000	0.8915	0.9020	0.8820
XNLI	Accuracy	0.8016	WIP	0.8130	0.7876	WIP
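
As a reading aid, the scores in the table above can be placed in a small Python structure to find the top-scoring model for each dataset. This is only a sketch and not part of the released code: the helper name `best_model` is hypothetical, and results still in progress are stored as `None` and ignored.

```python
# Columns of the evaluation table, in order.
MODELS = ["RoBERTa-base", "RoBERTa-large", "BETO", "mBERT", "BERTIN"]

# Scores taken from the evaluation table; None marks results still in progress.
SCORES = {
    "UD-POS": [0.9907, 0.9901, 0.9900, 0.9886, 0.9904],
    "CoNLL-NER": [0.8851, 0.8772, 0.8759, 0.8691, 0.8627],
    "Capitel-POS": [0.9846, 0.9851, 0.9836, 0.9839, 0.9826],
    "Capitel-NER": [0.8959, 0.8998, 0.8771, 0.8810, 0.8741],
    "STS": [0.8423, 0.8420, 0.8216, 0.8249, 0.7822],
    "MLDoc": [0.9595, 0.9600, 0.9650, 0.9560, 0.9673],
    "PAWS-X": [0.9035, 0.9000, 0.8915, 0.9020, 0.8820],
    "XNLI": [0.8016, None, 0.8130, 0.7876, None],
}


def best_model(dataset):
    """Return the name of the highest-scoring model with a reported result."""
    reported = [
        (score, name)
        for name, score in zip(MODELS, SCORES[dataset])
        if score is not None
    ]
    return max(reported, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    for dataset in SCORES:
        print(f"{dataset}: {best_model(dataset)}")
```

For instance, `best_model("MLDoc")` returns `BERTIN`, and `best_model("XNLI")` returns `BETO` among the models with reported results.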

Usage example ⚗️

For RoBERTa-base:

from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('BSC-TeMU/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('BSC-TeMU/roberta-base-bne')
model.eval()  # inference mode: disables dropout

# Predict the most likely tokens for the <mask> position.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])

For RoBERTa-large:

from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('BSC-TeMU/roberta-large-bne')
model = AutoModelForMaskedLM.from_pretrained('BSC-TeMU/roberta-large-bne')
model.eval()  # inference mode: disables dropout

# Predict the most likely tokens for the <mask> position.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])
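
The two snippets above differ only in the model name, so they could be folded into one helper. The sketch below assumes the `pipeline("fill-mask", ...)` factory from `transformers` and the usual fill-mask output for a single mask (a list of dicts with `token_str` and `score` keys); the function names `fill_mask` and `top_tokens` are hypothetical and not part of this project.

```python
def top_tokens(predictions, min_score=0.0):
    """Return (token, score) pairs from fill-mask output, best first."""
    pairs = [(p["token_str"].strip(), float(p["score"])) for p in predictions]
    pairs = [pair for pair in pairs if pair[1] >= min_score]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def fill_mask(text, model_name="BSC-TeMU/roberta-base-bne", top_k=5):
    """Run the fill-mask task for one masked sentence.

    transformers is imported lazily because loading it, and downloading
    the model weights on first use, is slow.
    """
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_name)
    return unmasker(text, top_k=top_k)
```

Calling `top_tokens(fill_mask("¡Hola <mask>!"))` would list the candidate tokens with their scores; passing `model_name="BSC-TeMU/roberta-large-bne"` switches to the large model.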

Other Spanish Language Models 👩‍👧‍👦

We are developing domain-specific language models:

Legal Language Model

Cite 📣

@misc{gutierrezfandino2021spanish,
      title={Spanish Language Models}, 
      author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquín Silveira-Ocampo and Casimiro Pio Carrino and Aitor Gonzalez-Agirre and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Marta Villegas},
      year={2021},
      eprint={2107.07253},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact 📧

📋 We are interested in (1) extending our corpora to build larger models and (2) training and evaluating the models on other tasks.

For questions regarding this work, contact Asier Gutiérrez-Fandiño ([email protected])

Owner

PlanTL-SANIDAD
