Language models for the Spanish legal domain, developed at BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Last update: Nov 14, 2022

Overview

Spanish legal domain Language Model ⚖️

This repository is the landing page for three main resources for the Spanish legal domain:

A RoBERTa model: https://huggingface.co/PlanTL-GOB-ES/RoBERTalex
FastText embeddings: https://zenodo.org/record/5036147
Legal corpora: https://zenodo.org/record/5495529

The repository and the pre-print will be updated with larger models, evaluations, and more.

Why ❓

There are few models trained for the Spanish language, and some of them were trained on limited, unclean corpora. The models derived from the Spanish National Plan for Language Technologies are proficient at solving several tasks and were trained on large-scale, clean corpora. However, Spanish legal language could be thought of as an independent language of its own. We therefore created a Spanish legal model from scratch, trained exclusively on legal corpora.

Evaluation ✅

Work in progress.

Corpora 📃

Corpus name	Size (GB)	Tokens (M)
Procesos Penales	0.625	0.119
JRC Acquis	0.345	59.359
Códigos Electrónicos Universitarios	0.077	11.835
Códigos Electrónicos	0.080	12.237
Doctrina de la Fiscalía General del Estado	0.017	2.669
Legislación BOE	3.600	578.685
Abogacía del Estado BOE	0.037	6.123
Consejo de Estado: Dictámenes	0.827	135.348
Spanish EURLEX	0.001	0.072
UN Resolutions	0.023	3.539
Spanish DOGC	0.826	132.569
Spanish MultiUN	2.200	352.653
Consultas Tributarias Generales y Vinculantes	0.466	77.691
Constitución Española	0.002	0.018
COPPA Patents Corpus	0.002	-
Biomedical Patents	0.083	-
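
As a quick sanity check on the table above, the following sketch adds up the sizes and token counts. The figures are copied from the table; the names CORPORA and summarize are illustrative and not part of this repository, and the two patent corpora, which have no reported token count, are assumed to be excluded from the token total.

```python
# Rows copied from the corpora table: (name, size in GB, tokens in millions).
# None marks the two patent corpora, whose token counts are not reported.
CORPORA = [
    ("Procesos Penales", 0.625, 0.119),
    ("JRC Acquis", 0.345, 59.359),
    ("Códigos Electrónicos Universitarios", 0.077, 11.835),
    ("Códigos Electrónicos", 0.080, 12.237),
    ("Doctrina de la Fiscalía General del Estado", 0.017, 2.669),
    ("Legislación BOE", 3.600, 578.685),
    ("Abogacía del Estado BOE", 0.037, 6.123),
    ("Consejo de Estado: Dictámenes", 0.827, 135.348),
    ("Spanish EURLEX", 0.001, 0.072),
    ("UN Resolutions", 0.023, 3.539),
    ("Spanish DOGC", 0.826, 132.569),
    ("Spanish MultiUN", 2.200, 352.653),
    ("Consultas Tributarias Generales y Vinculantes", 0.466, 77.691),
    ("Constitución Española", 0.002, 0.018),
    ("COPPA Patents Corpus", 0.002, None),
    ("Biomedical Patents", 0.083, None),
]


def summarize(corpora):
    """Return (total GB, total M tokens, number of corpora with a token count)."""
    total_gb = sum(size for _, size, _ in corpora)
    counted = [tokens for _, _, tokens in corpora if tokens is not None]
    return round(total_gb, 3), round(sum(counted), 3), len(counted)


total_gb, total_tokens_m, n_counted = summarize(CORPORA)
print(f"{total_gb} GB in total; {total_tokens_m} M tokens across {n_counted} corpora")
```

With the figures as listed, this comes to about 9.2 GB overall and roughly 1,373 M tokens across the 14 corpora that report a token count.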

Usage example ⚗️

You can fine-tune the model for different downstream tasks using the scripts that Hugging Face provides (Named Entity Recognition, GLUE tasks, and others). The snippet below loads the model and fills in a masked token:

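from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/RoBERTalex')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/RoBERTalex')
model.eval()  # inference mode (disables dropout)

# Build a fill-mask pipeline and predict the token hidden behind <mask>.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)

# Print the candidate tokens proposed for the masked position.
pprint([r['token_str'] for r in res_hf])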
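
For a single-mask input, the fill-mask pipeline returns a list of dictionaries with the keys score, token, token_str and sequence. The sketch below shows one possible way to post-process such a list: keep only candidates above a score threshold and strip the leading space that RoBERTa-style tokenizers often attach to tokens. The helper top_tokens, the min_score parameter and the sample predictions are hypothetical and for illustration only; they are not actual RoBERTalex output.

```python
def top_tokens(predictions, min_score=0.05):
    """Return (token, score) pairs with score >= min_score, best first."""
    kept = [
        (p["token_str"].strip(), p["score"])
        for p in predictions
        if p["score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical predictions in the shape the pipeline returns; the scores,
# token ids and tokens are made up and are NOT real model output.
sample = [
    {"score": 0.40, "token": 101, "token_str": " mundo", "sequence": "¡Hola mundo!"},
    {"score": 0.02, "token": 102, "token_str": " señor", "sequence": "¡Hola señor!"},
    {"score": 0.25, "token": 103, "token_str": " amigo", "sequence": "¡Hola amigo!"},
]

print(top_tokens(sample))
```

In practice, the same helper could be called on res_hf from the usage example instead of the made-up sample.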

Cite 📣

If this work is helpful, please cite it:

@misc{gutierrezfandino2021legal,
      title={Spanish Legalese Language Model and Corpora}, 
      author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Aitor Gonzalez-Agirre and Marta Villegas},
      year={2021},
      eprint={2110.12201},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact 📧

📋 We are interested in (1) extending our corpora to build larger models and (2) evaluating and training the model on other tasks.

For questions regarding this work, contact Asier Gutiérrez-Fandiño ([email protected])


