A single model that parses Universal Dependencies across 75 languages.

UDify

MIT License

UDify is a single model that jointly predicts Universal Dependencies annotations (UPOS, UFeats, Lemmas, Deps) for input in any of 75 supported languages. It was trained on the 124 treebanks of UD v2.3. This repository accompanies the paper "75 Languages, 1 Model: Parsing Universal Dependencies Universally" and provides tools to train a multilingual model capable of parsing any Universal Dependencies treebank with high accuracy. The project also supports training and evaluation for the SIGMORPHON 2019 Shared Task #2, where the model achieved 1st place in morphology tagging (the corresponding paper is named in the SIGMORPHON 2019 Shared Task section).

Integration with spaCy is supported by Camphr.

[Figure: UDify model architecture]

The project is built using AllenNLP and PyTorch.

Getting Started

Install the Python packages listed in requirements.txt. UDify depends on AllenNLP and PyTorch. On Windows, use WSL. Optionally, install TensorFlow to get TensorBoard, which provides a rich visualization of model performance on each UD task.

pip install -r ./requirements.txt

Download the UD corpus by running the script

bash ./scripts/download_ud_data.sh

or alternatively download the data from universaldependencies.org and extract into data/ud-treebanks-v2.3/, then run scripts/concat_ud_data.sh to generate the multilingual UD dataset.
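As a rough illustration of what generating a multilingual dataset involves, the sketch below concatenates the per-treebank split files into one file per split. It is not the implementation of scripts/concat_ud_data.sh; the function name `concat_treebanks` and the output file names (`train.conllu`, `dev.conllu`, `test.conllu`) are assumptions made for the example. It relies only on the UD naming convention in which split files end in `-ud-train.conllu`, `-ud-dev.conllu`, and `-ud-test.conllu`.

```python
from pathlib import Path


def concat_treebanks(treebank_root, output_dir, splits=("train", "dev", "test")):
    """Concatenate every *-ud-<split>.conllu file found one level below
    treebank_root into a single file per split inside output_dir.

    Returns a dict mapping each split name to the number of source files.
    The output file names are hypothetical, chosen for this sketch.
    """
    treebank_root = Path(treebank_root)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    counts = {}
    for split in splits:
        # Sort so the concatenation order is the same on every run.
        sources = sorted(treebank_root.glob(f"*/*-ud-{split}.conllu"))
        target = output_dir / f"{split}.conllu"
        with target.open("w", encoding="utf-8") as out:
            for source in sources:
                text = source.read_text(encoding="utf-8")
                # CoNLL-U sentences are separated by a blank line, so make
                # sure each file ends with exactly one before appending more.
                out.write(text.rstrip("\n") + "\n\n")
        counts[split] = len(sources)
    return counts
```

For example, `concat_treebanks("data/ud-treebanks-v2.3", "data/my-multilingual")` would write three files under the (hypothetical) directory `data/my-multilingual`. Use the provided script for actual training, since the configs expect the paths it produces.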

Training the Model

Before training, make sure the dataset is downloaded and extracted into the data directory and the multilingual dataset is generated with scripts/concat_ud_data.sh. To train the multilingual model (fine-tuning BERT on UD), run the command

python train.py --config config/ud/multilingual/udify_bert_finetune_multilingual.json --name multilingual

which will begin loading the dataset and model before training the network. The model metrics, vocab, and weights will be saved under logs/multilingual. Note that this process is highly memory intensive and requires 16+ GB of RAM and 12+ GB of GPU memory (requirements are half if fp16 is enabled in AllenNLP, but this requires custom changes to the library). The training may take 20 or more days to complete all 80 epochs depending on the type of your GPU.

Training on Other Datasets

An example config is provided for fine-tuning on English EWT alone. Run:

python train.py --config config/ud/en/udify_bert_finetune_en_ewt.json --name en_ewt --dataset_dir data/ud-treebanks-v2.3/

To train on your own dataset, copy config/ud/multilingual/udify_bert_finetune_multilingual.json and modify the following JSON parameters:

  • train_data_path, validation_data_path, and test_data_path to the paths of the dataset conllu files. Any of these can optionally be null.
  • directory_path to data/vocab/ /vocabulary .
  • warmup_steps and start_step to be equal to the number of steps in the first epoch. A good initial value is in the range 100-1000. Alternatively, run the training script first to see the number of steps to the right of the progress bar.
  • If using just one treebank, optionally add xpos to the tasks list.
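The parameter changes listed above can also be scripted. The sketch below is one hedged way to do it: it replaces named keys wherever they occur in the config, so it does not assume where `warmup_steps` or `directory_path` are nested. It does assume the config file parses as plain JSON. The helper names `override_keys` and `write_custom_config` are hypothetical and not part of UDify.

```python
import copy
import json


def override_keys(config, overrides):
    """Return a copy of config in which every key named in overrides is
    replaced by the given value, at any nesting depth."""

    def visit(node):
        if isinstance(node, dict):
            for key in node:
                if key in overrides:
                    # Replace the value and do not descend into it.
                    node[key] = copy.deepcopy(overrides[key])
                else:
                    visit(node[key])
        elif isinstance(node, list):
            for item in node:
                visit(item)

    updated = copy.deepcopy(config)  # leave the caller's dict untouched
    visit(updated)
    return updated


def write_custom_config(base_path, new_path, overrides):
    """Read a JSON config, apply the overrides, and write the result."""
    with open(base_path, encoding="utf-8") as handle:
        config = json.load(handle)
    with open(new_path, "w", encoding="utf-8") as handle:
        json.dump(override_keys(config, overrides), handle, indent=2)
```

A call might look like `write_custom_config(base, "config/ud/my_treebank.json", {"train_data_path": "path/to/train.conllu", "warmup_steps": 500, "start_step": 500})`, where the output path and the values are placeholders. Because matching is by key name only, check the written file to confirm that no unrelated key with the same name was changed.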

Viewing Model Performance

To view how well the models are performing, run TensorBoard:

tensorboard --logdir logs

This should show the model currently being trained as well as any previously trained models. Each run is stored in a folder named after the --name parameter, inside a subfolder named with a date stamp, e.g., logs/multilingual/2019.07.03_11.08.51.
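Since run folders are named with a date stamp, the most recent run can be located by parsing the folder names. The sketch below infers the format `%Y.%m.%d_%H.%M.%S` from the example path above, which is an assumption rather than a documented guarantee; `parse_stamp` and `latest_run` are hypothetical helper names.

```python
from datetime import datetime
from pathlib import Path

# Format inferred from the example folder name "2019.07.03_11.08.51".
STAMP_FORMAT = "%Y.%m.%d_%H.%M.%S"


def parse_stamp(name):
    """Return the datetime encoded in a run folder name, or None if the
    name does not match the expected date stamp format."""
    try:
        return datetime.strptime(name, STAMP_FORMAT)
    except ValueError:
        return None


def latest_run(log_dir, name):
    """Return the path of the newest date-stamped run folder under
    <log_dir>/<name>, or None if there is none."""
    runs = []
    for child in (Path(log_dir) / name).iterdir():
        stamp = parse_stamp(child.name)
        if child.is_dir() and stamp is not None:
            runs.append((stamp, child))
    if not runs:
        return None
    return max(runs, key=lambda run: run[0])[1]
```

For instance, `latest_run("logs", "multilingual")` would return the newest folder under logs/multilingual, which is convenient when several runs share the same --name.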

Pretrained Models

Pretrained models can be found here. They can be used for predicting conllu annotations or for fine-tuning. The link contains the following:

  • udify-model.tar.gz - The full UDify model archive that can be used for prediction with predict.py. Note that this model has been trained for extra epochs, and may differ slightly from the model shown in the original research paper.
  • udify-bert.tar.gz - The extracted BERT weights from the UDify model, in huggingface transformers (pytorch-pretrained-bert) format.

Predicting Universal Dependencies from a Trained Model

To predict UD annotations, one can supply the path to the trained model and an input conllu-formatted file:

python predict.py <archive> <input.conllu> <output.conllu> [--eval_file results.json]

For instance, predicting the dev set of English EWT with the trained model saved under logs/model.tar.gz and UD treebanks at data/ud-treebanks-v2.3 can be done with

python predict.py logs/model.tar.gz data/ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-dev.conllu logs/pred.conllu --eval_file logs/pred.json

and will save the output predictions to logs/pred.conllu and evaluation to logs/pred.json.
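To inspect the predictions directly, the output file can be compared with the gold file token by token. The sketch below is a simplified check, not the official UD evaluation and not the code behind --eval_file: it assumes both files contain the same tokens in the same order and relies only on the standard ten-column CoNLL-U layout (UPOS in column 4, HEAD in column 7, DEPREL in column 8). The names `read_tokens` and `score` are hypothetical.

```python
def read_tokens(path):
    """Return a list of (upos, head, deprel) tuples, one per syntactic
    word in a CoNLL-U file."""
    tokens = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # sentence boundary or comment line
            columns = line.split("\t")
            if len(columns) != 10:
                continue  # not a well-formed token line
            token_id = columns[0]
            # Skip multiword token ranges ("1-2") and empty nodes ("1.1").
            if "-" in token_id or "." in token_id:
                continue
            tokens.append((columns[3], columns[6], columns[7]))
    return tokens


def score(gold_path, pred_path):
    """Compute UPOS accuracy, UAS, and LAS over aligned tokens."""
    gold = read_tokens(gold_path)
    pred = read_tokens(pred_path)
    if len(gold) != len(pred):
        raise ValueError("gold and predicted files have different token counts")
    total = len(gold)
    if total == 0:
        return {"upos": 0.0, "uas": 0.0, "las": 0.0}
    pairs = list(zip(gold, pred))
    return {
        "upos": sum(g[0] == p[0] for g, p in pairs) / total,
        # UAS: the head index matches; LAS: head and relation both match.
        "uas": sum(g[1] == p[1] for g, p in pairs) / total,
        "las": sum(g[1:] == p[1:] for g, p in pairs) / total,
    }
```

With the example above, `score("data/ud-treebanks-v2.3/UD_English-EWT/en_ewt-ud-dev.conllu", "logs/pred.conllu")` gives a quick sanity check. The numbers may differ slightly from official scores, which follow the shared task's own alignment and matching rules.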

Configuration Options

  1. One can specify the type of device to run on. For a single GPU, use the flag --device 0, or --device -1 for CPU.
  2. To skip waiting for the dataset to be fully loaded into memory, use the flag --lazy. Note that the dataset won't be shuffled.
  3. Resume an existing training run with the --resume flag.
  4. Specify a config file by passing its path to --config.

SIGMORPHON 2019 Shared Task

A modification to the basic UDify model is available for parsing morphology in the SIGMORPHON 2019 Shared Task #2. The following paper describes the model in more detail: "Cross-Lingual Lemmatization and Morphology Tagging with Two-Stage Multilingual BERT Fine-Tuning".

Training is similar to UD: run download_sigmorphon_data.sh, then use the configuration file under config/sigmorphon/multilingual, e.g.,

python train.py --config config/sigmorphon/multilingual/udify_bert_sigmorphon_multilingual.json --name sigmorphon

FAQ

  1. When fine-tuning, my scores/metrics show poor performance.

It should take about 10 epochs to start seeing good scores coming from all the metrics, and 80 epochs to be competitive with UDPipe Future.

One caveat is that if you use a subset of treebanks for fine-tuning instead of all 124 UD v2.3 treebanks, you must modify the configuration file. Make sure to tune the learning rate scheduler to the number of training steps. Copy the udify_bert_finetune_multilingual.json config and modify the "warmup_steps" and "start_step" values. A good initial choice would be to set both to be equal to the number of training batches of one epoch (run the training script first to see the batches remaining, to the right of the progress bar).
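The number of batches in one epoch can also be estimated before launching a run. The sketch below counts sentences in a conllu file and divides by the batch size. It assumes one training step per batch and that no sentences are filtered out during loading; the batch size must be taken from your own config. The function names are hypothetical, and the progress bar remains the authoritative source.

```python
import math


def count_sentences(path):
    """Count sentences in a CoNLL-U file. A sentence is a run of token
    lines terminated by a blank line (or by the end of the file)."""
    count = 0
    in_sentence = False
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                if not line.startswith("#"):
                    in_sentence = True  # at least one token line seen
            else:
                if in_sentence:
                    count += 1
                in_sentence = False
    if in_sentence:
        count += 1  # file ended without a trailing blank line
    return count


def steps_per_epoch(num_sentences, batch_size):
    """Estimate batches per epoch; the last batch may be partial."""
    return math.ceil(num_sentences / batch_size)
```

For example, a treebank of 100 sentences with a batch size of 32 gives `steps_per_epoch(100, 32)`, which is 4, so both "warmup_steps" and "start_step" would be set to 4 under the rule of thumb above.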

Have a question not listed here? Open a GitHub Issue.

Citing This Research

If you use UDify for your research, please cite this work as:

@inproceedings{kondratyuk-straka-2019-75,
    title = {75 Languages, 1 Model: Parsing Universal Dependencies Universally},
    author = {Kondratyuk, Dan and Straka, Milan},
    booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
    year = {2019},
    address = {Hong Kong, China},
    publisher = {Association for Computational Linguistics},
    url = {https://www.aclweb.org/anthology/D19-1279},
    pages = {2779--2795}
}