The source code of "Language Models are Few-shot Multilingual Learners" (MRL @ EMNLP 2021)

Overview

Language Models are Few-shot Multilingual Learners

License: MIT

Paper

This is the source code of the paper [Arxiv] [ACL Anthology]:

This code has been written using PyTorch. If you use source codes or datasets included in this toolkit in your work, please cite the following paper:

@inproceedings{winata-etal-2021-language,
    title = "Language Models are Few-shot Multilingual Learners",
    author = "Winata, Genta Indra  and
      Madotto, Andrea  and
      Lin, Zhaojiang  and
      Liu, Rosanne  and
      Yosinski, Jason  and
      Fung, Pascale",
    booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrl-1.1",
    pages = "1--15",
}


Setup Environment

GPU Machine

pip install -r requirements.txt

GPU Machine for Running GPT-J 6B Model

apt install zstd

# the "slim" version contain only bf16 weights and no optimizer parameters, which minimizes bandwidth and memory
wget -c https://the-eye.eu/public/AI/GPT-J-6B/step_383500_slim.tar.zstd

tar -I zstd -xf step_383500_slim.tar.zstd

pip install -r mesh_transformer_jax/requirements.txt

# jax 0.2.12 is required due to a regression with xmap in 0.2.13
pip install mesh-transformer-jax/ jax==0.2.12

# cuda[your_cuda_version]
pip install jaxlib==0.1.67+cuda101 -f https://storage.googleapis.com/jax-releases/jax_releases.html

How to run

Zero-shot Cross-task

❱❱❱ CUDA_VISIBLE_DEVICES=0 python evaluate.py  --dataset snips --model_checkpoint facebook/bart-large-mnli --cuda --length 5 --label_type value --src_lang en --tgt_lang en --seed 42 --use_log_prob --use_confidence --is_cross_task

Finetune

❱❱❱ CUDA_VISIBLE_DEVICES=0 python finetune.py  --dataset snips --model_checkpoint bert-base-multilingual-uncased --cuda --label_type value --src_lang en --tgt_lang en --seed 42 
Owner
Genta Indra Winata
Researcher @ Bloomberg LP. Natural Language Processing, Speech, Multilingual, Code-switching, Dialogue
Genta Indra Winata
Code for ACL 2020 paper "Rigid Formats Controlled Text Generation"

SongNet SongNet: SongCi + Song (Lyrics) + Sonnet + etc. @inproceedings{li-etal-2020-rigid, title = "Rigid Formats Controlled Text Generation",

Piji Li 212 Dec 17, 2022
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Google Research Datasets 740 Dec 24, 2022
VMD Audio/Text control with natural language

This repository is a proof of principle for performing Molecular Dynamics analysis, in this case with the program VMD, via natural language commands.

Andrew White 13 Jun 09, 2022
Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Udit Arora 19 Oct 28, 2022
Library for Russian imprecise rhymes generation

TOM RHYMER Library for Russian imprecise rhymes generation. Quick Start Generate rhymes by any given rhyme scheme (aabb, abab, aaccbb, etc ...): from

Alexey Karnachev 6 Oct 18, 2022
Rethinking the Truly Unsupervised Image-to-Image Translation - Official PyTorch Implementation (ICCV 2021)

Rethinking the Truly Unsupervised Image-to-Image Translation (ICCV 2021) Each image is generated with the source image in the left and the average sty

Clova AI Research 436 Dec 27, 2022
Edge-Augmented Graph Transformer

Edge-augmented Graph Transformer Introduction This is the official implementation of the Edge-augmented Graph Transformer (EGT) as described in https:

Md Shamim Hussain 21 Dec 14, 2022
Code for Text Prior Guided Scene Text Image Super-Resolution

Code for Text Prior Guided Scene Text Image Super-Resolution

82 Dec 26, 2022
Code for "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022.

README Code for Two-stage Identifier: "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022. For details of the model a

Yongliang Shen 45 Nov 29, 2022
Sentiment-Analysis and EDA on the IMDB Movie Review Dataset

Sentiment-Analysis and EDA on the IMDB Movie Review Dataset The main part of the work focuses on the exploration and study of different approaches whi

Nikolas Petrou 1 Jan 12, 2022
Unofficial implementation of Google's FNet: Mixing Tokens with Fourier Transforms

FNet: Mixing Tokens with Fourier Transforms Pytorch implementation of Fnet : Mixing Tokens with Fourier Transforms. Citation: @misc{leethorp2021fnet,

Rishikesh (ऋषिकेश) 217 Dec 05, 2022
Text vectorization tool to outperform TFIDF for classification tasks

WHAT: Supervised text vectorization tool Textvec is a text vectorization tool, with the aim to implement all the "classic" text vectorization NLP meth

186 Dec 29, 2022
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
This is a general repo that helps you develop fast/effective NLP classifiers using Huggingface

NLP Classifier Introduction This project trains a bert model on any NLP classifcation model. And uses the model in make predictions on new data using

Abdullah Tarek 3 Mar 11, 2022
This repository details the steps in creating a Part of Speech tagger using Trigram Hidden Markov Models and the Viterbi Algorithm without using external libraries.

POS-Tagger This repository details the creation of a Part-of-Speech tagger using Trigram Hidden Markov Models to predict word tags in a word sequence.

Raihan Ahmed 1 Dec 09, 2021
Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipB

Jie Lei 雷杰 612 Jan 04, 2023
Quantifiers and Negations in RE Documents

Quantifiers-and-Negations-in-RE-Documents This project was part of my work for a

Nicolas Ruscher 1 Feb 01, 2022
Fuzzy String Matching in Python

FuzzyWuzzy Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

SeatGeek 8.8k Jan 01, 2023
Suite of 500 procedurally-generated NLP tasks to study language model adaptability

TaskBench500 The TaskBench500 dataset and code for generating tasks. Data The TaskBench dataset is available under wget http://web.mit.edu/bzl/www/Tas

Belinda Li 20 May 17, 2022
Code-autocomplete, a code completion plugin for Python

Code AutoComplete code-autocomplete, a code completion plugin for Python.

xuming 13 Jan 07, 2023