GPT-2 in Catalan

This repository documents a toy attempt at creating a generative text model in Catalan, based on GPT-2. In other words, this is more of a prototype and a personal playground than a serious attempt to build a fully functional GPT-2 in Catalan.

Nevertheless, I hope this can also help someone else train their own GPT-2 model and provide some pointers on how to do so.

Suggestions and constructive criticism are always welcome!

1. GPT-2

1.1. What is GPT-2 ❓

GPT-2 (Generative Pre-trained Transformer 2) is a transformer-based language model trained on large volumes of data, without a specific task in mind. Nevertheless, it has probably been used mostly for generating new text.

A better and more thorough explanation can be found here (http://jalammar.github.io/illustrated-gpt2/).

1.2. Why GPT-2 ❔

It is undeniable that GPT-2 played a large role and became very popular when it came out. It also created some controversy. Those aside, GPT-2 was a big step forward in text generation, and it is also "faster" to train on custom data than its next-generation sibling, GPT-3.

2. Training 🔨

2.1. Requirements 📎

You will need a powerful GPU, or you will have to reduce the batch size. You can also use a VM from a cloud service such as Google Colab or Microsoft Azure.
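If the GPU cannot hold the batch size you want, a common workaround is gradient accumulation: run several small batches and update the weights once, so the effective batch size stays the same. The sketch below only does the arithmetic. The function name and the numbers are illustrative and are not taken from train_GPT2.py, which may or may not support accumulation.

```python
def accumulation_steps(target_batch_size: int, per_step_batch_size: int) -> int:
    """Return how many small batches to accumulate to reach the target batch size."""
    if target_batch_size <= 0 or per_step_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    # Integer ceiling division, so the effective batch is never smaller than the target.
    return (target_batch_size + per_step_batch_size - 1) // per_step_batch_size


# Hypothetical numbers: an effective batch of 32 is wanted, but only 4 samples fit in memory.
steps = accumulation_steps(32, 4)
print(steps)      # 8
print(steps * 4)  # 32
```

Smaller per-step batches trade speed for memory: the run takes more forward and backward passes per weight update, but peak memory use drops.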

2.2. Training Script 📈

The training is implemented in the train_GPT2.py script, which serves as a skeleton. You can run it from the command line, passing all the arguments.

e.g.

cd src
./train_GPT2.py \
    --model DeepESP/gpt2-spanish \
    --tokenizer DeepESP/gpt2-spanish \
    --train_path ../data/catalan_corpus_train.csv \
    --test_path ../data/catalan_corpus_test.csv \
    --n_epochs 1 \
    --train_batch_size 4 \
    --eval_batch_size 8 \
    --eval_steps 100 \
    --save_steps 1000 \
    --warmup_steps 100 \
    --output gpt2-catalan
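The flags in the command above could be read with argparse along these lines. This is only a sketch: the flag names come from the example command, but the types, defaults, and help texts are assumptions, and the actual train_GPT2.py may define them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the example command (types and defaults assumed)."""
    parser = argparse.ArgumentParser(description="Fine-tune a GPT-2 model (sketch).")
    parser.add_argument("--model", required=True, help="Base model name or local path")
    parser.add_argument("--tokenizer", required=True, help="Tokenizer name or local path")
    parser.add_argument("--train_path", required=True, help="CSV file with training text")
    parser.add_argument("--test_path", required=True, help="CSV file with evaluation text")
    parser.add_argument("--n_epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=4)
    parser.add_argument("--eval_batch_size", type=int, default=8)
    parser.add_argument("--eval_steps", type=int, default=100)
    parser.add_argument("--save_steps", type=int, default=1000)
    parser.add_argument("--warmup_steps", type=int, default=100)
    parser.add_argument("--output", default="gpt2-catalan", help="Output directory")
    return parser


# Parse the values from the example command, passed as a list instead of sys.argv.
args = build_parser().parse_args([
    "--model", "DeepESP/gpt2-spanish",
    "--tokenizer", "DeepESP/gpt2-spanish",
    "--train_path", "../data/catalan_corpus_train.csv",
    "--test_path", "../data/catalan_corpus_test.csv",
    "--n_epochs", "1",
    "--train_batch_size", "4",
])
print(args.model, args.n_epochs, args.train_batch_size, args.output)
```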

2.3. About the data used 📂

The data used is mostly the WikiCorpus data provided by the Computer Science department at FIB, UPC (Facultat d'Informàtica de Barcelona, Universitat Politècnica de Catalunya).

You can download it using the datasets library from Hugging Face:

from datasets import load_dataset

# "raw_ca" selects the raw Catalan configuration of the WikiCorpus dataset
dataset = load_dataset("wikicorpus", "raw_ca")

Or you can use the download_wikicorpus.py file in this repository, which also splits the data into train and test sets and can create a smaller subset for testing, if desired.
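A train/test split with an optional smaller subset can be sketched as follows. This is not the code from download_wikicorpus.py: the function name, the 90/10 ratio, and the seed are assumptions chosen for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple


def split_corpus(
    texts: Sequence[str],
    test_fraction: float = 0.1,
    subset_size: Optional[int] = None,
    seed: int = 42,
) -> Tuple[List[str], List[str]]:
    """Shuffle the texts, optionally keep a subset, and return (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(texts)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    if subset_size is not None:
        shuffled = shuffled[:subset_size]
    # Reserve at least one sample for the test set.
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


docs = [f"document {i}" for i in range(100)]
train, test = split_corpus(docs)
print(len(train), len(test))  # 90 10

small_train, small_test = split_corpus(docs, subset_size=20)
print(len(small_train), len(small_test))  # 18 2
```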

2.3.1. WikiCorpus PROs 👍

Well, the data is already obtained. That's always a pro.

2.3.2. WikiCorpus CONs 👎

We are limiting the knowledge of the language model to data from Wikipedia. Therefore, this model will probably be more error-prone with informal text inputs, such as chat messages, colloquialisms, and text from social media.

Additionally, the size of the data is tiny with respect to what it should be.

Further training for specific tasks ⚡

Once the model is trained in Catalan and we have a base, we can train it further with a specific task in mind.

A couple of proofs of concept (PoCs) have been done using data gathered from Twitter and from Catalan songs.

Testing the model

We can test the trained model easily using the script test_generation.py.

cd src
python test_generation.py -t DeepESP/gpt2-spanish -m ../data/gpt2-catalan -i generation_test.txt

3. Questions ❓ ❔

3.1. Why Catalan ❓

Artificial Intelligence should not be only for widely spoken languages, such as English or even Spanish. Catalan, a minority language, is my mother tongue, and it's always fun to see something you work with also operating in your own language. So why not?

3.2. Why use a Pretrained model in Spanish ❔

Although Spanish and Catalan are different languages, they share a lot of expressions, vocabulary and grammatical structures. Therefore, basing a Catalan model on a previously trained model in a close language such as Spanish is not unreasonable.

Transferring the knowledge from it to our model is better than starting from zero, especially to save computational time.

3.3. Can I use other data or another language ❓

Even though the scripts are all prepared with the Catalan language in mind, they should work with any text data, not only Catalan from the WikiCorpus.

Feel free to change the CatalanDataset class or swap it with your own, since the formatting of the input text is probably what varies most between projects.
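A replacement dataset class mostly needs to turn raw rows into the text format the model should see. The sketch below follows the map-style interface (__len__ and __getitem__) that PyTorch data loaders accept, but it is plain Python with no dependencies. The class name, the format_fn hook, and the example rows are hypothetical; the real CatalanDataset may tokenize its inputs and format them differently.

```python
from typing import Callable, List, Sequence


def collapse_whitespace(text: str) -> str:
    """Example formatter: trim the text and squeeze runs of whitespace into single spaces."""
    return " ".join(text.split())


class SimpleTextDataset:
    """Map-style dataset: sized and indexable, holding formatted text samples."""

    def __init__(
        self,
        texts: Sequence[str],
        format_fn: Callable[[str], str] = collapse_whitespace,
    ) -> None:
        # Apply the project-specific formatting once and drop samples that end up empty.
        formatted = (format_fn(text) for text in texts)
        self.samples: List[str] = [sample for sample in formatted if sample]

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int) -> str:
        return self.samples[index]


rows = ["  Bon dia a tothom!  ", "", "Avui   fa sol\n a Barcelona."]
dataset = SimpleTextDataset(rows)
print(len(dataset))  # 2
print(dataset[1])    # Avui fa sol a Barcelona.
```

Keeping the formatting in a separate function means a new project only has to pass a different format_fn, while the rest of the class stays the same.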

Be sure to also change the base model: if you want to train a model for another language (e.g. German), basing it on a model pre-trained in Spanish will not work well.

4. TO-DO 🚧

Since we are using a transfer learning approach and relying on a model previously pre-trained in Spanish, we probably don't have as accurate a model as we should.

More varied data should also be used during training, because the current data is heavily biased towards informative text (for obvious reasons).

Owner
Laura