⛔️ DEPRECATED This was a personal project I built to help in my research. As it got bigger, I thought others could find some of its features useful. However, with the fast development of NLP, others are doing it faster and better (e.g. HuggingFace or fairseq). As such, I am no longer maintaining this repository.
Catbird is an open-source paraphrase generation toolkit based on PyTorch.
This is an ongoing, one-person project. Hopefully you find it useful. If you do, do not forget to leave a star 🌟.
- Quora Question Pairs
- MSCOCO
We use HuggingFace's Tokenizers package, so you can easily use any pretrained tokenizer. You can also train your own tokenizer with the BPE, Unigram, WordPiece, or word-level algorithms. For that, you might find the WikiText-103 dataset useful.
We support the following metrics. We currently use the HuggingFace implementations and wrap them for use with PyTorch Ignite.
- BLEU
- METEOR
- TER
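The project relies on the HuggingFace metric implementations, so the following is only an illustration of what BLEU measures: a minimal, unsmoothed, single-reference, sentence-level sketch. The function names are hypothetical, and scores from a full implementation (with smoothing, multiple references, or corpus-level aggregation) will differ.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU against a single reference (illustrative only)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clipped overlap: each candidate n-gram counts at most as often as in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, any zero precision gives a zero score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "how can i learn python quickly".split()
candidate = "how can i learn python fast".split()
score = sentence_bleu(candidate, reference)
```

For a paraphrase task, a very high BLEU against the source sentence can also mean the model is simply copying its input, so the score is best read alongside the other metrics.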
We support teacher forcing for training, and both greedy and beam search for decoding.
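To make the difference between the two decoding strategies concrete, here is a small self-contained sketch. It does not use Catbird's own decoder: `toy_next_token` is a hypothetical stand-in for a model that returns next-token probabilities, and the probabilities are made up so that the two strategies disagree.

```python
import math

EOS = "</s>"


def toy_next_token(prefix):
    """Hypothetical model: next-token probabilities for a given prefix."""
    table = {
        (): {"how": 0.6, "what": 0.4},
        ("how",): {"are": 0.55, "is": 0.45},
        ("what",): {"is": 0.9, "are": 0.1},
    }
    # Any prefix not listed above ends the sentence.
    return table.get(prefix, {EOS: 1.0})


def greedy_decode(next_token, max_len=10):
    """Pick the single most probable token at every step."""
    prefix = ()
    for _ in range(max_len):
        dist = next_token(prefix)
        token = max(dist, key=dist.get)
        prefix += (token,)
        if token == EOS:
            break
    return list(prefix)


def beam_search(next_token, beam_width=2, max_len=10):
    """Keep the beam_width best partial hypotheses, scored by log-probability."""
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == EOS:
                candidates.append((prefix, score))  # finished hypotheses carry over
                continue
            for token, prob in next_token(prefix).items():
                if prob > 0:
                    candidates.append((prefix + (token,), score + math.log(prob)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(prefix[-1] == EOS for prefix, _ in beams):
            break
    return list(beams[0][0])


greedy_result = greedy_decode(toy_next_token)
beam_result = beam_search(toy_next_token, beam_width=2)
```

In this toy setup greedy decoding commits to "how" because it is the most probable first token, while beam search keeps "what" alive and ends up with a more probable full sequence. With `beam_width=1`, beam search reduces to greedy decoding.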
The project is based on PyTorch 1.11+ and Python 3.8+.
The package can be installed using pip:
pip install catbird
This does not include configuration files or tools and is not yet actively updated. Alternatively, you can run from the source code:
a. Clone the repository.
git clone https://github.com/AfonsoSalgadoSousa/catbird.git
b. Install dependencies.
This project uses Poetry as its package manager. Make sure you have it installed. For more info check Poetry's official documentation. To install dependencies, simply run:
poetry install
We have also compiled an environment.yml file with all the required dependencies to create an Anaconda environment. To do so, simply run:
conda env create -f environment.yml
For now, we support the Quora Question Pairs and MSCOCO datasets. It is recommended to download and extract the datasets somewhere outside the project directory and symlink the dataset root to $CATBIRD/data as below. If your folder structure is different, you may need to change the corresponding paths in the config files.
catbird
├── catbird
├── tools
├── configs
├── data
│ ├── quora
│ │ ├── quora_duplicate_questions.tsv
│ ├── mscoco
│ │ ├── captions_train2014.json
│ │ ├── captions_val2014.json
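If you prefer to create the symlink from Python rather than from the shell, a minimal sketch could look like the following. The helper name `link_dataset_root` and the example paths are hypothetical; the only assumption taken from the layout above is that the link lives at `data` inside the project root.

```python
from pathlib import Path


def link_dataset_root(dataset_root, project_root):
    """Create <project_root>/data as a symlink to an external dataset directory."""
    target = Path(dataset_root).resolve()
    link = Path(project_root) / "data"
    # Refuse to overwrite an existing directory, file, or (possibly broken) symlink.
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"{link} already exists")
    link.symlink_to(target, target_is_directory=True)
    return link


# Example (hypothetical paths):
# link_dataset_root("/mnt/datasets/paraphrase", "/home/me/catbird")
```

On Windows, creating symlinks may require elevated privileges or developer mode.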
Download Quora data HERE. Prepare Quora data by running:
poetry run python tools/preprocessing/create_data.py quora --root-path ./data/quora --out-dir ./data/quora
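For orientation, here is a rough sketch of how paraphrase pairs can be pulled out of the raw file. This is not the project's `create_data.py`: it only assumes the column layout of the public Quora release (`id`, `qid1`, `qid2`, `question1`, `question2`, `is_duplicate`), and the function name is hypothetical.

```python
import csv


def read_paraphrase_pairs(tsv_path):
    """Return (question1, question2) tuples for rows marked as duplicates."""
    pairs = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # Assumed header: id, qid1, qid2, question1, question2, is_duplicate
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            if row.get("is_duplicate") == "1":
                pairs.append((row["question1"], row["question2"]))
    return pairs
```

Only rows with `is_duplicate` equal to 1 are useful as paraphrase pairs; the remaining rows are questions judged to have different meanings.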
Download MSCOCO HERE, under the link '2014 Train/Val annotations'. Prepare MSCOCO data by running:
poetry run python tools/preprocessing/create_data.py mscoco --root-path ./data/mscoco --out-dir ./data/mscoco --split train
poetry run python tools/preprocessing/create_data.py mscoco --root-path ./data/mscoco --out-dir ./data/mscoco --split val
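As a rough picture of why MSCOCO works as a paraphrase dataset: each image comes with several human-written captions, and captions of the same image can be treated as paraphrases of each other. The sketch below is not the project's preprocessing script. It assumes the standard COCO captions format (an `annotations` list whose entries have `image_id` and `caption`), and pairing every two captions of an image is an illustrative choice, not necessarily what Catbird does.

```python
import json
from collections import defaultdict
from itertools import combinations


def caption_pairs(annotation_file):
    """Pair up captions that describe the same image."""
    with open(annotation_file, encoding="utf-8") as f:
        annotations = json.load(f)["annotations"]

    # Group captions by the image they describe.
    captions_by_image = defaultdict(list)
    for ann in annotations:
        captions_by_image[ann["image_id"]].append(ann["caption"].strip())

    # Assumption: every unordered pair of captions for one image is a paraphrase pair.
    pairs = []
    for captions in captions_by_image.values():
        pairs.extend(combinations(captions, 2))
    return pairs
```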
poetry run python tools/train.py ${CONFIG_FILE} [optional arguments]
Example:
- Train T5 on QQP.
poetry run python tools/train.py configs/t5_quora.yaml
This project borrowed ideas from the following open-source repositories: