AVATAR

Official code release of our work, AVATAR: A Parallel Corpus for Java-Python Program Translation.

Setup • Dataset • Models • Training & Evaluation • Benchmarks • License • Citation

📣 Notice related to a dataset bug (🐛) fix 👈

There was a major bug in the AVATAR dataset, as raised in this issue. We observed that during crawling from different sources, newlines were lost in many examples. In the Python data, indentation was also lost. As a result, the programs were not parseable. We re-crawled the data and ensured that every program we store is parseable. The 🐛 has been fixed, so you can continue using the dataset seamlessly.
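
To illustrate what "parseable" means for the Python side, here is a minimal sketch using Python's standard `ast` module. It is not the check used to build the dataset; the helper name `is_parseable_python` and the sample programs are hypothetical.

```python
import ast


def is_parseable_python(source: str) -> bool:
    """Return True if `source` parses as Python 3 code (hypothetical helper)."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        # SyntaxError covers malformed code; ValueError covers e.g. null bytes
        # on some Python versions.
        return False


# A well-formed program.
good = "def add(a, b):\n    return a + b\n"
# The same kind of program with newlines and indentation lost,
# similar to the damage described in the notice above.
bad = "def add(a, b): return a + b print(add(1, 2))"

print(is_parseable_python(good))  # True
print(is_parseable_python(bad))   # False
```

A comparable check for Java would need a Java parser (for example the tree-sitter grammar cloned in Setup), which this sketch does not cover.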

What is AVATAR?

AVATAR stands for jAVA-pyThon progrAm tRanslation.
AVATAR is a corpus of 9,515 programming problems and their solutions written in Java and Python.
AVATAR offers a collection of 3,391 parallel standalone functions; see details here.
AVATAR presents evaluation results of fine-tuned pre-trained LMs.
AVATAR performs execution-based evaluation of program translation; see details here.

Setup

conda create --name avatar_env python==3.8
conda activate avatar_env
pip install -r requirements.txt

mkdir -p third_party
cd third_party
git clone https://github.com/tree-sitter/tree-sitter-java.git
git clone https://github.com/tree-sitter/tree-sitter-python.git

# optional (for fp16 training)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
cd ..

# building tree-sitter library
python build.py

Dataset

The dataset details are provided here. To download the data, run:

cd data
bash download.sh

To prepare the data, we perform the following steps.

Remove docstrings, comments, etc.
Tokenize the programs with the baseline models' tokenizers.
Filter the data based on a length threshold (~512).
De-duplicate the data (remove examples that are duplicates).
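
The length filtering and de-duplication steps can be sketched as follows. This is only an illustration, not the logic in prepare.sh: the function name `filter_and_dedup` is hypothetical, whitespace splitting stands in for a model tokenizer, and treating "duplicate" as an exact match of both token sequences is an assumption.

```python
from typing import Iterable, List, Tuple

# One parallel example: (java_tokens, python_tokens).
Pair = Tuple[List[str], List[str]]


def filter_and_dedup(pairs: Iterable[Pair], max_len: int = 512) -> List[Pair]:
    """Drop pairs longer than `max_len` tokens on either side, then drop exact duplicates."""
    seen = set()
    kept = []
    for java_toks, py_toks in pairs:
        # Length filter (assumed to apply to both sides of the pair).
        if len(java_toks) > max_len or len(py_toks) > max_len:
            continue
        # Exact-match de-duplication on the token sequences.
        key = (tuple(java_toks), tuple(py_toks))
        if key in seen:
            continue
        seen.add(key)
        kept.append((java_toks, py_toks))
    return kept


pairs = [
    ("int x = 1 ;".split(), "x = 1".split()),
    ("int x = 1 ;".split(), "x = 1".split()),  # exact duplicate of the first pair
    (["tok"] * 600, ["tok"] * 10),             # Java side exceeds the threshold
]
print(len(filter_and_dedup(pairs)))  # 1
```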

If you want to run the preparation yourself, run:

cd data
bash prepare.sh

Models

We studied 11 models for program translation.

[Models trained from scratch]

Seq2Seq+Attn. [1Lx512H], Transformer [6Lx512H]

[Pre-trained models]

CodeGPT, CodeGPT-adapted, CodeBERT, GraphCodeBERT, PLBART, CodeT5, TransCoder, TransCoder-DOBF, TransCoder-ST

Training & Evaluation

To train and evaluate a model, go to the corresponding model directory and execute its script, as listed below.

# Seq2Seq+Attn, Transformer
cd seq2seq
bash rnn.sh GPU_ID SOURCE_LANG TARGET_LANG
bash transformer.sh GPU_ID SOURCE_LANG TARGET_LANG

# CodeBERT, GraphCodeBERT, CodeT5, PLBART
cd [codebert|graphcodebert|codet5|plbart]
bash run.sh GPU_ID SOURCE_LANG TARGET_LANG

# CodeGPT, CodeGPT-adapted
cd codegpt
bash run.sh GPU_ID SOURCE_LANG TARGET_LANG [CodeGPT|adaptedCodeGPT]

# TransCoder, TransCoder-DOBF, TransCoder-ST
cd transcoder
bash zero_shot.sh GPU_ID SOURCE_LANG TARGET_LANG [transcoder|transcoder-dobf|transcoder-st]

Here, SOURCE_LANG=[java|python] and TARGET_LANG=[java|python].
Download the pre-trained PLBART and TransCoder model checkpoints by running the download.sh script.

Benchmarks

We perform n-gram-based and execution-based evaluation of program and function translation.
We report the model performances in this spreadsheet.
For function translation error analysis, we categorize the errors; see details here.
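
As a rough illustration of execution-based evaluation, the sketch below runs a translated Python program on a test input and compares its standard output with the expected output. It is not the repository's evaluation script: the helper name `passes_test_case`, the timeout, and the whitespace-insensitive comparison are all assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_test_case(program: str, stdin_text: str, expected_stdout: str,
                     timeout: float = 10.0) -> bool:
    """Run `program` as a Python script and check its stdout (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(program)
        try:
            result = subprocess.run(
                [sys.executable, str(path)],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # A program that does not finish in time counts as a failure.
            return False
    # Assumed criterion: clean exit and matching output, ignoring outer whitespace.
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


# A hypothetical translated program that reads two integers and prints their sum.
translated = "a, b = map(int, input().split())\nprint(a + b)\n"
print(passes_test_case(translated, "2 3\n", "5\n"))  # True
```

Evaluating translations into Java would additionally require compiling and running the output with a JDK, which is outside the scope of this sketch.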

License

This dataset is licensed under a Creative Commons Attribution-ShareAlike 4.0 International license; see the LICENSE file for details.

Citation

@article{ahmad-etal-2021-avatar,
  title={AVATAR: A Parallel Corpus for Java-Python Program Translation},
  author={Ahmad, Wasi Uddin and Tushar, Md Golam Rahman and Chakraborty, Saikat and Chang, Kai-Wei},
  journal={arXiv preprint arXiv:2108.11590},
  year={2021}
}

