NaturalCC is a sequence modeling toolkit that allows researchers and developers to train custom models

Overview

NaturalCC

NaturalCC is a sequence modeling toolkit that allows researchers and developers to train custom models for many software engineering tasks, e.g., code summarization, code retrieval, code completion, code clone detection and type inference. Our vision is to bridge the gap between programming language and natural language through machine learning techniques.

Version Python pytorch license


Features

  • A collection of code corpus with data preprocessing
  • Performance benchmark
  • Mixed precision training
    • Nvidia APEX
    • Automatic Mixed Precision
  • Multi-GPU training
  • Better logging output
  • Various Implementations:
    • tensorflow gradient clipping
    • optimizers or learning schedulers
    • baseline models
    • binary data formats

🚀 Installation

Requirements

  • PyTorch version >= 1.6.0
  • Python version >= 3.6
  • GCC/G++ > 5.0
  • For training new models, you'll also need an NVIDIA GPU and NCCL
  • (optional) For faster training, you need to install NVIDIA's apex library.

1. Install prerequisite libraries

git clone https://github.com/xcodemind/naturalcc && cd naturalcc
pip install -r requirements.txt

Once you installed prerequisite libraries, you can check them via python -m env_test

2. Build or install NaturalCC

Export your NaturalCC cache directory (data and models will be saved in this directory) to user variables(~/.bashrc or ~/.zshrc).

> ~/.bashrc">
echo "export NCC=/data/ncc_data" >> ~/.bashrc

Note: PyCharm cannot get environment variables and, therefore, we recommend you to register your NCC variable at ncc/__init__.py.

Compile Cython files to accelerate programs and register NaturalCC into your pip list

# compile for debug
# python setup.py build_ext --inplace
# install 
pip install --editable ./

3. Half precision computation (optional)

NaturalCC supports half precision training.

  • If your Pytorch.__version__ < 1.6.0 and nvcc -V is runnable, please install apex.
  • Otherwise, use Automatic Mixed Precision (AMP). Available Now (set amp: 1 in yaml file, An example).

4. Install GCC/G++ with conda (if you do not have permission)

Since NCC is build via Cython, your GCC/G++ version should be greater than 4.9. If you have the root permission, update GCC/G++; otherwise, install GCC/G++ with conda.

# install GCC/G++ with conda
conda install -c anaconda gxx_linux-64
conda install -c conda-forge gcc_linux-64
cd ~/anaconda/envs/XXX/bin
ln -s x86_64-conda_cos6-linux-gnu-gcc gcc
ln -s x86_64-conda_cos6-linux-gnu-g++ g++
# check
conda deactivate
conda activate XXX
>> type "gcc/g++ -v" in terminals

📚 Dataset

Currently, we have processed the following datasets:

🤖 Implementations

Code retrieval (search)

Code completion

Heterogeneous mapping

Code summarization

📋 Experiments

Code Summarization

Dataset: Python (Wan et al.)

BLEU-4 METEOR ROUGE-L Cost Logs
Seq2Seq+Attn 25.57 14.40 39.41 0.09s/b click here
Tree2Seq+Attn 23.35 12.59 36.49 0.48s/b click here
Transformer 30.64 17.65 44.59 0.26s/b click here
Transformer+RPE 31.57 17.74 45.18 0.27s/b click here
PLBART 32.71 18.13 46.05 0.80s/b TBC

Code Retrieval

Dataset: CodeSearchNet (Husain et al.)

MRR Go Java JS PHP Python Ruby Cost Logs
NBOW 66.59 59.92 47.15 54.75 63.33 42.86 0.16s/b click here
ConV1d 70.87 60.49 38.81 61.92 67.29 36.53 0.30s/b click here
BiRNN 65.80 48.60 23.23 51.36 48.28 19.35 0.74s/b click here
SelfAttn 78.45 66.55 50.38 65.78 79.09 47.96 0.25s/b click here

Code Completion

Dataset: Py150 (official processed) (raw)

MRR Attr Num Name Param Tokens Cost Logs
LSTM 51.67 47.45 46.52 66.06 73.73 0.31s/b click here
GTP-2 70.37 62.20 63.84 73.54 82.17 0.43s/b click here
TravTrans 72.08 68.55 76.33 71.08 83.17 0.43s/b click here

Type Inference

Dataset: CodeSearchNet-Java (Husain et al.)

[email protected] (All types) [email protected] (All types) [email protected] (Any types) [email protected] (Any types) Cost Logs
DeepTyper 0.52 0.67 0.43 0.67 0.42s/b TBC
Transformer 0.32 0.64 0.37 0.75 0.85s/b TBC

Heterogeneous Mapping

Dataset: OpenCL (Grewe et al.)

Accuracy AMD NVIDIA
Static mapping 58.82 56.91
Decision tree 70.29 74.56
Inst2vec 82.79 81.76
DeepTune 83.24 80.15

🏫 Examples & Tutorials

All the running commands here should be executed in the root of project folder (the path of your naturalcc). For example, in my environment I will stay at /data/wanyao/Dropbox/ghproj-v100/naturalcc.

We also have more detailed READMEs to start your tutorial of NaturalCC.

Step 1: Download and process a dataset from datasets, and follow the instructions from the README.md file.

# ref: dataset/python_wan/README.md
# download dataset
bash dataset/python_wan/download.sh
# clean data
python -m dataset.python_wan.clean
# cast data attributes into different files
python -m dataset.python_wan.attributes_cast

# ref: dataset/python_wan/summarization/README.md
# save code tokens and docstirng tokens into MMAP format
python -m dataset.python_wan.summarization.preprocess

Step 2 (optional): Register your self-defined models

  • If you want to create a new model, please add your model at ncc/models and ncc/modules.

  • If your training policy are more complex than we thought, you should update your criterions and training procedure at ncc/criterions and ncc/trainers, respectively.

    Do not forget to update your self defined module at ncc/XX/__init__.py.

Step 3: Training and inference.

  • Select a task and a model from task list and follow the instructions in its README.md to start your learning.
# ref: run/summarization/transformer/README.md
# train
CUDA_VISIBLE_DEVICES=0,1,2,3 nohup python -m run.summarization.transformer.train -f config/python_wan/python > run/summarization/transformer/config/python_wan/python.log 2>&1 &
# inference
CUDA_VISIBLE_DEVICES=0 python -m run.summarization.transformer.eval -f config/python_wan/python -o run/summarization/transformer/config/python_wan/python.txt

FAQ

Please fell free to contact me if you have any troubles.

😘 License and Acknowledgement

NaturalCC is MIT-licensed. The license applies to the pre-trained models as well. This project is also highly inspired by Fairseq and AllenNLP.

🔗 Related Links

NaturalCC-demo
About us: XCodeMind

❤️ Citation

Please cite as:

under reviewing
Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs

Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs ArXiv Abstract Convolutional Neural Networks (CNNs) have become the de f

Philipp Benz 12 Oct 24, 2022
This is the official pytorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering" on VQA Task

🌈 ERASOR (RA-L'21 with ICRA Option) Official page of "ERASOR: Egocentric Ratio of Pseudo Occupancy-based Dynamic Object Removal for Static 3D Point C

Hyungtae Lim 225 Dec 29, 2022
A denoising diffusion probabilistic model synthesises galaxies that are qualitatively and physically indistinguishable from the real thing.

Realistic galaxy simulation via score-based generative models Official code for 'Realistic galaxy simulation via score-based generative models'. We us

Michael Smith 32 Dec 20, 2022
Tutorial: Introduction to Graph Machine Learning, with Jupyter notebooks

GraphMLTutorialNLDL22 Tutorial NLDL22: Introduction to Graph Machine Learning, with Jupyter notebooks This tutorial takes place during the conference

UiT Machine Learning Group 3 Jan 10, 2022
GrailQA: Strongly Generalizable Question Answering

GrailQA is a new large-scale, high-quality KBQA dataset with 64,331 questions annotated with both answers and corresponding logical forms in different syntax (i.e., SPARQL, S-expression, etc.). It ca

OSU DKI Lab 76 Dec 21, 2022
This is code of book "Learn Deep Learning with PyTorch"

深度学习入门之PyTorch Learn Deep Learning with PyTorch 非常感谢您能够购买此书,这个github repository包含有深度学习入门之PyTorch的实例代码。由于本人水平有限,在写此书的时候参考了一些网上的资料,在这里对他们表示敬意。由于深度学习的技术在

Xingyu Liao 2.5k Jan 04, 2023
Using Machine Learning to Test Causal Hypotheses in Conjoint Analysis

Readme File for "Using Machine Learning to Test Causal Hypotheses in Conjoint Analysis" by Ham, Imai, and Janson. (2022) All scripts were written and

0 Jan 27, 2022
An implementation for `Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction`

Text2Event An implementation for Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction Please contact Yaojie Lu (@

Roger 153 Jan 07, 2023
Fully convolutional deep neural network to remove transparent overlays from images

Fully convolutional deep neural network to remove transparent overlays from images

Marc Belmont 1.1k Jan 06, 2023
ScriptProfilerPy - Module to visualize where your python script is slow

ScriptProfiler helps you track where your code is slow It provides: Code lines t

Lucas BLP 3 Jun 02, 2022
Implementation of CSRL from the AAAI2022 paper: Constraint Sampling Reinforcement Learning: Incorporating Expertise For Faster Learning

CSRL Implementation of CSRL from the AAAI2022 paper: Constraint Sampling Reinforcement Learning: Incorporating Expertise For Faster Learning Python: 3

4 Apr 14, 2022
Unofficial PyTorch reimplementation of the paper Swin Transformer V2: Scaling Up Capacity and Resolution

PyTorch reimplementation of the paper Swin Transformer V2: Scaling Up Capacity and Resolution [arXiv 2021].

Christoph Reich 122 Dec 12, 2022
Code for "Adversarial Attack Generation Empowered by Min-Max Optimization", NeurIPS 2021

Min-Max Adversarial Attacks [Paper] [arXiv] [Video] [Slide] Adversarial Attack Generation Empowered by Min-Max Optimization Jingkang Wang, Tianyun Zha

Jingkang Wang 12 Nov 23, 2022
DANet for Tabular data classification/ regression.

Deep Abstract Networks A pyTorch implementation for AAAI-2022 paper DANets: Deep Abstract Networks for Tabular Data Classification and Regression. Bri

Ronnie Rocket 55 Sep 14, 2022
General Multi-label Image Classification with Transformers

General Multi-label Image Classification with Transformers Jack Lanchantin, Tianlu Wang, Vicente Ordóñez Román, Yanjun Qi Conference on Computer Visio

QData 154 Dec 21, 2022
the official code for ICRA 2021 Paper: "Multimodal Scale Consistency and Awareness for Monocular Self-Supervised Depth Estimation"

G2S This is the official code for ICRA 2021 Paper: Multimodal Scale Consistency and Awareness for Monocular Self-Supervised Depth Estimation by Hemang

NeurAI 4 Jul 27, 2022
YOLTv4 builds upon YOLT and SIMRDWN, and updates these frameworks to use the most performant version of YOLO, YOLOv4

YOLTv4 builds upon YOLT and SIMRDWN, and updates these frameworks to use the most performant version of YOLO, YOLOv4. YOLTv4 is designed to detect objects in aerial or satellite imagery in arbitraril

Adam Van Etten 161 Jan 06, 2023
Advanced yabai wooting scripts

Yabai Wooting scripts Installation requirements Both https://github.com/xiamaz/python-yabai-client and https://github.com/xiamaz/python-wooting-rgb ne

Max Zhao 3 Dec 31, 2021
Freecodecamp Scientific Computing with Python Certification; Solution for Challenge 2: Time Calculator

Assignment Write a function named add_time that takes in two required parameters and one optional parameter: a start time in the 12-hour clock format

Hellen Namulinda 0 Feb 26, 2022
null

DeformingThings4D dataset Video | Paper DeformingThings4D is an synthetic dataset containing 1,972 animation sequences spanning 31 categories of human

208 Jan 03, 2023