GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model

Last update: Jan 01, 2023

Overview

GPT-Code-Clippy (GPT-CC)

Please refer to our new GitHub Wiki, which documents in detail our efforts to create an open source version of GitHub Copilot.

Courtesy of the awesome Aimee Trevett!

Introduction

GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model (based on GPT-3 and called GPT-Codex) that is fine-tuned on publicly available code from GitHub.

Datasets

The dataset used to train GPT-CC is obtained from SEART GitHub Search using the following criteria:

>10 GitHub stars
>2 commits
Must have a licence
Exclude forks
Size < 70708 bytes
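As an illustration, the criteria above can be written as a single predicate. This is a minimal sketch and not the project's actual filtering code: the `Repo` record and its field names are hypothetical, and SEART GitHub Search applies such filters through its own interface.

```python
from dataclasses import dataclass
from typing import Optional

MAX_SIZE_BYTES = 70708  # size threshold quoted in the criteria above


@dataclass
class Repo:
    """Hypothetical repository record; the field names are assumptions."""
    name: str
    stars: int
    commits: int
    license: Optional[str]
    is_fork: bool
    size_bytes: int


def passes_criteria(repo: Repo) -> bool:
    """Return True if the repository meets all five selection criteria."""
    return (
        repo.stars > 10
        and repo.commits > 2
        and repo.license is not None
        and not repo.is_fork
        and repo.size_bytes < MAX_SIZE_BYTES
    )
```

All thresholds are strict inequalities, so a repository with exactly 10 stars or exactly 70708 bytes is rejected.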

These repositories are then combined with all of the GitHub repositories contained in The Pile.

The repositories are then filtered for duplicate files. Filtering is performed by applying a regular expression to each file in each repository to obtain a list of "variables" (the tokens which contain only alphanumeric characters) and then filtering out any files which contain the same sequence of "variables". The deduplication script is available here.
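The idea behind this deduplication can be sketched as follows. This is not the project's script: the regular expression, the function names, and the choice to keep the first file seen are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple

# Tokens made up only of alphanumeric characters; this pattern is an
# assumed reading of what the text calls "variables".
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def variable_sequence(source: str) -> Tuple[str, ...]:
    """Return the ordered sequence of alphanumeric tokens in a file."""
    return tuple(TOKEN_RE.findall(source))


def deduplicate(files: Dict[str, str]) -> List[str]:
    """Return the paths to keep, dropping any file whose token
    sequence matches that of a file seen earlier."""
    seen = set()
    kept = []
    for path, source in files.items():
        key = variable_sequence(source)
        if key in seen:
            continue  # same "variables" in the same order: a duplicate
        seen.add(key)
        kept.append(path)
    return kept
```

Because only the token sequence is compared, two files that differ in whitespace, punctuation, or operators but share the same tokens in the same order are treated as duplicates.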

The final dataset is available here. The dataset without the duplicates filtered out is also available here.

The datasheet, which discusses the construction, usage, and limitations of the dataset in more detail, can be found here. We hope to get it officially into Hugging Face's datasets library soon!

Models

The GPT-CC models are fine-tuned versions of GPT-2 and GPT-Neo.

The available models can be found here.

The ones that perform relatively well (none improve on the standard GPT-Neo 125M model, except for the APPS-specific models, and those only on the APPS task):

TODO: which is the recommended model?

Training

Training is done using the training scripts available here.

To fine-tune GPT-Neo 125M on the CodeClippy dataset, we used the AdamW optimizer (beta1=0.9, beta2=0.95) with a GPT-3-like learning rate schedule (4k warmup steps from 0 to 5e-5, followed by 50k cosine decay steps to 5e-6), weight decay 0.1, batch size 1024, and sequence length 2048. The relatively large batch size and the low learning rate with a long warmup were chosen to avoid aggressive updates and to preserve the knowledge contained in the pretrained GPT-Neo weights.
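The schedule described above can be sketched as a plain function of the step count. This is an illustration only and not the project's training code: the warmup is assumed to be linear, the rate is assumed to stay at its final value once the decay ends, and the function name is hypothetical.

```python
import math

PEAK_LR = 5e-5         # learning rate reached at the end of warmup
FINAL_LR = 5e-6        # learning rate reached at the end of cosine decay
WARMUP_STEPS = 4_000
DECAY_STEPS = 50_000


def learning_rate(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR, then cosine decay to FINAL_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to 1 (assumption:
    # the rate is held at FINAL_LR after the decay phase ends).
    progress = min((step - WARMUP_STEPS) / DECAY_STEPS, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine
```

Halfway through the decay phase (step 29,000) the rate sits midway between the peak and final values, at 2.75e-5.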

To fine-tune GPT-Neo 125M on the APPS dataset, we used the AdamW optimizer (beta1=0.9, beta2=0.98) with a linear learning rate schedule (800 warmup steps from 0 to the peak LR, followed by linear decay to 0; peak LR values ranged over [1e-5, 1e-4]), weight decay 0.1, batch size 256, and sequence length 1024. We trained the model for 5 epochs and selected the best checkpoint by validation loss. The language modelling objective for the APPS dataset is modified to backpropagate the loss only for the tokens corresponding to the code solution (refer to Hendrycks et al. for more details).
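One common way to restrict the loss to the solution tokens is to mask the labels of the problem statement. The sketch below assumes that approach; the source does not say how the project implemented it, and the function name is hypothetical. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions carrying that label contribute nothing to the loss.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_inputs_and_labels(
    prompt_ids: List[int], solution_ids: List[int]
) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and solution token ids, and mask the prompt
    positions in the labels so only solution tokens are scored."""
    input_ids = list(prompt_ids) + list(solution_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(solution_ids)
    return input_ids, labels
```

The model still attends to the full problem statement as context; only the gradient signal is limited to the solution.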

To fine-tune GPT-Neo 1.3B on the APPS dataset, we used the Adafactor optimizer with a linear learning rate schedule (5k warmup steps from 0 to 2e-5, followed by linear decay to 0), weight decay 0.1, batch size 24, and sequence length 1024. The choice of hyperparameters for the 1.3B model was determined in part by hardware limitations. We trained the model for 5 epochs and selected the best checkpoint by validation loss.
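Both APPS runs use a linear schedule with warmup, which can be sketched as below. The total number of training steps is not given in the text, so `total_steps` is a hypothetical parameter, as is the function name.

```python
def linear_schedule(
    step: int, peak_lr: float, warmup_steps: int, total_steps: int
) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay to 0."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Steps left until the end of training, never negative.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

For the 1.3B run this would be called with `peak_lr=2e-5` and `warmup_steps=5000`; for the 125M run, with `warmup_steps=800` and a peak taken from the range quoted above.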

TODO: which is the recommended way to train GPT-CC?

Evaluation

The models are also evaluated on the APPS and HumanEval datasets.

Human Eval Results

Model	pass@1	pass@2	pass@5	pass@10
EleutherAI/gpt-neo	0.12%	0.24%	0.61%	1.22%
gpt-neo-125M-apps	0.06%	0.12%	0.30%	0.61%
dedup-filtered-no-resize-2048bs	0.00%	0.00%	0.00%	0.00%
1024-filtered	0.00%	0.00%	0.00%	0.00%
dedup-2048	0.00%	0.00%	0.00%	0.00%
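HumanEval results are commonly reported as pass@k, the probability that at least one of k sampled completions passes the unit tests. The source does not say how the figures above were computed. The sketch below shows the unbiased estimator introduced alongside HumanEval (Chen et al., 2021), where n is the number of samples generated per problem and c the number that pass; the function name is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one problem, given n samples
    of which c are correct: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark score is the mean of this value over all problems. A model that never produces a passing sample scores 0 for every k.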

APPS Eval Results

Coming soon...

Demo

A Visual Studio Code extension which uses the Hugging Face Inference API is available and can be found here.

We also have a Hugging Face Spaces demo where you can specify a problem in the format of a programming competition question.

TODO: more information about this when complete.

Acknowledgements

Special thanks to our contributors!!

https://github.com/arampacha
https://github.com/ncoop57
https://github.com/bentrevett
https://github.com/arunraja-hub
https://github.com/reshinthadithyan
https://github.com/shpotes
https://github.com/neubig
https://github.com/Mrinal18
and everyone else that helped out the project!

Owner

Nathan Cooper
