Contains the code and data for our #ICSE2022 paper, "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences"

Last update: Oct 31, 2022

Overview

CodeFill

This repository contains the code for our paper "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences" (DOI: 10.1145/3510003.3510172). The paper is authored by Maliheh Izadi, Roberta Gismondi, and Georgios Gousios and was accepted for publication at #ICSE2022.

Abstract

Code completion is an essential feature of IDEs, yet current autocompleters are restricted to either grammar-based or NLP-based single token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically-typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer's code context.

In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code, respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single token prediction (MRR: 70.9% vs. 66.2% and 67.8%) and outperforms the state of the art for multi-token prediction (ROUGE-L: 63.7% vs. 52.4% and 59.2%, for n=4 tokens). We publicly release our source code and datasets.
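To make the idea of parallel input sequences more concrete, the sketch below derives two aligned sequences from a Python snippet: the token strings ("names") and a coarse type label for each token. This is only an illustration and not the preprocessing used in the paper. It relies on the standard-library `tokenize` module, so the type labels are lexical token types (`NAME`, `OP`, `NUMBER`, ...) rather than the AST token types CodeFill consumes. The function name `parallel_sequences` and the choice of which layout tokens to skip are assumptions made for this example.

```python
import io
import token
import tokenize

# Layout-only tokens that carry no name/type information in this sketch.
# Which tokens to drop is an assumption, not taken from the paper.
_SKIPPED = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.COMMENT,
    tokenize.ENDMARKER,
}


def parallel_sequences(source):
    """Return two aligned lists: token strings and their lexical type labels.

    Hypothetical helper for illustration only. It uses lexical token
    types from `tokenize` as a stand-in for AST token types.
    """
    names, types = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        names.append(tok.string)                # e.g. "foo"
        types.append(token.tok_name[tok.type])  # e.g. "NAME"
    return names, types


if __name__ == "__main__":
    names, types = parallel_sequences("x = foo(1)\n")
    for name, type_label in zip(names, types):
        print(f"{type_label:8s} {name}")
```

For the input `x = foo(1)`, the two lists are `["x", "=", "foo", "(", "1", ")"]` and `["NAME", "OP", "NAME", "OP", "NUMBER", "OP"]`. A model along the lines described above would receive both sequences side by side, one per input stream.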

Data

Our datasets are available on the Hugging Face Hub.


Owner

Software Analytics Lab
