Code and data for our ICSE 2022 paper, "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences".

Last update: Oct 31, 2022

Overview

CodeFill

This repository contains the code for our paper, "CodeFill: Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences", DOI: 10.1145/3510003.3510172. The paper is authored by Maliheh Izadi, Roberta Gismondi, and Georgios Gousios, and it has been accepted for publication at ICSE 2022.

Abstract

Code completion is an essential feature of IDEs, yet current autocompleters are restricted to either grammar-based or NLP-based single-token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically-typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer's code context.

In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code, respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single token prediction (MRR: 70.9% vs. 66.2% and 67.8%) and outperforms the state of the art for multi-token prediction (ROUGE-L: 63.7% vs. 52.4% and 59.2%, for n=4 tokens). We publicly release our source code and datasets.
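For readers unfamiliar with the single-token metric quoted above, the following is a minimal sketch of how Mean Reciprocal Rank (MRR) is commonly computed for ranked completion candidates. It is not taken from the CodeFill codebase; the function name `mean_reciprocal_rank` and the example data are hypothetical, and it assumes each query has exactly one correct token.

```python
from typing import List, Sequence


def mean_reciprocal_rank(
    ranked_candidates: Sequence[Sequence[str]],
    targets: Sequence[str],
) -> float:
    """Average of 1/rank of the correct token over all queries.

    A query whose target does not appear among its candidates
    contributes 0. Names and data layout here are illustrative
    assumptions, not the paper's evaluation code.
    """
    if len(ranked_candidates) != len(targets):
        raise ValueError("need one target per candidate list")
    if not targets:
        return 0.0

    total = 0.0
    for candidates, target in zip(ranked_candidates, targets):
        for rank, token in enumerate(candidates, start=1):
            if token == target:
                total += 1.0 / rank
                break  # only the first (best-ranked) hit counts
    return total / len(targets)


# Hypothetical completions for three cursor positions.
predictions: List[List[str]] = [
    ["append", "add", "extend"],   # target ranked 1st
    ["self", "cls", "args"],       # target ranked 2nd
    ["open", "read", "close"],     # target missing
]
gold = ["append", "cls", "write"]
score = mean_reciprocal_rank(predictions, gold)
```

In this example the reciprocal ranks are 1, 1/2, and 0, so `score` is 0.5. A reported MRR of 70.9% corresponds to a value of 0.709 on this scale.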

Data

Our datasets are available on the Hugging Face Hub.

Owner

Software Analytics Lab
