Semantic code search implementation using Tensorflow framework and the source code data from the CodeSearchNet project

Overview

Semantic Code Search

Semantic code search implementation using Tensorflow framework and the source code data from the CodeSearchNet project. The model training pipeline was based on the implementation in CodeSearchNet repository. Python, Java, Go, Php, Javascript, and Ruby programming language are supported.

Model Description

BPE tokenizer is used to encode both code strings and query strings(docstrings are used as queries in training). Code strings are padded and encoded to the length of 200 tokens. Query strings are padded and encoded to the length of 30 tokens. Both code embedding size and query embedding size are 256. Token embeddings are masked and then an unweighted mean is performed to get a vector with 256 dimensions for code strings and query strings. Cosine similarity is calculated between the code representations and the query representations. Further details can be found on the WANDB run

Model Structure

  • Deep Structured Semantic Model
  • Wide & Deep Learning

Project Structure

Python package with scripts to prepare the data, train/test the model and predict.

Data

We use the data from the CodeSearchNet project. The downloaded data is around 20GB.

Training the model

To install the reqiured dependencies

pip3 install -r requirements.txt

Preparing data

Data preparation step is seperated from the training step because of computing time and memory consumption.

Training and evaluation

Start the training

python3 -m train --model neuralbow_v1

The model will be trained for each language. The evaluation metric is MRR for validation and test sets, however, the output of prediction will be evaluated by GitHub using nDCG.

Model Downloads

Query the trained model

Predict

python3 predict.py -r wuchen/SemanticCodeSearch/1fpfl6dq

Online Semantic Code Search website

  • Requirements: Flask
  • Import source code file
  • Running the dev server

Model Server

python3 -m server.main

Web Front-End

cd react-code-search
npm install
npm start
Owner
Chen Wu
More is different.
Chen Wu
Repo for paper "Dynamic Placement of Rapidly Deployable Mobile Sensor Robots Using Machine Learning and Expected Value of Information"

Repo for paper "Dynamic Placement of Rapidly Deployable Mobile Sensor Robots Using Machine Learning and Expected Value of Information" Notes I probabl

Berkeley Expert System Technologies Lab 0 Jul 01, 2021
J.A.R.V.I.S is an AI virtual assistant made in python.

J.A.R.V.I.S is an AI virtual assistant made in python. Running JARVIS Without Python To run JARVIS without python: 1. Head over to our installation pa

somePythonProgrammer 16 Dec 29, 2022
VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition Usage First, install PyTorch 1.7.1+, torchvision 0.8.2

40 Dec 12, 2022
The dynamics of representation learning in shallow, non-linear autoencoders

The dynamics of representation learning in shallow, non-linear autoencoders The package is written in python and uses the pytorch implementation to ML

Maria Refinetti 4 Jun 08, 2022
Google Landmark Recogntion and Retrieval 2021 Solutions

Google Landmark Recogntion and Retrieval 2021 Solutions In this repository you can find solution and code for Google Landmark Recognition 2021 and Goo

Vadim Timakin 5 Nov 25, 2022
TensorFlow (Python) implementation of DeepTCN model for multivariate time series forecasting.

DeepTCN TensorFlow TensorFlow (Python) implementation of multivariate time series forecasting model introduced in Chen, Y., Kang, Y., Chen, Y., & Wang

Flavia Giammarino 21 Dec 19, 2022
Semantic Segmentation in Pytorch. Network include: FCN、FCN_ResNet、SegNet、UNet、BiSeNet、BiSeNetV2、PSPNet、DeepLabv3_plus、 HRNet、DDRNet

🚀 If it helps you, click a star! ⭐ Update log 2020.12.10 Project structure adjustment, the previous code has been deleted, the adjustment will be re-

Deeachain 269 Jan 04, 2023
PyTorch implementation of our ICCV 2021 paper Intrinsic-Extrinsic Preserved GANs for Unsupervised 3D Pose Transfer.

Unsupervised_IEPGAN This is the PyTorch implementation of our ICCV 2021 paper Intrinsic-Extrinsic Preserved GANs for Unsupervised 3D Pose Transfer. Ha

25 Oct 26, 2022
This repository is related to an Arabic tutorial, within the tutorial we discuss the common data structure and algorithms and their worst and best case for each, then implement the code using Python.

Data Structure and Algorithms with Python This repository is related to the Arabic tutorial here, within the tutorial we discuss the common data struc

Mohamed Ayman 33 Dec 02, 2022
This repository contains Prior-RObust Bayesian Optimization (PROBO) as introduced in our paper "Accounting for Gaussian Process Imprecision in Bayesian Optimization"

Prior-RObust Bayesian Optimization (PROBO) Introduction, TOC This repository contains Prior-RObust Bayesian Optimization (PROBO) as introduced in our

Julian Rodemann 2 Mar 19, 2022
STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech

STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech Keon Lee, Ky

Keon Lee 114 Dec 12, 2022
Opinionated code formatter, just like Python's black code formatter but for Beancount

beancount-black Opinionated code formatter, just like Python's black code formatter but for Beancount Try it out online here Features MIT licensed - b

Launch Platform 16 Oct 11, 2022
Official Implementation of VAT

Semantic correspondence Few-shot segmentation Cost Aggregation Is All You Need for Few-Shot Segmentation For more information, check out project [Proj

Hamacojr 114 Dec 27, 2022
N-Person-Check-Checker-Splitter - A calculator app use to divide checks

N-Person-Check-Checker-Splitter This is my from-scratch programmed calculator ap

2 Feb 15, 2022
Python Rapid Artificial Intelligence Ab Initio Molecular Dynamics

Python Rapid Artificial Intelligence Ab Initio Molecular Dynamics

14 Nov 06, 2022
A synthetic texture-invariant dataset for object detection of UAVs

A synthetic dataset for object detection of UAVs This repository contains a synthetic datasets accompanying the paper Sim2Air - Synthetic aerial datas

LARICS Lab 10 Aug 13, 2022
Certifiable Outlier-Robust Geometric Perception

Certifiable Outlier-Robust Geometric Perception About This repository holds the implementation for certifiably solving outlier-robust geometric percep

83 Dec 31, 2022
An end-to-end regression problem of predicting the price of properties in Bangalore.

Bangalore-House-Price-Prediction An end-to-end regression problem of predicting the price of properties in Bangalore. Deployed in Heroku using Flask.

Shruti Balan 1 Nov 25, 2022
Implementation of the paper: "SinGAN: Learning a Generative Model from a Single Natural Image"

SinGAN This is an unofficial implementation of SinGAN from someone who's been sitting right next to SinGAN's creator for almost five years. Please ref

35 Nov 10, 2022