Introduction

Trex is a tool to match semantically similar functions based on transfer learning.

Installation

We recommend using conda to set up the environment and install the required packages.

First, create the conda environment,

conda create -n trex python=3.8 numpy scipy scikit-learn requests

and activate the conda environment:

conda activate trex

Then, install the latest PyTorch (assuming you have a GPU):

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
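
You can quickly verify that the installed build can see the GPU, e.g. from a Python shell:

import torch
print(torch.__version__)
print(torch.cuda.is_available())  # expect True if CUDA is set up correctly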

Enter the trex root directory (e.g., path/to/trex) and install trex:

pip install --editable .

For large datasets, install PyArrow:

pip install pyarrow

For faster training, install NVIDIA's apex library:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

Preparation

Pretrained models:

Create the checkpoints directory and the checkpoints/pretrain subdirectory in path/to/trex:

mkdir -p checkpoints/pretrain

Download our pretrained weight parameters and put them in checkpoints/pretrain.

Sample data for finetuning similarity

We provide sample training/testing files for finetuning in data-src/similarity. If you want to prepare the finetuning data yourself, make sure you follow the format shown in data-src/similarity (coming soon: tokenization script).

The data has to be binarized before it can be used for training. To binarize the training data for finetuning, run:

python command/finetune/preprocess.py

The binarized training data, ready for finetuning (for detecting similarity), will be stored at data-bin/similarity.
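
As a quick sanity check, you can list what the preprocessing step produced (the exact file names depend on preprocess.py, so treat this as illustrative):

from pathlib import Path

# Print whatever binarized files the preprocessing step wrote out.
for f in sorted(Path('data-bin/similarity').iterdir()):
    print(f.name)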

Training

To finetune the model, run:

./command/finetune/finetune.sh

The script loads the pretrained weight parameters from checkpoints/pretrain/ and finetunes the model.
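
Since trex builds on fairseq (the installation steps above mirror fairseq's), a finetuned checkpoint can typically be loaded with fairseq's from_pretrained helper. The model class, checkpoint directory, and file name below are assumptions; adjust them to whatever finetune.sh actually saves:

from fairseq.models.roberta import RobertaModel  # assumption: trex exposes a RoBERTa-style model class

# Hypothetical paths: point these at the checkpoint and data that finetune.sh produced.
model = RobertaModel.from_pretrained(
    'checkpoints/similarity',              # directory containing the finetuned checkpoint (assumed)
    checkpoint_file='checkpoint_best.pt',  # fairseq's default best-checkpoint name
    data_name_or_path='data-bin/similarity',
)
model.eval()  # disable dropout for deterministic inference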

Sample data for pretraining on micro-traces

We also provide 10K samples and scripts to demonstrate how to pretrain the model. To binarize the training data for pretraining, run:

python command/pretrain/preprocess_pretrain_10k.py

The binarized training data, ready for pretraining, will be stored at data-bin/pretrain_10k.

To pretrain the model, run:

./command/pretrain/pretrain_10k.sh

The pretrained model will be checkpointed at checkpoints/pretrain_10k.

Dataset

We put our dataset here.
