Pytorch implementation of paper "Efficient Nearest Neighbor Language Models" (EMNLP 2021)

Overview

Efficient Nearest Neighbor Language Models

This is implementation of the paper:

Efficient Nearest Neighbor Language Models
Junxian He, Graham Neubig, Taylor Berg-Kirkpatrick
EMNLP 2021

This repo implements several techniques to speed up the evaluation of non-parametric, nearest neighbor language models. Specifically, we improve the efficiency along three axes: adaptive retrieval, datastore prunning, and dimension reduction.

Install Dependencies

This repository is largly based on the knnlm repo which is a fork of Fairseq (commit da544b). Please use the exact commit page to determine software requirements for using this code.

git clone [email protected]:jxhe/efficient-knnlm.git

cd efficient-knnlm
pip install --editable .
pip install faiss

Hardware

Experiments for this paper were conducted on machines that contain 32 CPUs, 100GB of RAM, and one NVIDIA 3090 24GB GPU. Saving the Wikitext-103 datastore requires 200GB of disk space. Note that the number of CPUs has a great impact on the speed.

Running Efficient kNNLM

Preparation

Data

We share Fairseq's instructions on how to prepare the data here.

mkdir -p datasets/wikitext-103
cp examples/language_model/wikitext-103/prepare-wikitext-103.sh datasets/wikitext-103

cd datasets/wikitext-103
bash prepare-wikitext-103.sh
cd ../..

TEXT=datasets/wikitext-103
python preprocess.py \
    --only-source \
    --trainpref $TEXT/wiki.train.tokens \
    --validpref $TEXT/wiki.valid.tokens \
    --testpref $TEXT/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20

Download the language model checkpoint pretrained on WikiText-103

# the model checkpoint link is from the knnlm repo
wget https://nlp.stanford.edu/projects/knnlm/wt103_checkpoint_best.pt -P knnlm_ckpt

Save the datastore

mkdir -p dstore

python eval_lm.py data-bin/wikitext-103 \
    --path knnlm_ckpt/checkpoint_best.pt \
    --sample-break-mode none --max-tokens 3072 \
    --softmax-batch 1024 --gen-subset train \
    --context-window 1536 --tokens-per-sample 1536 \
    --dstore-mmap dstore/dstore --knn-keytype 'last_ffn_input' \
    --dstore-size 103225485 --model-overrides "{'knn_keytype': 'last_ffn_input'}" \
    --save-knnlm-dstore --fp16 --dstore-fp16

Dimension Reduction

# the script applies PCA of dimension 512 by default 
# the PCA hyperparameter can be tuned in this script
# set pca=0 to revert back to the vanilla version
bash ef_knnlm/build_faiss.sh

The faiss index is saved into dstore. Try it out:

bash ef_knnlm/utils_cmd/eval_knnlm.sh \
    -d wikitext-103 \
    -s valid \
    -p dstore/dstore_size103225485_embed1024_fp16 \
    -i dstore/knn.103225485.pca512.m64.index \
    -n 103225485 \

You should already observe a speedup.

Adaptive Retrieval

prepare heldout data to train the retrieval adaptor

# this randomly selects 90% of validation data as the training data to 
# train the retrieval adaptor
bash ef_knnlm/adaptive_retrieval/prepare_heldout.sh wikitext-103

prepare features

bash ef_knnlm/adaptive_retrieval/prepare_feature_pipeline.sh

train

bash ef_knnlm/adaptive_retrieval/train_ar.sh

It saves the retrieval adaptor checkpoints into checkpoint/wikitext-103-valid

evaluation

# the cutoff ratio in adaptive retrieval
# by default we cut off half of the retrieval
cutoff=50

# please change this to the .pt file path observed from the last step
ar_ckpt=xxx

# this hyperparameter needs to be changed if 
# the datastore sizes change (e.g. datastore pruning)
size=103225485

dstore_prefix=dstore/dstore_size${size}_embed1024_fp16
index_file=dstore/knn.${size}.pca512.m64.index

bash ef_knnlm/utils_cmd/eval_knnlm.sh \
    -d wikitext-103 \
    -s test \
    -p ${dstore_prefix} \
    -i ${index_file} \
    -c knnlm_ckpt/wt103_checkpoint_best.pt \
    -n ${size} \
    -f datasets/wikitext-103 \
    -a ctxt,freq,lm_ent,lm_max,fert \
    -u ${cutoff} \
    -h ${ar_ckpt} \
    # -w "True"

Datastore Pruning

precompute all the retrieval results for every record in the datastore:

# It is possible to parallel this operation by change 
# "--start-point" and "--num" arguments so that the training
# data would be splitted into multiple smaller ones. In this case
# the retrieval results would be saved into multiple files
bash ef_knnlm/dstore_compression/save_retrieval_results.sh

The retrieval results are saved into dstore/greedy_merge, other datastore pruning algorithms may be played around using these pre-computed results.

greedy merging

# perform greedy merging to yield a new smaller datastore, 
# and build faiss index from the new datastore
bash ef_knnlm/dstore_compression/merge_compression.sh

The pruned datastore and index are saved into dstore/greedy_merging, replace the previousdstore_prefix/index_file with the new ones to use the pruned the datastore. The option -w "True"needs to be passed to eval_knnlm.sh to read the generated datastore weights file from greedy merging.

Reference

@inproceedings{he2021eff,
title={Efficient Nearest Neighbor Language Models},
author={Junxian He and Graham Neubig and Taylor Berg-Kirkpatrick},
booktitle={Proceedings of EMNLP},
year={2021}
}
Owner
Junxian He
NLP/ML PhD student at CMU
Junxian He
Code that accompanies the paper Semi-supervised Deep Kernel Learning: Regression with Unlabeled Data by Minimizing Predictive Variance

Semi-supervised Deep Kernel Learning This is the code that accompanies the paper Semi-supervised Deep Kernel Learning: Regression with Unlabeled Data

58 Oct 26, 2022
Point-NeRF: Point-based Neural Radiance Fields

Point-NeRF: Point-based Neural Radiance Fields Project Sites | Paper | Primary c

Qiangeng Xu 662 Jan 01, 2023
Official implementation of Self-supervised Image-to-text and Text-to-image Synthesis

Self-supervised Image-to-text and Text-to-image Synthesis This is the official implementation of Self-supervised Image-to-text and Text-to-image Synth

6 Jul 31, 2022
HiFT: Hierarchical Feature Transformer for Aerial Tracking (ICCV2021)

HiFT: Hierarchical Feature Transformer for Aerial Tracking Ziang Cao, Changhong Fu, Junjie Ye, Bowen Li, and Yiming Li Our paper is Accepted by ICCV 2

Intelligent Vision for Robotics in Complex Environment 55 Nov 23, 2022
Novel and high-performance medical image classification pipelines are heavily utilizing ensemble learning strategies

An Analysis on Ensemble Learning optimized Medical Image Classification with Deep Convolutional Neural Networks Novel and high-performance medical ima

14 Dec 18, 2022
Manage the availability of workspaces within Frappe/ ERPNext (sidebar) based on user-roles

Workspace Permissions Manage the availability of workspaces within Frappe/ ERPNext (sidebar) based on user-roles. Features Configure foreach workspace

Patrick.St. 18 Sep 26, 2022
PyTorch Implementation of CycleGAN and SSGAN for Domain Transfer (Minimal)

MNIST-to-SVHN and SVHN-to-MNIST PyTorch Implementation of CycleGAN and Semi-Supervised GAN for Domain Transfer. Prerequites Python 3.5 PyTorch 0.1.12

Yunjey Choi 401 Dec 30, 2022
The FIRST GANs-based omics-to-omics translation framework

OmiTrans Please also have a look at our multi-omics multi-task DL freamwork 👀 : OmiEmbed The FIRST GANs-based omics-to-omics translation framework Xi

Xiaoyu Zhang 6 Dec 14, 2022
A (PyTorch) imbalanced dataset sampler for oversampling low frequent classes and undersampling high frequent ones.

Imbalanced Dataset Sampler Introduction In many machine learning applications, we often come across datasets where some types of data may be seen more

Ming 2k Jan 08, 2023
pytorch, hand(object) detect ,yolo v5,手检测

YOLO V5 物体检测,包括手部检测。 项目介绍 手部检测 手部检测示例如下 : 视频示例: 项目配置 作者开发环境: Python 3.7 PyTorch = 1.5.1 数据集 手部检测数据集 该项目数据集采用 TV-Hand 和 COCO-Hand (COCO-Hand-Big 部分) 进

Eric.Lee 11 Dec 20, 2022
I3-master-layout - Simple master and stack layout script

Simple master and stack layout script | ------ | ----- | | | | | Ma

Tobias S 18 Dec 05, 2022
LaneAF: Robust Multi-Lane Detection with Affinity Fields

LaneAF: Robust Multi-Lane Detection with Affinity Fields This repository contains Pytorch code for training and testing LaneAF lane detection models i

155 Dec 17, 2022
Request execution of Galaxy SARS-CoV-2 variation analysis workflows on input data you provide.

SARS-CoV-2 processing requests Request execution of Galaxy SARS-CoV-2 variation analysis workflows on input data you provide. Prerequisites This autom

useGalaxy.eu 17 Aug 13, 2022
OptNet: Differentiable Optimization as a Layer in Neural Networks

OptNet: Differentiable Optimization as a Layer in Neural Networks This repository is by Brandon Amos and J. Zico Kolter and contains the PyTorch sourc

CMU Locus Lab 428 Dec 24, 2022
Cross-view Transformers for real-time Map-view Semantic Segmentation (CVPR 2022 Oral)

Cross View Transformers This repository contains the source code and data for our paper: Cross-view Transformers for real-time Map-view Semantic Segme

Brady Zhou 363 Dec 25, 2022
A fast and easy to use, moddable, Python based Minecraft server!

PyMine PyMine - The fastest, easiest to use, Python-based Minecraft Server! Features Note: This list is not always up to date, and doesn't contain all

PyMine 144 Dec 30, 2022
Baseline of DCASE 2020 task 4

Couple Learning for SED This repository provides the data and source code for sound event detection (SED) task. The improvement of the Couple Learning

21 Oct 18, 2022
《Truly shift-invariant convolutional neural networks》(2021)

Truly shift-invariant convolutional neural networks [Paper] Authors: Anadi Chaman and Ivan Dokmanić Convolutional neural networks were always assumed

Anadi Chaman 46 Dec 19, 2022
UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model

UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model Official repository for the ICCV 2021 paper: UltraPose: Syn

MomoAILab 92 Dec 21, 2022
Using deep learning model to detect breast cancer.

Breast-Cancer-Detection Breast cancer is the most frequent cancer among women, with around one in every 19 women at risk. The number of cases of breas

1 Feb 13, 2022