Code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

Last update: Dec 27, 2022

SIF

This is the code for the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings".

The code is written in Python and requires numpy, scipy, pickle, sklearn, theano, and the lasagne library. Some functions/classes are based on the code of John Wieting for the paper "Towards Universal Paraphrastic Sentence Embeddings" (Thanks John!). The example data sets are also preprocessed using the code there.

Install

To install all dependencies, using virtualenv is suggested:

$ virtualenv .env
$ . .env/bin/activate
$ pip install -r requirements.txt

Get started

To get started, cd into the directory examples/ and run demo.sh. It downloads the pretrained GloVe word embeddings, and then runs the scripts:

sif_embedding.py is a demo of how to generate sentence embeddings using the SIF weighting scheme,
sim_sif.py and sim_tfidf.py are for the textual similarity tasks in the paper,
supervised_sif_proj.sh is for the supervised tasks in the paper.

Check these files to see the options.

Source code

The code is separated into the following parts:

SIF embedding: involves SIF_embedding.py. The SIF weighting scheme is very simple and is implemented in a few lines.
textual similarity tasks: involves data_io.py, eval.py, and sim_algo.py. data_io provides the code for reading the data, eval is for evaluating the performance, and sim_algo provides the code for our sentence embedding algorithm.
supervised tasks: involves data_io.py, eval.py, train.py, proj_model_sim.py, and proj_model_sentiment.py. train provides the entry point for training the models (proj_model_sim is for the similarity and entailment tasks, and proj_model_sentiment is for the sentiment task). Check train.py to see the options.
utilities: includes lasagne_average_layer.py, params.py, and tree.py. These provide utility functions/classes for the above two parts.
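
To give a feel for why the SIF weighting scheme fits in a few lines, here is a minimal NumPy sketch of the idea as described in the paper: each word vector is weighted by a / (a + p(w)), the weighted vectors are averaged per sentence, and the projection onto the first singular vector of the resulting matrix is removed. This is not the code in SIF_embedding.py. The function names, the dict-based inputs, and the default a = 1e-3 are assumptions made for illustration only.

```python
import numpy as np


def sif_weights(word_probs, a=1e-3):
    """Map each word to its SIF weight a / (a + p(w)).

    word_probs: dict of word -> estimated unigram probability (hypothetical input format).
    """
    return {w: a / (a + p) for w, p in word_probs.items()}


def sif_embeddings(sentences, word_vectors, word_probs, a=1e-3):
    """Illustrative SIF sentence embeddings (a sketch, not the repository's implementation).

    sentences: list of token lists.
    word_vectors: dict of word -> vector (list or array), all of the same dimension.
    """
    weights = sif_weights(word_probs, a)
    dim = len(next(iter(word_vectors.values())))
    emb = np.zeros((len(sentences), dim))

    # Step 1: weighted average of the word vectors in each sentence.
    for i, sent in enumerate(sentences):
        known = [w for w in sent if w in word_vectors and w in weights]
        if not known:
            continue  # leave an all-zero row for sentences with no known words
        total = sum(weights[w] * np.asarray(word_vectors[w], dtype=float) for w in known)
        emb[i] = total / len(known)

    # Step 2: remove the projection onto the first right singular vector.
    _, _, vt = np.linalg.svd(emb, full_matrices=False)
    u = vt[0]
    return emb - np.outer(emb @ u, u)


# Toy usage with made-up vectors and probabilities.
toy_vectors = {"the": [1.0, 0.0, 0.0], "cat": [0.0, 1.0, 0.0], "sat": [0.0, 0.0, 1.0]}
toy_probs = {"the": 0.1, "cat": 0.001, "sat": 0.002}
toy_sentences = [["the", "cat"], ["the", "sat"], ["cat", "sat"]]
toy_embeddings = sif_embeddings(toy_sentences, toy_vectors, toy_probs)
```

Frequent words such as "the" get a small weight, so they contribute little to the average. The second step discards the direction shared by all sentence vectors.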

References

For technical details and full experimental results, see the paper.

@inproceedings{arora2017asimple,
	author = {Sanjeev Arora and Yingyu Liang and Tengyu Ma},
	title = {A Simple but Tough-to-Beat Baseline for Sentence Embeddings},
	booktitle = {International Conference on Learning Representations},
	year = {2017}
}
