SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples

Last update: Jan 02, 2023


This is the repository for SNCSE.

SNCSE aims to alleviate feature suppression in contrastive learning for unsupervised sentence embedding. In this field, feature suppression means that models fail to distinguish and decouple textual similarity from semantic similarity. As a result, they may overestimate the semantic similarity of any pair of sentences with similar wording, regardless of the actual semantic difference between them, and they may underestimate the semantic similarity of pairs with fewer words in common. (Please refer to Section 5 of our paper for several instances and a detailed analysis.) To this end, we propose to take the negation of the original sentences as soft negative samples and to introduce them into the traditional contrastive learning framework through a bidirectional margin loss (BML).

[Figure: the structure of SNCSE]
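As a rough illustration of the bidirectional margin loss mentioned above, the sketch below penalizes the similarity gap between the soft negative and the positive whenever it falls outside a band. The function names, the exact two-sided hinge form, and the margin values `alpha` and `beta` are assumptions made for illustration; please refer to the paper and the training code for the definition actually used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bidirectional_margin_loss(anchor, positive, soft_negative,
                              alpha=0.1, beta=0.3):
    """Two-sided hinge on delta = cos(anchor, soft_neg) - cos(anchor, pos).

    Assumed form: the loss is zero while -beta <= delta <= -alpha, so the
    soft negative is kept less similar than the positive, but not
    arbitrarily far away. alpha and beta are illustrative values.
    """
    delta = cosine(anchor, soft_negative) - cosine(anchor, positive)
    too_similar = max(0.0, delta + alpha)   # soft negative too close
    too_distant = max(0.0, -delta - beta)   # soft negative pushed too far
    return too_similar + too_distant


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    positive = [1.0, 0.0]
    soft_negative = [0.8, 0.6]  # cosine with the anchor is about 0.8
    print(bidirectional_margin_loss(anchor, positive, soft_negative))
```

In this sketch the lower bound is what makes the negative "soft": unlike an ordinary negative, the negated sentence is not pushed as far away as possible, only kept inside the band between the two margins.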

[Figure: performance of SNCSE on the STS task with different encoders]

To reproduce the above results, please download the files and unzip them to replace the original file folder. Then download the models, modify the file path variables, and run:

python bert_prediction.py
python roberta_prediction.py

To train SNCSE, please download the training file and put it in /SNCSE/data. You can either run:

python generate_soft_negative_samples.py

to generate soft negative samples, or use our file at /Files/soft_negative_samples.txt. Then modify and run train_SNCSE.sh.
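For intuition about what a soft negative sample looks like, here is a toy negation rule. It is not the logic of generate_soft_negative_samples.py; the function name `negate` and the small auxiliary list are hypothetical, and the rule only handles sentences that contain a be-verb or a modal.

```python
# Toy rule: insert "not" after the first be-verb or modal in the sentence.
# This is an illustrative assumption, not the repository's generation script.
AUXILIARIES = {
    "is", "are", "was", "were",
    "can", "could", "will", "would", "should", "must",
}


def negate(sentence):
    """Return a negated copy of `sentence`, or None if no rule applies."""
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        if token.lower() in AUXILIARIES:
            return " ".join(tokens[: i + 1] + ["not"] + tokens[i + 1:])
    return None  # the toy rule gives up on sentences without an auxiliary


if __name__ == "__main__":
    print(negate("A man is playing a guitar."))
    # prints: A man is not playing a guitar.
```

The negated sentence shares almost all of its words with the original while carrying a different meaning, which is the kind of pair the soft negative samples are meant to provide.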

To evaluate the checkpoints saved during training on the development set of the STS-B task, please run:

python bert_evaluation.py
python roberta_evaluation.py
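STS benchmarks are commonly scored with the Spearman correlation between predicted similarities and gold scores. The sketch below shows that metric in plain Python; that the evaluation scripts report exactly this quantity is an assumption, and the helper names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks of `values`, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(predicted, gold):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(gold))


if __name__ == "__main__":
    predicted = [0.9, 0.1, 0.5, 0.3]  # e.g. cosine similarities of pairs
    gold = [5.0, 1.0, 2.0, 3.0]       # e.g. human similarity scores
    print(round(spearman(predicted, gold), 4))  # prints 0.8
```

Because only ranks matter, the metric rewards a model for ordering sentence pairs correctly even if its raw cosine values are on a different scale from the gold scores.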

Feel free to contact the authors at [email protected] for any questions.

Please cite SNCSE as:

Hao Wang, Yangguang Li, Zhen Huang, Yong Dou, Lingpeng Kong, Jing Shao. SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples. CoRR, abs/2201.05979, 2022.


Owner

Sense-GVT
