Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration

Last update: Dec 11, 2022

Overview

This is the official repository for the EMNLP 2021 long paper Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration. We provide code for training and evaluating Phrase-BERT in addition to the datasets used in the paper.

Update: the model is now also available on Hugging Face, thanks to help from whaleloops and nreimers!

Setup

This repository depends on sentence-BERT version 0.3.3, which you can install from source with:

>>> git clone https://github.com/UKPLab/sentence-transformers.git --branch v0.3.3
>>> cd sentence-transformers/
>>> pip install -e .

Alternatively, install sentence-BERT with pip:

>>> pip install sentence-transformers==0.3.3
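Because the repository is pinned to version 0.3.3, it can help to confirm which version is actually installed before running anything. The following is a minimal sketch using only the standard library; the helper name `installed_version` is our own illustration and not part of the repository.

```python
from importlib import metadata

REQUIRED = "0.3.3"  # version stated in the setup instructions above


def installed_version(package: str = "sentence-transformers"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("sentence-transformers is not installed")
elif found != REQUIRED:
    print(f"found sentence-transformers {found}, expected {REQUIRED}")
else:
    print(f"sentence-transformers {found} is installed")
```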

Quick Start

The following example shows how to use a trained Phrase-BERT model to embed phrases into dense vectors.

First download and unzip our model.

>>> cd <...>
>>> wget https://storage.googleapis.com/phrase-bert/phrase-bert/phrase-bert-model.zip
>>> unzip phrase-bert-model.zip -d phrase-bert-model/
>>> rm phrase-bert-model.zip
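If `wget` and `unzip` are not available, the same steps can be approximated in Python with the standard library. This is a sketch, not part of the repository: the function names `download` and `extract` are hypothetical, and the usage lines are left commented out because they fetch a large file.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the wget command above.
MODEL_URL = "https://storage.googleapis.com/phrase-bert/phrase-bert/phrase-bert-model.zip"


def download(url: str, zip_path: Path) -> Path:
    """Fetch url to zip_path (the equivalent of the wget step)."""
    urllib.request.urlretrieve(url, str(zip_path))
    return zip_path


def extract(zip_path: Path, target_dir: Path) -> list:
    """Unpack the archive into target_dir and return the names of its members."""
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Usage (not executed here, since it downloads a large file):
# zip_path = download(MODEL_URL, Path("phrase-bert-model.zip"))
# extract(zip_path, Path("phrase-bert-model"))
# zip_path.unlink()  # the equivalent of the rm step
```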

Then load the Phrase-BERT model through the sentence-BERT interface:

from sentence_transformers import SentenceTransformer
model_path = '<...>'  # set this to the location of the unzipped model
model = SentenceTransformer(model_path)

You can compute phrase embeddings using Phrase-BERT as follows:

phrase_list = [ 'play an active role', 'participate actively', 'active lifestyle']
phrase_embs = model.encode( phrase_list )
[p1, p2, p3] = phrase_embs

As in sentence-BERT, the default output is a list of numpy arrays:

for phrase, embedding in zip(phrase_list, phrase_embs):
    print("Phrase:", phrase)
    print("Embedding:", embedding)
    print("")

An example of computing the dot product of phrase embeddings:

import numpy as np
print(f'The dot product between phrase 1 and 2 is: {np.dot(p1, p2)}')
print(f'The dot product between phrase 1 and 3 is: {np.dot(p1, p3)}')
print(f'The dot product between phrase 2 and 3 is: {np.dot(p2, p3)}')

An example of computing cosine similarity of phrase embeddings:

import torch 
from torch import nn
cos_sim = nn.CosineSimilarity(dim=0)
print(f'The cosine similarity between phrase 1 and 2 is: {cos_sim( torch.tensor(p1), torch.tensor(p2))}')
print(f'The cosine similarity between phrase 1 and 3 is: {cos_sim( torch.tensor(p1), torch.tensor(p3))}')
print(f'The cosine similarity between phrase 2 and 3 is: {cos_sim( torch.tensor(p2), torch.tensor(p3))}')

The output should look like:

The dot product between phrase 1 and 2 is: 218.43600463867188
The dot product between phrase 1 and 3 is: 165.48483276367188
The dot product between phrase 2 and 3 is: 160.51708984375
The cosine similarity between phrase 1 and 2 is: 0.8142536282539368
The cosine similarity between phrase 1 and 3 is: 0.6130303144454956
The cosine similarity between phrase 2 and 3 is: 0.584893524646759
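Cosine similarity is the dot product divided by the product of the two vector norms, which is why the dot products above are far larger than the cosine values: the embeddings are not unit-length. The pairwise comparisons generalize to a full similarity matrix, sketched below with NumPy only. Toy 3-dimensional vectors stand in for real Phrase-BERT embeddings so the example runs without the model, and the function name `cosine_similarity_matrix` is an illustrative choice, not part of the repository.

```python
import numpy as np


def cosine_similarity_matrix(embs: np.ndarray) -> np.ndarray:
    """Return the matrix of pairwise cosine similarities between rows of embs."""
    norms = np.linalg.norm(embs, axis=1, keepdims=True)
    unit = embs / norms  # assumes no all-zero rows
    return unit @ unit.T


# Toy stand-ins for model.encode(phrase_list); real embeddings have many more dimensions.
toy_embs = np.array([
    [3.0, 4.0, 0.0],
    [6.0, 8.0, 0.0],
    [0.0, 0.0, 5.0],
])

sims = cosine_similarity_matrix(toy_embs)
print(np.round(sims, 3))

# For each row, find the most similar *other* row (mask the diagonal first).
masked = sims.copy()
np.fill_diagonal(masked, -np.inf)
print(masked.argmax(axis=1))
```

With real embeddings, replacing `toy_embs` by `np.asarray(phrase_embs)` should give the same cosine values as the `nn.CosineSimilarity` calls above, up to floating-point precision.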

Evaluation

Given the lack of a unified phrase embedding evaluation benchmark, we collect the following five phrase semantics evaluation tasks, which are described further in our paper:

Turney [Download]
BiRD [Download]
PPDB [Download]
PPDB-filtered [Download]
PAWS-short [Download Train-split] [Download Dev-split] [Download Test-split]

Update config/model_path.py with the model path for your directory layout, then:

For evaluation on Turney, run python eval_turney.py
For evaluation on BiRD, run python eval_bird.py

For evaluation on PPDB / PPDB-filtered / PAWS-short, run eval_ppdb_paws.py with:

nohup python -u eval_ppdb_paws.py \
    --full_run_mode \
    --task <...> \
    --data_dir <...> \
    --result_dir <...> \
    >./output.txt 2>&1 &

Train your own Phrase-BERT

If you would like to go beyond the pre-trained Phrase-BERT model, you can train your own Phrase-BERT on data from the domain you are interested in. Please refer to phrase-bert/phrase_bert_finetune.py.

The datasets we used to fine-tune Phrase-BERT are here: training data csv file and validation data csv file.

To reproduce the trained Phrase-BERT, please run:

export INPUT_DATA_PATH=<...>
export TRAIN_DATA_FILE=<...>
export VALID_DATA_FILE=<...>
export INPUT_MODEL_PATH=bert-base-nli-stsb-mean-tokens
export OUTPUT_MODEL_PATH=<...>

python -u phrase_bert_finetune.py \
    --input_data_path $INPUT_DATA_PATH \
    --train_data_file $TRAIN_DATA_FILE \
    --valid_data_file $VALID_DATA_FILE \
    --input_model_path $INPUT_MODEL_PATH \
    --output_model_path $OUTPUT_MODEL_PATH

Citation:

Please cite us if you find this useful:

@inproceedings{phrasebertwang2021,
    author = {Shufan Wang and Laure Thompson and Mohit Iyyer},
    title = {Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration},
    booktitle = {Empirical Methods in Natural Language Processing},
    year = {2021}
}
