
MCSE: Multimodal Contrastive Learning of Sentence Embeddings

This repository contains code and pre-trained models for our NAACL 2022 paper MCSE: Multimodal Contrastive Learning of Sentence Embeddings. If you find this repository useful, please consider citing our paper.

Contact: Miaoran Zhang (mzhang@lsv.uni-saarland.de)

Pre-trained Models & Results

Model                              Download          Avg. STS
flickr-mcse-bert-base-uncased      [Google Drive]    77.70
flickr-mcse-roberta-base           [Google Drive]    78.44
coco-mcse-bert-base-uncased        [Google Drive]    77.08
coco-mcse-roberta-base             [Google Drive]    78.17

Note: flickr indicates that models are trained on wiki+flickr, and coco indicates that models are trained on wiki+coco.
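
To sanity-check a downloaded checkpoint, the sketch below encodes two sentences and compares them. It assumes the released models are standard Hugging Face transformers checkpoints and, following SimCSE, uses the [CLS] token as the sentence representation; the local checkpoint path is hypothetical.

    # Minimal usage sketch (assumes a transformers-compatible checkpoint).
    import torch
    from transformers import AutoModel, AutoTokenizer

    model_path = "checkpoints/flickr-mcse-bert-base-uncased"  # hypothetical local path
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModel.from_pretrained(model_path)
    model.eval()

    sentences = ["A man is playing a guitar.", "Someone is strumming an instrument."]
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        # [CLS]-token embeddings as sentence representations (SimCSE-style pooling).
        embeddings = model(**batch).last_hidden_state[:, 0]

    sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
    print(f"cosine similarity: {sim.item():.4f}")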

Quickstart

Setup

  • Python 3.9.5
  • PyTorch 1.7.1
  • Install other packages:
pip install -r requirements.txt

Data Preparation

Please organize the data directory as follows:

REPO ROOT
|
|--data    
|  |--wiki1m_for_simcse.txt  
|  |--flickr_random_captions.txt    
|  |--flickr_resnet.hdf5    
|  |--coco_random_captions.txt    
|  |--coco_resnet.hdf5  
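
Once the files are in place, you can quickly inspect the precomputed image features. The sketch below assumes the .hdf5 files map image identifiers to ResNet feature vectors; the exact key layout of the released files may differ.

    # Peek into the precomputed ResNet features (layout assumed, may differ).
    import h5py

    with h5py.File("data/flickr_resnet.hdf5", "r") as f:
        keys = list(f.keys())
        print(f"number of entries: {len(keys)}")
        print(f"example key: {keys[0]}, feature shape: {f[keys[0]][()].shape}")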

Wiki1M

wget https://huggingface.co/datasets/princeton-nlp/datasets-for-simcse/resolve/main/wiki1m_for_simcse.txt

Flickr30k & MS-COCO
You can either download the preprocessed data we used:
(annotation sources: flickr30k-entities and coco).

Or preprocess the data yourself (taking Flickr30k as an example):

  1. Download the flickr30k-entities.
  2. Request access to the flickr-images from here. Note that the use of the images must abide by the Flickr Terms of Use.
  3. Run script:
    unzip ${path_to_flickr-entities}/annotations.zip
    
    python preprocess/prepare_flickr.py \
        --flickr_entities_dir ${path_to_flickr-entities} \
        --flickr_images_dir ${path_to_flickr-images} \
        --output_dir data/ \
        --batch_size 32
    

Train & Evaluation

  1. Prepare the SentEval datasets for evaluation:

    cd SentEval/data/downstream/
    bash download_dataset.sh
    
  2. Run scripts:

    # For example:  (more examples are given in scripts/.)
    sh scripts/run_wiki_flickr.sh

    Note: In the paper we run experiments with 5 seeds (0, 1, 2, 3, 4). Detailed parameter settings can be found in the Appendix.
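
For intuition, the core idea behind MCSE is to add a multimodal contrastive term that pulls a sentence embedding toward the feature of its paired image while pushing it away from other images in the batch. The sketch below is a simplified InfoNCE-style illustration of such a loss, not the repository's exact implementation; it assumes sentence and image features have already been projected into a shared space.

    # Simplified multimodal InfoNCE loss (illustration only, not the repo's exact code).
    import torch
    import torch.nn.functional as F

    def multimodal_contrastive_loss(sent_emb, img_emb, temperature=0.05):
        """sent_emb, img_emb: (batch, dim) tensors of paired sentence/image features."""
        sent_emb = F.normalize(sent_emb, dim=-1)
        img_emb = F.normalize(img_emb, dim=-1)
        # Similarity of every sentence to every image in the batch.
        logits = sent_emb @ img_emb.t() / temperature
        # The matching image for sentence i sits at index i; the rest are in-batch negatives.
        labels = torch.arange(sent_emb.size(0), device=sent_emb.device)
        return F.cross_entropy(logits, labels)

    # Example with random features:
    print(multimodal_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)).item())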

Acknowledgements

  • The extremely clear and well-organized codebase: SimCSE
  • SentEval toolkit
