Korean Simple Contrastive Learning of Sentence Embeddings using SKT KoBERT and kakaobrain KorNLU dataset

Last update: Nov 24, 2022

Overview

KoSimCSE

A PyTorch implementation of Simple Contrastive Learning of Sentence Embeddings (SimCSE) for Korean
- SimCSE

Installation

git clone https://github.com/BM-K/KoSimCSE.git
cd KoSimCSE
git clone https://github.com/SKTBrain/KoBERT.git
cd KoBERT
pip install -r requirements.txt
pip install .
cd ..
pip install -r requirements.txt

Training (supervised only)

Model
- SKT KoBERT
Dataset
- kakaobrain NLU dataset
  - train: KorNLI
  - dev & test: KorSTS
Setting
- epochs: 3
- dropout: 0.1
- batch size: 256
- temperature: 0.05
- learning rate: 5e-5
- warm-up ratio: 0.05
- max sequence length: 50
- evaluation steps during training: 250
Run train -> test -> semantic_search

bash run_example.sh
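
The temperature in the settings above belongs to the contrastive objective. As a rough illustration, the sketch below computes the supervised SimCSE loss as described in the SimCSE paper, with entailment hypotheses as positives and contradiction hypotheses as hard negatives. It uses NumPy rather than PyTorch to stay self-contained. The function names and toy inputs are assumptions made for this example; the training code in this repository may differ in detail.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def supervised_simcse_loss(anchor, positive, negative, temperature=0.05):
    """Cross-entropy over in-batch positives and hard negatives (hypothetical helper).

    anchor, positive, negative: arrays of shape (batch, hidden).
    Row i of `positive` is the correct match for row i of `anchor`.
    """
    a = l2_normalize(np.asarray(anchor, dtype=float))
    p = l2_normalize(np.asarray(positive, dtype=float))
    n = l2_normalize(np.asarray(negative, dtype=float))

    # (batch, 2 * batch) similarity logits: positives first, then hard negatives
    logits = np.concatenate([a @ p.T, a @ n.T], axis=1) / temperature

    # Numerically stable log-softmax over each row
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The target for row i is column i (its own positive)
    idx = np.arange(len(a))
    return float(-log_prob[idx, idx].mean())


# Toy check: matching positives give a much lower loss than swapped ones
eye = np.eye(4)
good_loss = supervised_simcse_loss(eye, eye, -eye)
bad_loss = supervised_simcse_loss(eye, -eye, eye)
print(good_loss, bad_loss)
```

A lower temperature sharpens the softmax, so the loss penalizes hard negatives that score close to the positive more strongly.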

Pre-Trained Models

Using BERT [CLS] token representation
Pre-trained model checkpoint
- Google Drive Sharing
- ./output/nli_checkpoint.pt
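
To make "[CLS] token representation" concrete: the sentence embedding is the encoder's output vector at the first position of each sequence. The sketch below shows that slicing on a toy NumPy array. The function name and array shapes are assumptions for illustration, not this repository's API.

```python
import numpy as np


def cls_pooling(last_hidden_state):
    """Return the first-token ([CLS]) vector of each sequence.

    last_hidden_state: array of shape (batch, seq_len, hidden)
    returns: array of shape (batch, hidden)
    """
    return last_hidden_state[:, 0, :]


# Toy encoder output: 2 sentences, 3 tokens each, hidden size 4
hidden = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
sentence_embeddings = cls_pooling(hidden)
print(sentence_embeddings.shape)
```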

Performance

Model	Cosine Pearson	Cosine Spearman	Euclidean Pearson	Euclidean Spearman	Manhattan Pearson	Manhattan Spearman	Dot Pearson	Dot Spearman
KoSBERT_SKT*	78.81	78.47	77.68	77.78	77.71	77.83	75.75	75.22
KoSimCSE_SKT	81.55	82.11	81.70	81.69	81.65	81.60	78.19	77.18

*: KoSBERT_SKT
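
The column names in the table suggest the usual STS evaluation: score each sentence pair with a similarity measure, then correlate those scores with the gold labels. The sketch below shows one way to compute such numbers with NumPy, assuming negated distances are used for the Euclidean and Manhattan columns and that correlations are reported times 100. All function names and the toy data are assumptions; the repository's evaluation code may differ.

```python
import numpy as np


def rank_average(x):
    """Ranks starting at 1; tied values share their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tie = x == value
        ranks[tie] = ranks[tie].mean()
    return ranks


def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks
    return pearson(rank_average(x), rank_average(y))


def sts_scores(emb_a, emb_b, gold):
    """Return {measure: (pearson * 100, spearman * 100)} for paired embeddings."""
    dot = np.sum(emb_a * emb_b, axis=1)
    norms = np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    measures = {
        "cosine": dot / norms,
        "euclidean": -np.linalg.norm(emb_a - emb_b, axis=1),  # negated distance
        "manhattan": -np.abs(emb_a - emb_b).sum(axis=1),      # negated distance
        "dot": dot,
    }
    return {
        name: (100 * pearson(s, gold), 100 * spearman(s, gold))
        for name, s in measures.items()
    }


# Toy data: the second sentence rotates away from the first as gold similarity drops
angles = np.deg2rad([0.0, 30.0, 60.0, 90.0])
emb_a = np.tile([1.0, 0.0], (4, 1))
emb_b = np.stack([np.cos(angles), np.sin(angles)], axis=1)
gold = np.array([5.0, 4.0, 2.0, 0.0])

results = sts_scores(emb_a, emb_b, gold)
for name, (p, s) in results.items():
    print(f"{name}: Pearson {p:.2f}, Spearman {s:.2f}")
```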

Example Downstream Task

Semantic Search

python SemanticSearch.py

import numpy as np
from model.utils import pytorch_cos_sim
from data.dataloader import convert_to_tensor, example_model_setting


def main():
    model_ckpt = './output/nli_checkpoint.pt'
    model, transform, device = example_model_setting(model_ckpt)

    # Corpus with example sentences
    corpus = ['한 남자가 음식을 먹는다.',
              '한 남자가 빵 한 조각을 먹는다.',
              '그 여자가 아이를 돌본다.',
              '한 남자가 말을 탄다.',
              '한 여자가 바이올린을 연주한다.',
              '두 남자가 수레를 숲 속으로 밀었다.',
              '한 남자가 담으로 싸인 땅에서 백마를 타고 있다.',
              '원숭이 한 마리가 드럼을 연주한다.',
              '치타 한 마리가 먹이 뒤에서 달리고 있다.']

    inputs_corpus = convert_to_tensor(corpus, transform)

    corpus_embeddings = model.encode(inputs_corpus, device)

    # Query sentences:
    queries = ['한 남자가 파스타를 먹는다.',
               '고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.',
               '치타가 들판을 가로 질러 먹이를 쫓는다.']

    # Find the closest 5 sentences of the corpus for each query sentence based on cosine similarity
    top_k = 5
    for query in queries:
        query_embedding = model.encode(convert_to_tensor([query], transform), device)
        cos_scores = pytorch_cos_sim(query_embedding, corpus_embeddings)[0]
        cos_scores = cos_scores.cpu().detach().numpy()

        # Passing range(top_k) as kth puts the top_k best indices first, sorted by descending score
        top_results = np.argpartition(-cos_scores, range(top_k))[0:top_k]

        print("\n\n======================\n\n")
        print("Query:", query)
        print("\nTop %d most similar sentences in corpus:" % top_k)

        for idx in top_results:
            print(corpus[idx].strip(), "(Score: %.4f)" % (cos_scores[idx]))


if __name__ == '__main__':
    main()

Result

Query: 한 남자가 파스타를 먹는다.

Top 5 most similar sentences in corpus:
한 남자가 음식을 먹는다. (Score: 0.6002)
한 남자가 빵 한 조각을 먹는다. (Score: 0.5938)
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.0696)
한 남자가 말을 탄다. (Score: 0.0328)
원숭이 한 마리가 드럼을 연주한다. (Score: -0.0048)


======================


Query: 고릴라 의상을 입은 누군가가 드럼을 연주하고 있다.

Top 5 most similar sentences in corpus:
원숭이 한 마리가 드럼을 연주한다. (Score: 0.6489)
한 여자가 바이올린을 연주한다. (Score: 0.3670)
한 남자가 말을 탄다. (Score: 0.2322)
그 여자가 아이를 돌본다. (Score: 0.1980)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.1628)


======================


Query: 치타가 들판을 가로 질러 먹이를 쫓는다.

Top 5 most similar sentences in corpus:
치타 한 마리가 먹이 뒤에서 달리고 있다. (Score: 0.7756)
두 남자가 수레를 숲 속으로 밀었다. (Score: 0.1814)
한 남자가 말을 탄다. (Score: 0.1666)
원숭이 한 마리가 드럼을 연주한다. (Score: 0.1530)
한 남자가 담으로 싸인 땅에서 백마를 타고 있다. (Score: 0.1270)
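
The ranking step of the script above can be tried without the checkpoint. The sketch below is a NumPy stand-in, under the assumption that `pytorch_cos_sim` returns a (queries x corpus) cosine-similarity matrix; `cos_sim` and `top_k_indices` are hypothetical names introduced for this example, and the embeddings are made-up toy vectors.

```python
import numpy as np


def cos_sim(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def top_k_indices(scores, k):
    """Indices of the k highest scores, ordered from best to worst."""
    k = min(k, len(scores))
    # Partitioning at every position 0..k-1 places the k best indices first, in sorted order
    return np.argpartition(-scores, np.arange(k))[:k]


# Toy embeddings standing in for model.encode(...) outputs
query_embedding = np.array([[1.0, 0.0]])
corpus_embeddings = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])

similarities = cos_sim(query_embedding, corpus_embeddings)[0]
ranking = top_k_indices(similarities, k=2)
for idx in ranking:
    print(idx, "(Score: %.4f)" % similarities[idx])
```

For a corpus this small, a full `np.argsort(-scores)[:k]` gives the same ranking; partitioning only matters when the corpus is large and `k` is small.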

Citing

SimCSE

@article{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   journal={arXiv preprint arXiv:2104.08821},
   year={2021}
}

KorNLU Datasets

@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}
