KR-BERT-SimCSE

Implementing SimCSE (paper, official repository) using TensorFlow 2 and KR-BERT.

Training

Unsupervised

python train_unsupervised.py --mixed_precision

I used the Korean Wikipedia Corpus, which is split into sentences in advance. (Check out the tfds-korean catalog page for details.)
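
Loading the corpus through tfds-korean looks roughly like the sketch below. The dataset module name (`korean_wikipedia_corpus`) is taken from the tfds-korean catalog, but I have not verified it against this repository's training code, so treat it as an assumption:

```python
import tensorflow_datasets as tfds
import tfds_korean.korean_wikipedia_corpus  # noqa: F401 -- importing registers the dataset with TFDS

# Assumed dataset name from the tfds-korean catalog; the corpus is
# already split into sentences, so no extra segmentation is needed.
ds = tfds.load("korean_wikipedia_corpus", split="train")
for example in ds.take(1):
    print(example)
```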

  • Settings
    • KR-BERT character
    • peak learning rate 3e-5
    • batch size 64
    • total steps 25,000
    • 0.05 warmup rate, and linear decay learning rate scheduler
    • temperature 0.05
    • evaluate on KLUE STS and KorSTS every 250 steps
    • max sequence length 64
    • use pooled outputs for training and the [CLS] token's representation for inference

The hyperparameters were not tuned and mostly followed the values in the paper.
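
To make the objective concrete: in the unsupervised setting each sentence is fed through the encoder twice, so the two dropout masks yield two slightly different embeddings of the same sentence, and the other sentences in the batch serve as negatives. Below is a minimal TensorFlow 2 sketch of that loss; the function and argument names are mine, not the repository's:

```python
import tensorflow as tf

def unsupervised_simcse_loss(pooled_a, pooled_b, temperature=0.05):
    """In-batch contrastive loss. `pooled_a` and `pooled_b` are [batch, dim]
    pooled outputs of the *same* sentences encoded twice, so only the
    dropout masks differ between the two views."""
    a = tf.math.l2_normalize(pooled_a, axis=-1)
    b = tf.math.l2_normalize(pooled_b, axis=-1)
    # [batch, batch] cosine-similarity matrix, sharpened by the temperature.
    logits = tf.matmul(a, b, transpose_b=True) / temperature
    # The positive for row i is column i; every other column is a negative.
    labels = tf.range(tf.shape(logits)[0])
    loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)
    return tf.reduce_mean(loss)
```

With temperature 0.05 the cosine similarities are multiplied by 20 before the softmax, which is what sharpens the contrastive signal enough to train on.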

Supervised

python train_supervised.py --mixed_precision

I used KorNLI for supervised training. (Check out the tfds-korean catalog page for details.)

  • Settings
    • KR-BERT character
    • batch size 128
    • epoch 3
    • peak learning rate 5e-5
    • 0.05 warmup rate, and linear decay learning rate scheduler
    • temperature 0.05
    • evaluate on KLUE STS and KorSTS every 125 steps
    • max sequence length 48
    • use pooled outputs for training and the [CLS] token's representation for inference

The hyperparameters were not tuned and mostly followed the values in the paper.
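
The supervised objective swaps the dropout-based positive for the entailment hypothesis of each KorNLI premise and appends the contradiction hypothesis as a hard negative, as in the SimCSE paper. A sketch under the same assumptions as the unsupervised one above:

```python
import tensorflow as tf

def supervised_simcse_loss(premise, entailment, contradiction, temperature=0.05):
    """`premise`, `entailment`, and `contradiction` are [batch, dim] pooled
    outputs for aligned NLI triples. Entailment rows are positives;
    contradiction rows are hard negatives added to the candidate set."""
    p = tf.math.l2_normalize(premise, axis=-1)
    pos = tf.math.l2_normalize(entailment, axis=-1)
    neg = tf.math.l2_normalize(contradiction, axis=-1)
    # Candidates: [2 * batch, dim]; the first `batch` rows are the positives.
    candidates = tf.concat([pos, neg], axis=0)
    logits = tf.matmul(p, candidates, transpose_b=True) / temperature
    labels = tf.range(tf.shape(p)[0])
    loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)
    return tf.reduce_mean(loss)
```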

Results

KorSTS (dev set results)

| model | training | encoding | 100 × Spearman correlation |
|-------|----------|----------|----------------------------|
| KR-BERT base SimCSE | unsupervised | bi encoding | 79.99 |
| KR-BERT base SimCSE-supervised | trained on KorNLI | bi encoding | 84.88 |
| SRoBERTa base* | unsupervised | bi encoding | 63.34 |
| SRoBERTa base* | trained on KorNLI | bi encoding | 76.48 |
| SRoBERTa base* | trained on KorSTS | bi encoding | 83.68 |
| SRoBERTa base* | trained on KorNLI -> KorSTS | bi encoding | 83.54 |
| SRoBERTa large* | trained on KorNLI | bi encoding | 77.95 |
| SRoBERTa large* | trained on KorSTS | bi encoding | 84.74 |
| SRoBERTa large* | trained on KorNLI -> KorSTS | bi encoding | 84.21 |

KorSTS (test set results)

| model | training | encoding | 100 × Spearman correlation |
|-------|----------|----------|----------------------------|
| KR-BERT base SimCSE | unsupervised | bi encoding | 73.25 |
| KR-BERT base SimCSE-supervised | trained on KorNLI | bi encoding | 80.72 |
| SRoBERTa base* | unsupervised | bi encoding | 48.96 |
| SRoBERTa base* | trained on KorNLI | bi encoding | 74.19 |
| SRoBERTa base* | trained on KorSTS | bi encoding | 78.94 |
| SRoBERTa base* | trained on KorNLI -> KorSTS | bi encoding | 80.29 |
| SRoBERTa large* | trained on KorNLI | bi encoding | 75.46 |
| SRoBERTa large* | trained on KorSTS | bi encoding | 79.55 |
| SRoBERTa large* | trained on KorNLI -> KorSTS | bi encoding | 80.49 |
| SRoBERTa base* | trained on KorSTS | cross encoding | 83.00 |
| SRoBERTa large* | trained on KorSTS | cross encoding | 85.27 |

KLUE STS (dev set results)

| model | training | encoding | 100 × Pearson's correlation |
|-------|----------|----------|------------------------------|
| KR-BERT base SimCSE | unsupervised | bi encoding | 74.45 |
| KR-BERT base SimCSE-supervised | trained on KorNLI | bi encoding | 79.42 |
| KR-BERT base* | supervised | cross encoding | 87.50 |
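
For the bi encoding rows above, evaluation reduces to encoding each side of a pair independently with the [CLS] representation and correlating cosine similarity with the gold scores (Spearman for KorSTS, Pearson for KLUE STS). A rough sketch, where `encode_fn` is a hypothetical function mapping a list of sentences to an [n, dim] array:

```python
import numpy as np
from scipy import stats

def evaluate_sts(encode_fn, sentence_pairs, gold_scores):
    """Bi-encoding STS evaluation: encode each side independently and
    score pairs by cosine similarity. Returns 100 x Spearman (KorSTS)
    and 100 x Pearson (KLUE STS) correlations."""
    left = encode_fn([a for a, _ in sentence_pairs])    # [n, dim] [CLS] vectors
    right = encode_fn([b for _, b in sentence_pairs])   # [n, dim]
    left = left / np.linalg.norm(left, axis=-1, keepdims=True)
    right = right / np.linalg.norm(right, axis=-1, keepdims=True)
    cosine = np.sum(left * right, axis=-1)
    spearman = 100 * stats.spearmanr(cosine, gold_scores).correlation
    pearson = 100 * stats.pearsonr(cosine, gold_scores)[0]
    return spearman, pearson
```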

References

@misc{gao2021simcse,
    title={SimCSE: Simple Contrastive Learning of Sentence Embeddings},
    author={Tianyu Gao and Xingcheng Yao and Danqi Chen},
    year={2021},
    eprint={2104.08821},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{ham2020kornli,
    title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
    author={Jiyeon Ham and Yo Joong Choe and Kyubyong Park and Ilji Choi and Hyungjoon Soh},
    year={2020},
    eprint={2004.03289},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{park2021klue,
    title={KLUE: Korean Language Understanding Evaluation},
    author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
    year={2021},
    eprint={2105.09680},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}