KR-BERT-SimCSE

Overview

Implementing SimCSE (paper, official repository) using TensorFlow 2 and KR-BERT.

Training

Unsupervised

python train_unsupervised.py --mixed_precision
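The --mixed_precision flag suggests float16 training. As a hedged illustration, this is what such a flag typically toggles in TensorFlow 2; the policy and loss-scale wrapper below are standard Keras APIs, not code taken from this repository:

```python
import tensorflow as tf

# Run most ops in float16 while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# Loss scaling guards float16 gradients against underflow.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.Adam(learning_rate=3e-5))
```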

I used the Korean Wikipedia corpus, which is split into sentences in advance. (See the tfds-korean catalog page for details.)
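To load the same corpus yourself, tfds-korean registers its datasets with TensorFlow Datasets. A minimal sketch; the dataset and split names are assumptions based on the catalog, so check the catalog page for the exact identifiers:

```python
import tensorflow_datasets as tfds
import tfds_korean.korean_wikipedia_corpus  # registers the dataset with tfds (name assumed)

# Sentence-split Wikipedia text (split name assumed).
ds = tfds.load("korean_wikipedia_corpus", split="train")
```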

  • Settings
    • KR-BERT character
    • peak learning rate 3e-5
    • batch size 64
    • total steps: 25,000
    • warmup rate 0.05, with a linear-decay learning rate scheduler
    • temperature 0.05 (the contrastive loss is sketched below)
    • evaluate on KLUE STS and KorSTS every 250 steps
    • max sequence length 64
    • pooled outputs are used for training, and the [CLS] token's representation for inference

The hyperparameters were not tuned and mostly followed the values in the paper.
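For reference, the heart of unsupervised SimCSE is an in-batch contrastive loss: each sentence is encoded twice, and dropout noise makes the two pooled outputs differ just enough to act as a positive pair. A minimal TensorFlow 2 sketch reconstructed from the paper, not this repository's exact code:

```python
import tensorflow as tf

def unsupervised_simcse_loss(h1, h2, temperature=0.05):
    """h1, h2: [batch, dim] pooled outputs of the same sentences from two
    dropout-randomized forward passes (SimCSE's positive pairs)."""
    h1 = tf.math.l2_normalize(h1, axis=-1)
    h2 = tf.math.l2_normalize(h2, axis=-1)
    # Cosine similarity of every sentence against every sentence in the batch.
    logits = tf.matmul(h1, h2, transpose_b=True) / temperature  # [batch, batch]
    # The diagonal holds the positives; all other entries are in-batch negatives.
    labels = tf.range(tf.shape(logits)[0])
    loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)
    return tf.reduce_mean(loss)
```

At inference time the [CLS] token's representation replaces the pooled output, as noted in the settings above.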

Supervised

python train_supervised.py --mixed_precision

I used KorNLI for supervised training. (See the tfds-korean catalog page for details.)

  • Settings
    • KR-BERT character
    • batch size 128
    • 3 epochs
    • peak learning rate 5e-5
    • warmup rate 0.05, with a linear-decay learning rate scheduler
    • temperature 0.05 (the contrastive loss is sketched below)
    • evaluate on KLUE STS and KorSTS every 125 steps
    • max sequence length 48
    • pooled outputs are used for training, and the [CLS] token's representation for inference

The hyperparameters were not tuned and mostly followed the values in the paper.
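Supervised SimCSE swaps the dropout trick for NLI triplets: the premise is the anchor, its entailment hypothesis the positive, and its contradiction hypothesis a hard negative appended to the in-batch negatives. A minimal sketch under the same caveats as above:

```python
import tensorflow as tf

def supervised_simcse_loss(anchor, positive, negative, temperature=0.05):
    """anchor/positive/negative: [batch, dim] embeddings of premises,
    entailment hypotheses, and contradiction hypotheses (hard negatives)."""
    anchor = tf.math.l2_normalize(anchor, axis=-1)
    positive = tf.math.l2_normalize(positive, axis=-1)
    negative = tf.math.l2_normalize(negative, axis=-1)
    pos = tf.matmul(anchor, positive, transpose_b=True)  # [batch, batch]
    neg = tf.matmul(anchor, negative, transpose_b=True)  # [batch, batch]
    # Column i of `pos` is the positive for row i; every other column,
    # including all of `neg`, is a negative.
    logits = tf.concat([pos, neg], axis=-1) / temperature  # [batch, 2*batch]
    labels = tf.range(tf.shape(logits)[0])
    loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)
    return tf.reduce_mean(loss)
```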

Results

KorSTS (dev set results)

| Model                           | Training                    | Encoding    | Spearman correlation × 100 |
|---------------------------------|-----------------------------|-------------|----------------------------|
| KR-BERT base, SimCSE            | unsupervised                | bi encoding | 79.99                      |
| KR-BERT base, SimCSE-supervised | trained on KorNLI           | bi encoding | 84.88                      |
| SRoBERTa base*                  | unsupervised                | bi encoding | 63.34                      |
| SRoBERTa base*                  | trained on KorNLI           | bi encoding | 76.48                      |
| SRoBERTa base*                  | trained on KorSTS           | bi encoding | 83.68                      |
| SRoBERTa base*                  | trained on KorNLI -> KorSTS | bi encoding | 83.54                      |
| SRoBERTa large*                 | trained on KorNLI           | bi encoding | 77.95                      |
| SRoBERTa large*                 | trained on KorSTS           | bi encoding | 84.74                      |
| SRoBERTa large*                 | trained on KorNLI -> KorSTS | bi encoding | 84.21                      |

KorSTS (test set results)

| Model                           | Training                    | Encoding       | Spearman correlation × 100 |
|---------------------------------|-----------------------------|----------------|----------------------------|
| KR-BERT base, SimCSE            | unsupervised                | bi encoding    | 73.25                      |
| KR-BERT base, SimCSE-supervised | trained on KorNLI           | bi encoding    | 80.72                      |
| SRoBERTa base*                  | unsupervised                | bi encoding    | 48.96                      |
| SRoBERTa base*                  | trained on KorNLI           | bi encoding    | 74.19                      |
| SRoBERTa base*                  | trained on KorSTS           | bi encoding    | 78.94                      |
| SRoBERTa base*                  | trained on KorNLI -> KorSTS | bi encoding    | 80.29                      |
| SRoBERTa large*                 | trained on KorNLI           | bi encoding    | 75.46                      |
| SRoBERTa large*                 | trained on KorSTS           | bi encoding    | 79.55                      |
| SRoBERTa large*                 | trained on KorNLI -> KorSTS | bi encoding    | 80.49                      |
| SRoBERTa base*                  | trained on KorSTS           | cross encoding | 83.00                      |
| SRoBERTa large*                 | trained on KorSTS           | cross encoding | 85.27                      |

KLUE STS (dev set results)

| Model                           | Training          | Encoding       | Pearson's correlation × 100 |
|---------------------------------|-------------------|----------------|-----------------------------|
| KR-BERT base, SimCSE            | unsupervised      | bi encoding    | 74.45                       |
| KR-BERT base, SimCSE-supervised | trained on KorNLI | bi encoding    | 79.42                       |
| KR-BERT base*                   | supervised        | cross encoding | 87.50                       |
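All bi-encoding numbers above are computed by embedding each sentence of a pair independently, taking the pair's cosine similarity, and correlating it with the gold score (Spearman for KorSTS, Pearson for KLUE STS), scaled by 100. A minimal sketch; the function name and SciPy usage are mine, not this repository's evaluation code:

```python
import numpy as np
from scipy import stats

def sts_scores(emb1, emb2, gold):
    """emb1, emb2: [n, dim] embeddings of the two sides of n sentence pairs;
    gold: [n] human similarity scores."""
    emb1 = emb1 / np.linalg.norm(emb1, axis=-1, keepdims=True)
    emb2 = emb2 / np.linalg.norm(emb2, axis=-1, keepdims=True)
    cos = np.sum(emb1 * emb2, axis=-1)        # cosine similarity per pair
    spearman, _ = stats.spearmanr(cos, gold)  # KorSTS metric
    pearson, _ = stats.pearsonr(cos, gold)    # KLUE STS metric
    return 100 * spearman, 100 * pearson
```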

References

@misc{gao2021simcse,
    title={SimCSE: Simple Contrastive Learning of Sentence Embeddings},
    author={Tianyu Gao and Xingcheng Yao and Danqi Chen},
    year={2021},
    eprint={2104.08821},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{ham2020kornli,
    title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
    author={Jiyeon Ham and Yo Joong Choe and Kyubyong Park and Ilji Choi and Hyungjoon Soh},
    year={2020},
    eprint={2004.03289},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{park2021klue,
    title={KLUE: Korean Language Understanding Evaluation},
    author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
    year={2021},
    eprint={2105.09680},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}