KR-BERT-SimCSE

Overview

Implementing SimCSE (paper, official repository) using TensorFlow 2 and KR-BERT.

Training

Unsupervised

python train_unsupervised.py --mixed_precision

I used the Korean Wikipedia corpus, split into sentences in advance. (Check out the tfds-korean catalog page for details.)

  • Settings
    • KR-BERT character
    • peak learning rate 3e-5
    • batch size 64
    • total steps 25,000
    • 0.05 warmup rate with linear learning-rate decay
    • temperature 0.05
    • evaluate on KLUE STS and KorSTS every 250 steps
    • max sequence length 64
    • pooled outputs used for training, the [CLS] token's representation for inference

The hyperparameters were not tuned and mostly followed the values in the paper.
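
Conceptually, unsupervised SimCSE builds a positive pair by passing the same sentence through the encoder twice with different dropout masks, and contrasts it against the rest of the batch. Below is a minimal TensorFlow 2 sketch of that objective; the `encode` callable is a hypothetical stand-in for the KR-BERT encoder, not this repo's actual API.

```python
import tensorflow as tf

def simcse_unsupervised_loss(encode, sentences, temperature=0.05):
    """In-batch contrastive loss where each positive pair comes from two
    dropout-perturbed encodings of the same sentence."""
    # Two forward passes with training=True yield independent dropout masks.
    z1 = tf.math.l2_normalize(encode(sentences, training=True), axis=-1)
    z2 = tf.math.l2_normalize(encode(sentences, training=True), axis=-1)
    # Pairwise cosine similarities, scaled by the temperature (0.05 above).
    logits = tf.matmul(z1, z2, transpose_b=True) / temperature  # (B, B)
    # Diagonal entries are the positives; all other columns act as negatives.
    labels = tf.range(tf.shape(logits)[0])
    return tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True))
```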
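
The schedule above (linear warmup over the first 5% of steps, then linear decay to zero) can be expressed as a Keras LearningRateSchedule. A sketch with the unsupervised settings plugged in; the class name and exact shape are assumptions, not this repo's code:

```python
import tensorflow as tf

class WarmupLinearDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup for the first warmup_rate fraction of total_steps,
    then linear decay to zero (values mirror the settings above)."""

    def __init__(self, peak_lr=3e-5, total_steps=25_000, warmup_rate=0.05):
        self.peak_lr = peak_lr
        self.total_steps = total_steps
        self.warmup_steps = int(total_steps * warmup_rate)

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        # Ramp up to peak_lr, then decay linearly; clamp at zero past the end.
        warmup = self.peak_lr * step / float(self.warmup_steps)
        decay = self.peak_lr * (self.total_steps - step) / float(
            self.total_steps - self.warmup_steps)
        return tf.maximum(0.0, tf.minimum(warmup, decay))
```

An instance can be passed directly as a Keras optimizer's learning rate, e.g. tf.keras.optimizers.Adam(WarmupLinearDecay()).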

Supervised

python train_supervised.py --mixed_precision

I used KorNLI for supervised training. (Check out the tfds-korean catalog page.)

  • Settings
    • KR-BERT character
    • batch size 128
    • epoch 3
    • peak learning rate 5e-5
    • 0.05 warmup rate with linear learning-rate decay
    • temperature 0.05
    • evaluate on KLUE STS and KorSTS every 125 steps
    • max sequence length 48
    • pooled outputs used for training, the [CLS] token's representation for inference

The hyperparameters were not tuned and mostly followed the values in the paper.
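
Supervised SimCSE draws its pairs from NLI triplets: for each premise, the entailment hypothesis is the positive and the contradiction hypothesis is a hard negative. A minimal TensorFlow 2 sketch under the same assumption of a hypothetical `encode` stand-in:

```python
import tensorflow as tf

def simcse_supervised_loss(encode, premises, entailments, contradictions,
                           temperature=0.05):
    """Contrastive loss over KorNLI triplets with in-batch negatives plus
    contradiction hypotheses as hard negatives."""
    h = tf.math.l2_normalize(encode(premises, training=True), axis=-1)
    h_pos = tf.math.l2_normalize(encode(entailments, training=True), axis=-1)
    h_neg = tf.math.l2_normalize(encode(contradictions, training=True), axis=-1)
    # Score each premise against every entailment and every contradiction
    # in the batch; only its own entailment counts as the positive.
    sim_pos = tf.matmul(h, h_pos, transpose_b=True)  # (B, B)
    sim_neg = tf.matmul(h, h_neg, transpose_b=True)  # (B, B)
    logits = tf.concat([sim_pos, sim_neg], axis=1) / temperature  # (B, 2B)
    labels = tf.range(tf.shape(logits)[0])
    return tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True))
```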

Results

KorSTS (dev set results)

| model                          | trained on       | encoding    | 100 × Spearman correlation |
|--------------------------------|------------------|-------------|----------------------------|
| KR-BERT base SimCSE            | unsupervised     | bi encoding | 79.99                      |
| KR-BERT base SimCSE-supervised | KorNLI           | bi encoding | 84.88                      |
| SRoBERTa base*                 | unsupervised     | bi encoding | 63.34                      |
| SRoBERTa base*                 | KorNLI           | bi encoding | 76.48                      |
| SRoBERTa base*                 | KorSTS           | bi encoding | 83.68                      |
| SRoBERTa base*                 | KorNLI -> KorSTS | bi encoding | 83.54                      |
| SRoBERTa large*                | KorNLI           | bi encoding | 77.95                      |
| SRoBERTa large*                | KorSTS           | bi encoding | 84.74                      |
| SRoBERTa large*                | KorNLI -> KorSTS | bi encoding | 84.21                      |

KorSTS (test set results)

| model                          | trained on       | encoding       | 100 × Spearman correlation |
|--------------------------------|------------------|----------------|----------------------------|
| KR-BERT base SimCSE            | unsupervised     | bi encoding    | 73.25                      |
| KR-BERT base SimCSE-supervised | KorNLI           | bi encoding    | 80.72                      |
| SRoBERTa base*                 | unsupervised     | bi encoding    | 48.96                      |
| SRoBERTa base*                 | KorNLI           | bi encoding    | 74.19                      |
| SRoBERTa base*                 | KorSTS           | bi encoding    | 78.94                      |
| SRoBERTa base*                 | KorNLI -> KorSTS | bi encoding    | 80.29                      |
| SRoBERTa large*                | KorNLI           | bi encoding    | 75.46                      |
| SRoBERTa large*                | KorSTS           | bi encoding    | 79.55                      |
| SRoBERTa large*                | KorNLI -> KorSTS | bi encoding    | 80.49                      |
| SRoBERTa base*                 | KorSTS           | cross encoding | 83.00                      |
| SRoBERTa large*                | KorSTS           | cross encoding | 85.27                      |

KLUE STS (dev set results)

| model                          | trained on   | encoding       | 100 × Pearson's correlation |
|--------------------------------|--------------|----------------|-----------------------------|
| KR-BERT base SimCSE            | unsupervised | bi encoding    | 74.45                       |
| KR-BERT base SimCSE-supervised | KorNLI       | bi encoding    | 79.42                       |
| KR-BERT base*                  | supervised   | cross encoding | 87.50                       |
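
In these tables, bi encoding means each sentence is embedded independently and the pair is scored by cosine similarity, while cross encoding feeds both sentences through the model jointly. A minimal sketch of the bi-encoding evaluation, assuming a hypothetical `encode` function that returns [CLS] representations as NumPy arrays:

```python
import numpy as np
from scipy import stats

def sts_spearman(encode, sentences_a, sentences_b, gold_scores):
    """Embed each side independently, score pairs by cosine similarity,
    and report 100 x Spearman correlation against the gold STS labels."""
    a = np.asarray(encode(sentences_a))  # (n, dim)
    b = np.asarray(encode(sentences_b))
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    cosine = np.sum(a * b, axis=-1)  # cosine similarity per pair
    return 100 * stats.spearmanr(cosine, gold_scores).correlation
```

The KLUE STS numbers are computed the same way but reported as 100 × Pearson's correlation.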

References

@misc{gao2021simcse,
    title={SimCSE: Simple Contrastive Learning of Sentence Embeddings},
    author={Tianyu Gao and Xingcheng Yao and Danqi Chen},
    year={2021},
    eprint={2104.08821},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{ham2020kornli,
    title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
    author={Jiyeon Ham and Yo Joong Choe and Kyubyong Park and Ilji Choi and Hyungjoon Soh},
    year={2020},
    eprint={2004.03289},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@misc{park2021klue,
    title={KLUE: Korean Language Understanding Evaluation},
    author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
    year={2021},
    eprint={2105.09680},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}