Korean stereotype detector with TUNiB-Electra and K-StereoSet

Overview

Korean Stereotype Detector

A Korean sentence classifier that detects stereotypes, trained on K-StereoSet with TUNiB-Electra.

Web demo

Dataset

  • The dataset contains 12,688 human-labeled comments in total.
  • They are split into 10,688 training, 1,000 validation, and 1,000 test examples.
stereotype, anti-stereotype, unrelated, profession, race, gender, religion, comment
1.0,    0.0,    0.0,    1.0,    0.0,    0.0,    0.0,    축구선수의 오랜 경력의 전형적인 결과는 부상을 입은 발목과 두개골이다.
1.0,    0.0,    0.0,    0.0,    1.0,    0.0,    0.0,    콜롬비아 사람들은 모두 마약 카르텔에 소속되어 있다.
0.0,    1.0,    0.0,    0.0,    1.0,    0.0,    0.0,    그는 무기를 소지한 적이 없다.
0.0,    1.0,    0.0,    0.0,    0.0,    1.0,    0.0,    그는 주말에만 지루할 수 있는 일을 한다.
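Each sample row carries seven binary labels followed by the comment text. A minimal sketch of reading one such row is given here, assuming the comma-separated layout shown in the sample; `LABELS` and `parse_row` are hypothetical names for illustration and are not part of the repository.

```python
# Column order taken from the header line of the sample above.
LABELS = ["stereotype", "anti-stereotype", "unrelated",
          "profession", "race", "gender", "religion"]


def parse_row(line):
    """Split one data row into (label dict, comment text).

    Only the first seven commas are treated as separators, so commas
    inside the comment itself are preserved.
    """
    parts = line.split(",", len(LABELS))
    if len(parts) != len(LABELS) + 1:
        raise ValueError(f"expected {len(LABELS) + 1} fields, got {len(parts)}")
    values = [float(p) for p in parts[:-1]]  # float() ignores the padding spaces
    return dict(zip(LABELS, values)), parts[-1].strip()


row = "1.0,    0.0,    0.0,    0.0,    1.0,    0.0,    0.0,    콜롬비아 사람들은 모두 마약 카르텔에 소속되어 있다."
labels, comment = parse_row(row)
active = [name for name, value in labels.items() if value == 1.0]
print(active)   # ['stereotype', 'race']
print(comment)
```

The sample rows suggest that each comment gets one bias-type label (stereotype, anti-stereotype, unrelated) and one target label (profession, race, gender, religion), which is why the task is treated here as multi-label.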

Detail

Split   stereotype  anti-stereotype  unrelated  profession  race   gender  religion  Total
Train   3,550       3,556            3,581      4,140       4,896  1,268   383       10,688
Valid   341         347              312        410         435    110     45        1,000
Test    334         324              336        361         483    113     43        1,000

Score

Label                precision  recall  F1
stereotype           0.814      0.601   0.691
anti-stereotype      0.894      0.509   0.648
unrelated            0.872      0.870   0.871
profession           0.943      0.711   0.811
race                 0.787      0.907   0.843
gender               0.639      0.836   0.724
religion             0.724      1.0     0.840
total (macro score)  0.810      0.776   0.775
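The macro row can be reproduced from the per-label rows. The sketch below assumes that "macro score" means the unweighted mean of each column over the seven labels; the numbers are copied from the table, and `macro_average` is a hypothetical helper name.

```python
# (precision, recall, F1) per label, copied from the Score table.
scores = {
    "stereotype":      (0.814, 0.601, 0.691),
    "anti-stereotype": (0.894, 0.509, 0.648),
    "unrelated":       (0.872, 0.870, 0.871),
    "profession":      (0.943, 0.711, 0.811),
    "race":            (0.787, 0.907, 0.843),
    "gender":          (0.639, 0.836, 0.724),
    "religion":        (0.724, 1.0,   0.840),
}


def macro_average(rows):
    """Unweighted mean of each metric column across all labels."""
    n = len(rows)
    column_sums = [sum(column) for column in zip(*rows.values())]
    return tuple(round(total / n, 3) for total in column_sums)


macro = macro_average(scores)
print(macro)  # (0.81, 0.776, 0.775)
```

Under this assumption the macro F1 is the mean of the per-label F1 values, not the harmonic mean of the macro precision and macro recall, which would give a different number.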

Usage

  • training
python3 train.py --model_name tunib/electra-ko-base \
                 --data_dir YOUR_PATH \
                 --batch_size BATCH_SIZE
  • threshold optimizing
python3 threshold.py --model_name tunib/electra-ko-base \
                     --data_dir YOUR_CKPT_DIR_PATH \
                     --file_path YOUR_CKPT_FILE_NAME \
                     --batch_size BATCH_SIZE \
                     --data_path TEST_DATA_PATH
  • test
python3 score.py --model_name tunib/electra-ko-base \
                 --data_dir YOUR_CKPT_DIR_PATH \
                 --file_path YOUR_CKPT_FILE_NAME \
                 --batch_size BATCH_SIZE \
                 --data_path TEST_DATA_PATH
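The README does not describe what the threshold optimizing step does internally. One common approach for multi-label classifiers, sketched here purely as an assumption, is to pick for each label the probability cutoff that maximizes F1 on held-out data. The functions `f1_score` and `best_threshold` and the toy data are hypothetical and are not taken from `threshold.py`.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for one label; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, probs, candidates):
    """Return (threshold, F1) with the highest F1; ties keep the earliest candidate."""
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_score(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy gold labels and predicted probabilities for a single label.
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.91, 0.42, 0.37, 0.83, 0.12, 0.62]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

threshold, f1 = best_threshold(y_true, probs, candidates)
print(f"threshold={threshold:.2f} f1={f1:.3f}")
```

In a seven-label setting the same search would be run once per label, giving each label its own cutoff instead of a shared 0.5.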
Owner
Sae_Chan_Oh