KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정한 코드입니다.

Last update: Jan 03, 2023

Overview

KoBERTopic

모델 소개

KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정했습니다.

기존 BERTopic : https://github.com/MaartenGr/BERTopic/tree/05a6790b21009d1704e912e0d9ae22290694cfed
토크나이저로는 형태소 분석기 Mecab을 사용.
BERT로는 다국어 SBERT인 'sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens'를 사용.
토픽의 수는 임의로 50으로 결정.
별도 불용어 제거 등의 추가 전처리는 진행하지 않았음. (진행할 경우 더 좋은 결과를 얻을 수 있을 것으로 기대.)

시각화 결과

참고) BERT 기반의 한국어 토픽 모델 깃허브

BERT 기반의 한국어 토픽 모델에 관심있으시다면 아래의 모델을 추천합니다.
Korean Contextualized Topic Models

Owner

Won Joon Yoo

ML Engineer / Backend Developer

GitHub Repository

Open source code for AlphaFold.

AlphaFold This package provides an implementation of the inference pipeline of AlphaFold v2.0. This is a completely new model that was entered in CASP

9.7k Jan 02, 2023

GNES enables large-scale index and semantic search for text-to-text, image-to-image, video-to-video and any-to-any content form

GNES is Generic Neural Elastic Search, a cloud-native semantic search system based on deep neural network.

1.2k Jan 06, 2023

WikiPron - a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary

WikiPron WikiPron is a command-line tool and Python API for mining multilingual pronunciation data from Wiktionary, as well as a database of pronuncia

213 Jan 01, 2023

This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest

Rachford-Rice Contest This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest. Can you solve the Rachford-Rice problem for all t

13 Sep 20, 2022

The (extremely) naive sentiment classification function based on NBSVM trained on wisesight_sentiment

thai_sentiment The naive sentiment classification function based on NBSVM trained on wisesight_sentiment วิธีติดตั้ง pip install thai_sentiment==0.1.3

7 Dec 08, 2022

2021语言与智能技术竞赛：机器阅读理解任务

LICS2021 MRC 1. 项目&任务介绍本项目基于官方给定的baseline（DuReader-Checklist-BASELINE）进行二次改造，对整个代码框架做了简单的重构，对核心网络结构添加了注释，解耦了数据读取的模块，并添加了阈值确认的功能，一些小的细节也做了改进。本次任务为202

29 Dec 05, 2022

FewCLUE: 为中文NLP定制的小样本学习测评基准

387 Jan 04, 2023

glow-speak is a fast, local, neural text to speech system that uses eSpeak-ng as a text/phoneme front-end.

Glow-Speak glow-speak is a fast, local, neural text to speech system that uses eSpeak-ng as a text/phoneme front-end. Installation git clone https://g

8 Dec 25, 2022

[NeurIPS 2021] Code for Learning Signal-Agnostic Manifolds of Neural Fields

Learning Signal-Agnostic Manifolds of Neural Fields This is the uncleaned code for the paper Learning Signal-Agnostic Manifolds of Neural Fields. The

60 Dec 12, 2022

Text Normalization（文本正则化）

Text Normalization（文本正则化）任务描述：通过机器学习算法将英文文本的“手写”形式转换成“口语“形式，例如“6ft”转换成“six feet”等实验结果 XGBoost + bag-of-words: 0.99159 XGBoost+Weights+rules：0.99002

0 Feb 26, 2022

Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Expressions.

patterns-finder Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Ex

22 Dec 19, 2022

This repository will contain the code for the CVPR 2021 paper "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields"

1.1k Dec 27, 2022

A 30000+ Chinese MRC dataset - Delta Reading Comprehension Dataset

Delta Reading Comprehension Dataset 台達閱讀理解資料集 Delta Reading Comprehension Dataset (DRCD) 屬於通用領域繁體中文機器閱讀理解資料集。本資料集期望成為適用於遷移學習之標準中文閱讀理解資料集。本資料集從2,108篇

272 Dec 15, 2022

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources (NAACL-2021).

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources Description This is the repository for the paper Unifying Cross-

16 Sep 09, 2022

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.

Tensor2Tensor Tensor2Tensor, or T2T for short, is a library of deep learning models and datasets designed to make deep learning more accessible and ac

12.9k Jan 07, 2023

AI_Assistant - This is a Python based Voice Assistant.

This is a Python based Voice Assistant. This was programmed to increase my understanding of python and also how the in-general Voice Assistants work.

1 Jan 06, 2022

Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

Introduction Funnel-Transformer is a new self-attention model that gradually compresses the sequence of hidden states to a shorter one and hence reduc

197 Dec 11, 2022

Tools and data for measuring the popularity & growth of various programming languages.

growth-data Tools and data for measuring the popularity & growth of various programming languages. Install the dependencies $ pip install -r requireme

3 Jan 06, 2022

In this repository, I have developed an end to end Automatic speech recognition project. I have developed the neural network model for automatic speech recognition with PyTorch and used MLflow to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.

End to End Automatic Speech Recognition In this repository, I have developed an end to end Automatic speech recognition project. I have developed the

22 Nov 13, 2022

open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

中文开放信息抽取系统, open-information-extraction-system, build open-knowledge-graph(SPO, subject-predicate-object) by pyltp(version==3.4.0)

7 Nov 02, 2022