KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정한 코드입니다.

Last update: Jan 03, 2023

Overview

KoBERTopic

모델 소개

KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정했습니다.

기존 BERTopic : https://github.com/MaartenGr/BERTopic/tree/05a6790b21009d1704e912e0d9ae22290694cfed
토크나이저로는 형태소 분석기 Mecab을 사용.
BERT로는 다국어 SBERT인 'sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens'를 사용.
토픽의 수는 임의로 50으로 결정.
별도 불용어 제거 등의 추가 전처리는 진행하지 않았음. (진행할 경우 더 좋은 결과를 얻을 수 있을 것으로 기대.)

시각화 결과

참고) BERT 기반의 한국어 토픽 모델 깃허브

BERT 기반의 한국어 토픽 모델에 관심있으시다면 아래의 모델을 추천합니다.
Korean Contextualized Topic Models

Owner

Won Joon Yoo

ML Engineer / Backend Developer

GitHub Repository

Python functions for summarizing and improving voice dictation input.

Helpmespeak Help me speak uses Python functions for summarizing and improving voice dictation input. Get started with OpenAI gpt-3 OpenAI is a amazing

6 Dec 17, 2022

OpenAI CLIP text encoders for multiple languages!

Multilingual-CLIP OpenAI CLIP text encoders for any language Colab Notebook · Pre-trained Models · Report Bug Overview OpenAI recently released the pa

481 Dec 30, 2022

🎐 a python library for doing approximate and phonetic matching of strings.

jellyfish Jellyfish is a python library for doing approximate and phonetic matching of strings. Written by James Turk 1.8k Dec 21, 2022

Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx

Anchored CorEx: Hierarchical Topic Modeling with Minimal Domain Knowledge Correlation Explanation (CorEx) is a topic model that yields rich topics tha

592 Dec 18, 2022

A program that uses real statistics to choose the best times to bet on BloxFlip's crash gamemode

Bloxflip Smart Bet A program that uses real statistics to choose the best times to bet on BloxFlip's crash gamemode. https://bloxflip.com/crash. THIS

43 Jan 05, 2023

InferSent sentence embeddings

InferSent InferSent is a sentence embeddings method that provides semantic representations for English sentences. It is trained on natural language in

2.2k Dec 27, 2022

Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow.

Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow. Documentation Proper documentation is available at

151 Jan 05, 2023

Translate - a PyTorch Language Library

NOTE PyTorch Translate is now deprecated, please use fairseq instead. Translate - a PyTorch Language Library Translate is a library for machine transl

775 Dec 24, 2022

Multiple implementations for abstractive text summurization , using google colab

Text Summarization models if you are able to endorse me on Arxiv, i would be more than glad https://arxiv.org/auth/endorse?x=FRBB89 thanks This repo i

463 Dec 26, 2022

Trains an OpenNMT PyTorch model and SentencePiece tokenizer.

Trains an OpenNMT PyTorch model and SentencePiece tokenizer. Designed for use with Argos Translate and LibreTranslate.

61 Dec 13, 2022

DVC-NLP-Simple-usecase

dvc-NLP-simple-usecase DVC NLP project Reference repository: official reference repo DVC STUDIO MY View Bag of Words- Krish Naik TF-IDF- Krish Naik ST

2 Oct 02, 2022

Simple Speech to Text, Text to Speech

Simple Speech to Text, Text to Speech 1. Download Repository Opsi 1 Download repository ini, extract di lokasi yang diinginkan Opsi 2 Jika sudah famil

5 Dec 28, 2021

A high-level yet extensible library for fast language model tuning via automatic prompt search

ruPrompts ruPrompts is a high-level yet extensible library for fast language model tuning via automatic prompt search, featuring integration with Hugg

37 Dec 07, 2022

Stand-alone language identification system

langid.py readme Introduction langid.py is a standalone Language Identification (LangID) tool. The design principles are as follows: Fast Pre-trained

2k Jan 04, 2023

New Modeling The Background CodeBase

Modeling the Background for Incremental Learning in Semantic Segmentation This is the updated official PyTorch implementation of our work: "Modeling t

9 Dec 28, 2022

Py65 65816 - Add support for the 65C816 to py65

Add support for the 65C816 to py65 Py65 (https://github.com/mnaberez/py65) is a

4 Jan 04, 2023

Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet

1.1k Dec 27, 2022

Text Normalization（文本正则化）

Text Normalization（文本正则化）任务描述：通过机器学习算法将英文文本的“手写”形式转换成“口语“形式，例如“6ft”转换成“six feet”等实验结果 XGBoost + bag-of-words: 0.99159 XGBoost+Weights+rules：0.99002

0 Feb 26, 2022

EMNLP'2021: Can Language Models be Biomedical Knowledge Bases?

BioLAMA BioLAMA is biomedical factual knowledge triples for probing biomedical LMs. The triples are collected and pre-processed from three sources: CT

41 Nov 18, 2022

AutoGluon: AutoML for Text, Image, and Tabular Data

AutoML for Text, Image, and Tabular Data AutoGluon automates machine learning tasks enabling you to easily achieve strong predictive performance in yo

5.2k Dec 29, 2022