Weakly-supervised Text Classification Based on Keyword Graph

Overview

This repository contains the implementation of the EMNLP 2021 paper "Weakly-supervised Text Classification Based on Keyword Graph".

How to run?

Download data

Our datasets follow previous works: for long texts we follow ConWea, and for short texts we follow LOTClass.
We convert all of their data into a unified JSON format.

  1. Download datasets from: https://drive.google.com/drive/folders/1D8E9T-vuBE-YdAd9OBy-yS4UW4AptA58?usp=sharing

    • Long text datasets (following ConWea):

      • 20Newsgroup Fine(20NF)
      • 20Newsgroup Coarse(20NC)
      • NYT Fine(NYT_25)
      • NYT Coarse(NYT_5)
    • Short text datasets (following LOTClass):

      • Agnews
      • dbpedia
      • imdb
      • amazon
  2. Unzip the data into './data/processed'.

Another way to obtain the data (not recommended):
You can download the long text data from ConWea and the short text data from LOTClass, then convert it into JSON format using our code in 'preprocess_data/process_long.py' ('process_short.py' for short texts). Edit the preprocessing script to point the dataset path to your download location and to set the task name. The processed data is written to 'data/processed'. We also provide preprocessing code for X-Class in 'process_x_class.py'.

Requirements

This project is based on Python 3.8. The dependencies are as follows:

pytorch
DGL
yacs
visdom
transformers
scikit-learn
numpy
scipy

Train and Eval

  • We recommend starting visdom to visualize the results:
visdom -p 8888

Open server_ip:8888 in your browser to view the visdom panel.
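
If you want to check that the visdom server is reachable before training, a minimal sketch (assuming the server runs on localhost with port 8888, as in the command above) is:

    import visdom

    # Connect to the visdom server started above (assumed to be on localhost:8888).
    vis = visdom.Visdom(server="http://localhost", port=8888)

    if vis.check_connection():
        # Post a small text panel just to confirm that plots will show up.
        vis.text("visdom server is reachable")
    else:
        print("Could not reach the visdom server; make sure it is running.")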

  • Train:
    • First edit 'task/pipeline.py' to specify the config file and the CUDA devices to use (an illustrative sketch follows this list).
      Some configuration files are provided in the config folder.

    • Start training:

      python task/pipeline.py
      
    • Our code is written for multiple GPUs and may currently be unable to run on a single GPU.
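
The exact variable names inside 'task/pipeline.py' are not shown here; as an illustration only, the kind of edit meant in the first step is choosing a config file and restricting the visible GPUs, for example:

    import os

    # Hypothetical sketch -- the real names in task/pipeline.py may differ.
    # Make the GPUs you want to use visible to the process (the code expects several).
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

    # Point the pipeline at one of the files provided in the config folder
    # (the file name below is a placeholder).
    config_file = "config/your_config.yaml"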

Run on your custom dataset

  1. Provide your dataset files in the directory 'data/processed':

    • keywords.json
      Keywords for each class. Type: dict; key: class_index; value: a list of all keywords for that class. See the provided datasets for details.

    • unlabeled.json
      The unlabeled sentences used in our paper. Type: list; each item is a two-element list [sentence_i, label_i].
      To facilitate evaluation, we follow ConWea's setting in which sentence labels are provided; the labels are used only for evaluation. A sketch of both files is given after this list.

  2. Provide a config file in the 'config' directory. You can copy one of the existing config files and change a few fields, such as number_classes, classifier.type, data_dir_name, etc. (a hypothetical example is sketched below).

  3. Specify the config file name in 'task/pipeline.py' and run the pipeline code.
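
As a concrete illustration of the two files described in step 1, the sketch below writes a tiny made-up dataset in the expected structure; the class indices, keywords, sentences and the sub-directory name are placeholders, and the provided datasets remain the authoritative reference:

    import json

    # keywords.json: dict mapping class_index -> list of keywords for that class.
    # Whether class indices are stored as strings or integers should be copied
    # from the provided datasets.
    keywords = {
        "0": ["basketball", "league", "coach"],   # e.g. a "sports" class
        "1": ["stocks", "market", "economy"],     # e.g. a "business" class
    }

    # unlabeled.json: list of [sentence_i, label_i] pairs; labels are used only for evaluation.
    unlabeled = [
        ["The home team won the championship game last night.", 0],
        ["Shares fell sharply after the quarterly report.", 1],
    ]

    with open("data/processed/your_task/keywords.json", "w") as f:
        json.dump(keywords, f, indent=2)
    with open("data/processed/your_task/unlabeled.json", "w") as f:
        json.dump(unlabeled, f, indent=2)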

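For step 2, the config files are handled with yacs (listed in the requirements). The snippet below is only a hypothetical sketch of the kind of fields mentioned above (number_classes, classifier.type, data_dir_name); copy an existing file from the config folder for the real structure and values:

    from yacs.config import CfgNode as CN

    # Hypothetical field layout -- the existing config files are the authoritative reference.
    cfg = CN()
    cfg.number_classes = 4            # number of classes in your dataset
    cfg.data_dir_name = "your_task"   # sub-directory of data/processed holding your JSON files
    cfg.classifier = CN()
    cfg.classifier.type = "BERT"      # placeholder value; copy from an existing config

    print(cfg)
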
Citation

Please cite the following paper if you find our code helpful! Thank you very much.

Lu Zhang, Jiandong Ding, Yi Xu, Yingyao Liu and Shuigeng Zhou. "Weakly-supervised Text Classification Based on Keyword Graph". EMNLP 2021.
