Various Algorithms for Short Text Mining

Overview

Short Text Mining in Python

Introduction

shorttext is a Python package that facilitates supervised and unsupervised learning for short text categorization. Because short texts contain few words and carry little information on their own, an intermediate representation of the texts and documents is needed before they are put into any classification algorithm. The package provides several types of such representations, including topic modeling and word-embedding algorithms.

Python version support by release:

  • since release 1.5.2, it runs on Python 3.9;
  • since release 1.5.0, support for Python 3.6 was decommissioned;
  • since release 1.2.4, it runs on Python 3.8;
  • since release 1.2.3, support for Python 3.5 was decommissioned;
  • since release 1.1.7, support for Python 2.7 was decommissioned;
  • since release 1.0.8, it runs on Python 3.7 with TensorFlow being the backend for Keras;
  • since release 1.0.7, it runs on Python 3.7 as well, but the backend for Keras cannot be TensorFlow; and
  • since release 1.0.0, shorttext runs on Python 2.7, 3.5, and 3.6.

Characteristics:

  • example data provided (including subject keywords and NIH RePORT);
  • text preprocessing;
  • pre-trained word-embedding support;
  • gensim topic models (LDA, LSI, Random Projections) and autoencoder;
  • topic model representation supported for supervised learning using scikit-learn;
  • cosine distance classification;
  • neural network classification (including ConvNet, and C-LSTM);
  • maximum entropy classification;
  • metrics of phrase differences, including soft Jaccard score (using Damerau-Levenshtein distance), and Word Mover's distance (WMD);
  • character-level sequence-to-sequence (seq2seq) learning;
  • spell correction;
  • API for one-time loading of word-embedding models; and
  • sentence encodings and similarities based on BERT.
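
To illustrate one of the metrics listed above, here is a minimal sketch of a soft Jaccard score built on the Damerau-Levenshtein distance (optimal string alignment variant). It is not the package's own implementation: the function names, the token similarity `1 - distance / longest length`, and the way the soft intersection is formed are all assumptions made for this example.

```python
def damerau_levenshtein(a, b):
    """Edit distance allowing insertion, deletion, substitution and
    transposition of adjacent characters (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def similarity(a, b):
    """Map the edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - damerau_levenshtein(a, b) / longest


def soft_jaccard(tokens1, tokens2):
    """Jaccard score in which near-identical tokens count as partial matches."""
    set1, set2 = set(tokens1), set(tokens2)
    if not set1 or not set2:
        return 0.0
    soft_intersection = sum(max(similarity(t1, t2) for t2 in set2) for t1 in set1)
    soft_union = len(set1) + len(set2) - soft_intersection
    return soft_intersection / soft_union


print(soft_jaccard(["appel", "pie"], ["apple", "pie"]))
```

With exact matching, the misspelled "appel" would contribute nothing to the intersection; here it contributes 0.8, so the two phrases still score as close.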

Documentation

Documentation and tutorials for shorttext can be found here: http://shorttext.rtfd.io/.

See tutorial for how to use the package, and FAQ.

Installation

To install it, in a console, use pip.

>>> pip install -U shorttext

or, if you want the most recent development version on GitHub, type

>>> pip install -U git+https://github.com/stephenhky/[email protected]

Developers are advised to make sure that Keras >= 2 is installed. Users are advised to install a backend, TensorFlow (preferred) or Theano, in advance. It is also desirable to have Cython installed beforehand.
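
A quick way to see which of these prerequisites are already present is sketched below. It only checks whether a package can be found on the import path; it does not check versions. The import names `keras`, `tensorflow`, `theano` and `Cython` are the usual ones for these projects, and the helper name `missing_packages` is made up for this example.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names that cannot be found on the import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Theano is only needed when TensorFlow is not used as the backend.
    for name in missing_packages(["keras", "tensorflow", "theano", "Cython"]):
        print(f"not installed: {name}")
```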

See installation guide for more details.

Issues

To report any issues, go to the Issues tab of the GitHub page and start a thread. Developers are welcome to submit pull requests to fix any errors.

Contributors

If you would like to contribute, feel free to submit pull requests. You can reach me in advance by e-mail or through the Issues page.

Useful Links

News

  • 07/11/2021: shorttext 1.5.3 released.
  • 07/06/2021: shorttext 1.5.2 released.
  • 04/10/2021: shorttext 1.5.1 released.
  • 04/09/2021: shorttext 1.5.0 released.
  • 02/11/2021: shorttext 1.4.8 released.
  • 01/11/2021: shorttext 1.4.7 released.
  • 01/03/2021: shorttext 1.4.6 released.
  • 12/28/2020: shorttext 1.4.5 released.
  • 12/24/2020: shorttext 1.4.4 released.
  • 11/10/2020: shorttext 1.4.3 released.
  • 10/18/2020: shorttext 1.4.2 released.
  • 09/23/2020: shorttext 1.4.1 released.
  • 09/02/2020: shorttext 1.4.0 released.
  • 07/23/2020: shorttext 1.3.0 released.
  • 06/05/2020: shorttext 1.2.6 released.
  • 05/20/2020: shorttext 1.2.5 released.
  • 05/13/2020: shorttext 1.2.4 released.
  • 04/28/2020: shorttext 1.2.3 released.
  • 04/07/2020: shorttext 1.2.2 released.
  • 03/23/2020: shorttext 1.2.1 released.
  • 03/21/2020: shorttext 1.2.0 released.
  • 12/01/2019: shorttext 1.1.6 released.
  • 09/24/2019: shorttext 1.1.5 released.
  • 07/20/2019: shorttext 1.1.4 released.
  • 07/07/2019: shorttext 1.1.3 released.
  • 06/05/2019: shorttext 1.1.2 released.
  • 04/23/2019: shorttext 1.1.1 released.
  • 03/03/2019: shorttext 1.1.0 released.
  • 02/14/2019: shorttext 1.0.8 released.
  • 01/30/2019: shorttext 1.0.7 released.
  • 01/29/2019: shorttext 1.0.6 released.
  • 01/13/2019: shorttext 1.0.5 released.
  • 10/03/2018: shorttext 1.0.4 released.
  • 08/06/2018: shorttext 1.0.3 released.
  • 07/24/2018: shorttext 1.0.2 released.
  • 07/17/2018: shorttext 1.0.1 released.
  • 07/14/2018: shorttext 1.0.0 released.
  • 06/18/2018: shorttext 0.7.2 released.
  • 05/30/2018: shorttext 0.7.1 released.
  • 05/17/2018: shorttext 0.7.0 released.
  • 02/27/2018: shorttext 0.6.0 released.
  • 01/19/2018: shorttext 0.5.11 released.
  • 01/15/2018: shorttext 0.5.10 released.
  • 12/14/2017: shorttext 0.5.9 released.
  • 11/08/2017: shorttext 0.5.8 released.
  • 10/27/2017: shorttext 0.5.7 released.
  • 10/17/2017: shorttext 0.5.6 released.
  • 09/28/2017: shorttext 0.5.5 released.
  • 09/08/2017: shorttext 0.5.4 released.
  • 09/02/2017: end of GSoC project. (Report)
  • 08/22/2017: shorttext 0.5.1 released.
  • 07/28/2017: shorttext 0.4.1 released.
  • 07/26/2017: shorttext 0.4.0 released.
  • 06/16/2017: shorttext 0.3.8 released.
  • 06/12/2017: shorttext 0.3.7 released.
  • 06/02/2017: shorttext 0.3.6 released.
  • 05/30/2017: GSoC project (Chinmaya Pancholi, with gensim)
  • 05/16/2017: shorttext 0.3.5 released.
  • 04/27/2017: shorttext 0.3.4 released.
  • 04/19/2017: shorttext 0.3.3 released.
  • 03/28/2017: shorttext 0.3.2 released.
  • 03/14/2017: shorttext 0.3.1 released.
  • 02/23/2017: shorttext 0.2.1 released.
  • 12/21/2016: shorttext 0.2.0 released.
  • 11/25/2016: shorttext 0.1.2 released.
  • 11/21/2016: shorttext 0.1.1 released.

Possible Future Updates

  • Dividing components into other packages;
  • More available corpora.

Comments

  • standalone ?

    Hi. I have many questions.... :-)

    I'm a beginner in Python. Is there a way to run the code standalone?

    For example, after training on my data, I'd like to see the scores in the terminal via classifier.score('apple'), where the word 'apple' can be changed.

    Thank you, regards,

    opened by chocosando 20
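
One possible shape for such a standalone script is sketched below. The `KeywordClassifier` class is a hypothetical stand-in so that the example runs on its own; in practice it would be replaced by whatever trained classifier object exposes `score(text)`. Only the command-line handling is the point here.

```python
import argparse


class KeywordClassifier:
    """Hypothetical stand-in for a trained classifier exposing score(text)."""

    def __init__(self, keywords_by_label):
        self.keywords_by_label = keywords_by_label

    def score(self, text):
        tokens = set(text.lower().split())
        return {label: len(tokens & set(words)) / len(words)
                for label, words in self.keywords_by_label.items()}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Score a short text from the terminal.")
    parser.add_argument("text", help="short text to score, e.g. apple")
    args = parser.parse_args(argv)

    # Replace this toy classifier with a trained one.
    classifier = KeywordClassifier({"fruit": ["apple", "banana"],
                                    "tech": ["apple", "laptop"]})
    scores = classifier.score(args.text)
    for label, value in sorted(scores.items(), key=lambda item: -item[1]):
        print(f"{label}\t{value:.3f}")
    return scores
```

Saved to a file (say, a hypothetical `score.py`) that ends with a call to `main()`, this could be run as `python score.py apple`, and the word can be changed on each invocation.
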
  • ImportError: No module named classification_exceptions

    import shorttext

    
    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-5-cb09b3381050> in <module>()
    ----> 1 import shorttext
    
    /usr/local/lib/python2.7/dist-packages/shorttext/__init__.py in <module>()
          5 sys.path.append(thisdir)
          6 
    ----> 7 from . import utils
          8 from . import data
          9 from . import classifiers
    
    /usr/local/lib/python2.7/dist-packages/shorttext/utils/__init__.py in <module>()
          4 from . import textpreprocessing
          5 from .wordembed import load_word2vec_model
    ----> 6 from . import compactmodel_io
          7 
          8 from .textpreprocessing import spacy_tokenize as tokenize
    
    /usr/local/lib/python2.7/dist-packages/shorttext/utils/compactmodel_io.py in <module>()
         13 from functools import partial
         14 
    ---> 15 import utils.classification_exceptions as e
         16 
         17 def removedir(dir):
    
    ImportError: No module named classification_exceptions
    
    
    opened by spate141 11
  • ImportError: dlopen: cannot load any more object with static TLS

    Hi, I got the following error when I import shorttext. How shall I resolve it?

    Using TensorFlow backend.

    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcublas.so.7.5 locally
    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcudnn.so.5 locally
    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcufft.so.7.5 locally
    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcuda.so.1 locally
    I tensorflow/stream_executor/dso_loader.cc:128] successfully opened CUDA library libcurand.so.7.5 locally
    Traceback (most recent call last):
      File "", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/shorttext/__init__.py", line 7, in <module>
        from . import utils
      File "/usr/local/lib/python2.7/dist-packages/shorttext/utils/__init__.py", line 3, in <module>
        from . import gensim_corpora
      File "/usr/local/lib/python2.7/dist-packages/shorttext/utils/gensim_corpora.py", line 2, in <module>
        from .textpreprocessing import spacy_tokenize as tokenize
      File "/usr/local/lib/python2.7/dist-packages/shorttext/utils/textpreprocessing.py", line 5, in <module>
        import spacy
      File "/usr/local/lib/python2.7/dist-packages/spacy/__init__.py", line 8, in <module>
        from . import en, de, zh, es, it, hu, fr, pt, nl, sv, fi, bn, he
      File "/usr/local/lib/python2.7/dist-packages/spacy/en/__init__.py", line 4, in <module>
        from ..language import Language
      File "/usr/local/lib/python2.7/dist-packages/spacy/language.py", line 12, in <module>
        from .syntax.parser import get_templates
    ImportError: dlopen: cannot load any more object with static TLS

    opened by kenyeung128 8
  • extend score to take an array of shorttext

    Currently, score takes only a single input, so the method is very slow when classifying thousands of examples. Is there a way to generate scores for 10K+ samples at the same time?

    opened by rja172 6
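
Short of a batched method inside the package, a caller-side workaround can be sketched as follows. It does not vectorize anything; it only scores each distinct text once, which helps when a large input contains many repeated short texts. The `score_many` helper and the `CountingClassifier` stand-in are hypothetical names for this example, and the only assumption about the real classifier is that it exposes `score(text)`.

```python
def score_many(classifier, texts):
    """Score a sequence of texts, calling classifier.score once per distinct text."""
    cache = {}
    results = []
    for text in texts:
        if text not in cache:
            cache[text] = classifier.score(text)
        results.append(cache[text])
    return results


class CountingClassifier:
    """Hypothetical stand-in that records how often score() is called."""

    def __init__(self):
        self.calls = 0

    def score(self, text):
        self.calls += 1
        return {"length": len(text)}


demo = CountingClassifier()
print(score_many(demo, ["apple", "pear", "apple"]), demo.calls)
```
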
  • Importing problem (not installation) over google colab

    I am experimenting with the library for the first time. The installation was successful and didn't need any extra steps. However, when I started importing the library, I got the following error related to keras:

    /usr/local/lib/python3.7/dist-packages/shorttext/generators/bow/AutoEncodingTopicModeling.py in <module>()
          8 from gensim.corpora import Dictionary
          9 from keras import Input
    ---> 10 from keras.engine import Model
         11 from keras.layers import Dense
         12 from scipy.spatial.distance import cosine

    ImportError: cannot import name 'Model' from 'keras.engine' (/usr/local/lib/python3.7/dist-packages/keras/engine/__init__.py)

    I tried to install keras separately, but there was no improvement. Any suggestions would be appreciated.

    opened by yomnamahmoud 6
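
Errors like this one usually mean that a name has moved between library versions. A generic fallback import can be sketched as below; `import_first` is a made-up helper, and whether `Model` is importable from `keras.models` in a given installation is an assumption to verify, not something the report above confirms.

```python
from importlib import import_module


def import_first(candidates):
    """Return the first importable attribute from a list of (module, attribute) pairs."""
    errors = []
    for module_name, attribute in candidates:
        try:
            return getattr(import_module(module_name), attribute)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_name}.{attribute}: {exc}")
    raise ImportError("none of the candidates could be imported: " + "; ".join(errors))


# Intended use (requires Keras, so it is left commented out here):
# Model = import_first([("keras.engine", "Model"), ("keras.models", "Model")])
```

This only helps in one's own code; an import inside an installed package still has to be fixed by installing compatible versions of the package and Keras.
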
  • RuntimeWarning: overflow encountered in exp2 topicmodeler.train

    Code:

    trainclassdict = shorttext.data.nihreports(sample_size=None)
    topicmodeler = shorttext.generators.LDAModeler()
    topicmodeler.train(trainclassdict, 128)

    Error message:

    /lib/python2.7/site-packages/gensim/models/ldamodel.py:535: RuntimeWarning: overflow encountered in exp2
      perwordbound, np.exp2(-perwordbound), len(chunk), corpus_words

    After that, the results vary between runs for topicmodeler.retrieve_topicvec('stem cell research')

    opened by dbonner 6
  • Remove negation terms from stopwords.txt

    I noticed that stopwords.txt includes negation terms such as "no" and "not". These terms reverse the meaning of a word or a sentence, so they should be preserved in the text data. For example, "not a good idea" would become "good idea" after stopword removal. Therefore, I recommend removing negation terms from the stopword list. Thanks!

    opened by star1327p 5
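
Until the list itself changes, the effect described above can be avoided on the caller's side by excluding negation terms from whatever stopword set is used. The sketch below uses a toy stopword set, not the package's stopwords.txt, and the names `NEGATIONS` and `remove_stopwords` are made up for this example.

```python
NEGATIONS = {"no", "not", "nor", "never"}


def remove_stopwords(tokens, stopwords, keep=NEGATIONS):
    """Drop stopwords from tokens, but never drop the terms listed in keep."""
    effective = set(stopwords) - set(keep)
    return [token for token in tokens if token.lower() not in effective]


toy_stopwords = {"a", "the", "is", "no", "not"}  # toy list for illustration only
print(remove_stopwords("not a good idea".split(), toy_stopwords))
# prints ['not', 'good', 'idea']
```
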
  • Input to shorttext.generators.LDAModeler()

    I was wondering what the input data format should be for:

    topicmodeler = shorttext.generators.LDAModeler()
    topicmodeler.train(data, 100)

    Can I feed it a pandas column, or should it be in dictionary format? If a dictionary, what should the keys be? I have a large set of tweets.

    opened by malizad 5
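
The other snippets on this page pass the result of `shorttext.data.nihreports(...)`, named `trainclassdict`, to `train`. Assuming that this is a dictionary mapping each class label to a list of short texts, labelled records can be reshaped into that form as sketched below. The helper name, the `label` and `text` keys, and the sample records are all hypothetical.

```python
from collections import defaultdict


def to_class_dict(records, label_key="label", text_key="text"):
    """Group texts by label: {label: [text, text, ...]}."""
    classdict = defaultdict(list)
    for record in records:
        classdict[record[label_key]].append(record[text_key])
    return dict(classdict)


tweets = [
    {"label": "sports", "text": "great match tonight"},
    {"label": "politics", "text": "new bill passed"},
    {"label": "sports", "text": "transfer window opens"},
]
print(to_class_dict(tweets))
```

For data held in pandas, `DataFrame.to_dict("records")` yields a list of dictionaries of this shape, which can then be passed to the helper.
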
  • from shorttext.classifiers import MaxEntClassifier is it regression?

    It seems that MaxEnt is a fancy word for regression, or do you have something special in your MaxEnt? See https://www.quora.com/What-is-the-relationship-between-Log-Linear-model-MaxEnt-model-and-Logistic-Regression or https://en.wikipedia.org/wiki/Multinomial_logistic_regression

    Multinomial logistic regression is known by a variety of other names, including polytomous LR,[2][3] multiclass LR, softmax regression, multinomial logit, the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model.[4]
    
    opened by Sandy4321 5
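
The equivalence quoted above can be made concrete with a small sketch of multinomial logistic regression at prediction time: a linear score per class followed by a softmax. This is a generic illustration in plain Python, not the package's MaxEntClassifier, and the weights shown are arbitrary.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def predict_proba(weights, biases, features):
    """weights holds one weight vector per class; features is one feature vector."""
    scores = [sum(w * x for w, x in zip(row, features)) + bias
              for row, bias in zip(weights, biases)]
    return softmax(scores)


# Two classes, two features, arbitrary parameters.
print(predict_proba([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [2.0, 2.0]))
```

Both classes receive the same linear score in this example, so each gets probability 0.5.
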
  • No Python 3.6 support with SciPy 1.6

    opened by Dobatymo 4
  • Data nihreports not available anymore

    Some datasets are not available anymore.

    For example the following: nihtraindata = shorttext.data.nihreports(sample_size=None)

    Error message:

    Downloading...
    Source:  http://storage.googleapis.com/pyshorttext/nih_grant_public/nih_full.csv.zip
    Failure to download file!
    (<class 'urllib.error.HTTPError'>, <HTTPError 404: 'Not Found'>, <traceback object at 0x7f09063ed788>)
    

    Python error:

    HTTPError: HTTP Error 404: Not Found
    
    During handling of the above exception, another exception occurred:
    

    When opening the link in a browser, the same error appears.

    opened by AlessandroVol23 4
Releases(1.5.8)
Owner
Kwan-Yuet "Stephen" Ho
quantitative research, machine learning, data science, text mining, physics