Natural language Understanding Toolkit

Requirements

To install nut you need:

  • Python 2.5 or 2.6
  • NumPy (>= 1.1)
  • Sparsesvd (>= 0.1.4) [1] (only required for CLSCL)

Installation

To clone the repository, run:

git clone git://github.com/pprett/nut.git

To build the extension modules in place, run:

python setup.py build_ext --inplace

Add the project to your Python path (adjust the path to wherever you cloned the repository):

export PYTHONPATH=$PYTHONPATH:$HOME/workspace/nut

Documentation

CLSCL

An implementation of Cross-Language Structural Correspondence Learning (CLSCL). See [Prettenhofer2010] for a detailed description and [Prettenhofer2011] for more experiments and enhancements.

The cross-language sentiment classification data used in the above studies is available at [2].

clscl_train

Training script for CLSCL. See ./clscl_train --help for further details.

Usage:

$ ./clscl_train en de cls-acl10-processed/en/books/train.processed cls-acl10-processed/en/books/unlabeled.processed cls-acl10-processed/de/books/unlabeled.processed cls-acl10-processed/dict/en_de_dict.txt model.bz2 --phi 30 --max-unlabeled=50000 -k 100 -m 450 --strategy=parallel

|V_S| = 64682
|V_T| = 106024
|V| = 170706
|s_train| = 2000
|s_unlabeled| = 50000
|t_unlabeled| = 50000
debug: DictTranslator contains 5012 translations.
mutualinformation took 5.624 sec
select_pivots took 7.197 sec
|pivots| = 450
create_inverted_index took 59.353 sec
Run joblib.Parallel
[Parallel(n_jobs=-1)]: Done   1 out of 450 |elapsed:    9.1s remaining: 67.8min
[Parallel(n_jobs=-1)]: Done   5 out of 450 |elapsed:   15.2s remaining: 22.6min
[..]
[Parallel(n_jobs=-1)]: Done 449 out of 450 |elapsed: 14.5min remaining:    1.9s
train_aux_classifiers took 881.803 sec
density: 0.1154
Ut.shape = (100,170706)
learn took 903.588 sec
project took 175.483 sec
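
The log above reports |V| = |V_S| + |V_T| and Ut.shape = (100,170706), which with -k 100 suggests one row per latent dimension and one column per vocabulary entry. As a rough illustration only, the following sketch applies such a k x |V| map to sparse bag-of-words documents. The function name project, the toy matrix theta, and the two toy documents are assumptions made for this example and are not taken from nut's code.

```python
def project(theta, doc):
    """Map a sparse document {feature_index: value} to len(theta) dense values.

    `theta` is a list of k rows with |V| weights each: a toy stand-in for
    the k x |V| matrix whose shape the training log reports as Ut.shape.
    """
    return [sum(row[j] * x for j, x in doc.items()) for row in theta]


# The joint vocabulary is the source vocabulary followed by the target one.
V_S, V_T = 64682, 106024
assert V_S + V_T == 170706

# Toy setting: k = 2, |V| = 4. Indices 0-1 play the role of source-language
# words, indices 2-3 the role of their target-language counterparts.
theta = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
]
source_doc = {0: 1.0, 1: 3.0}
target_doc = {2: 1.0, 3: 3.0}

print(project(theta, source_doc))
print(project(theta, target_doc))
```

In this toy setting both documents land on the same point in the 2-dimensional space, which is the intuition behind training a classifier on source-language data and applying it to target-language documents.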

Note

If you have access to a Hadoop cluster, you can use --strategy=hadoop to train the pivot classifiers even faster. Make sure, however, that the Hadoop nodes have Bolt (feature-mask branch) [3] installed.

clscl_predict

Prediction script for CLSCL.

Usage:

$ ./clscl_predict cls-acl10-processed/en/books/train.processed model.bz2 cls-acl10-processed/de/books/test.processed 0.01
|V_S| = 64682
|V_T| = 106024
|V| = 170706
load took 0.681 sec
load took 0.659 sec
classes = {negative,positive}
project took 2.498 sec
project took 2.716 sec
project took 2.275 sec
project took 2.492 sec
ACC: 83.05

Named-Entity Recognition

A simple greedy left-to-right sequence labeling approach to named entity recognition (NER).
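
To make "greedy left-to-right" concrete, here is a minimal sketch of such a decoder. It is not nut's implementation: greedy_decode, the score callback, and the toy scorer are hypothetical names introduced for illustration. The point is that each tag is chosen once, using only the input and the tags already predicted, and is never revised.

```python
def greedy_decode(tokens, score, labels):
    """Tag tokens left to right, committing to the best label at each step.

    `score(tokens, i, prev_tags, label)` is a hypothetical scoring function.
    It may inspect the whole input, but only the tags predicted so far.
    """
    tags = []
    for i in range(len(tokens)):
        # Pick the highest-scoring label given the history; earlier
        # decisions are never revisited (no beam search, no Viterbi).
        best = max(labels, key=lambda label: score(tokens, i, tags, label))
        tags.append(best)
    return tags


def toy_score(tokens, i, prev_tags, label):
    """Toy scorer: capitalised tokens are person names, the rest is O."""
    prev = prev_tags[-1] if prev_tags else "O"
    if not tokens[i][:1].isupper():
        return 1.0 if label == "O" else 0.0
    # A capitalised token continues an entity if the previous tag was one.
    wanted = "I-PER" if prev in ("B-PER", "I-PER") else "B-PER"
    return 1.0 if label == wanted else 0.0


LABELS = ["O", "B-PER", "I-PER"]
print(greedy_decode(["Peter", "Prettenhofer", "lives", "here"], toy_score, LABELS))
```

A real scorer would be a trained classifier over feature templates; the toy rule only shows how the previous prediction feeds into the next decision.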

pre-trained models

We provide pre-trained named entity recognizers for place, person, and organization names in English and German. To tag a sentence simply use:

>>> from nut.io import compressed_load
>>> from nut.util import WordTokenizer

>>> tagger = compressed_load("model_demo_en.bz2")
>>> tokenizer = WordTokenizer()
>>> tokens = tokenizer.tokenize("Peter Prettenhofer lives in Austria .")

>>> # see tagger.tag.__doc__ for input format
>>> sent = [((token, "", ""), "") for token in tokens]
>>> g = tagger.tag(sent)  # returns a generator over tags
>>> print(" ".join(["/".join(tt) for tt in zip(tokens, g)]))
Peter/B-PER Prettenhofer/I-PER lives/O in/O Austria/B-LOC ./O
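
The tagger output above uses B- and I- prefixes to mark where an entity begins and continues. If you need whole entities rather than per-token tags, the tags can be grouped into spans. The helper below is a sketch written for this README, not part of nut; the name iob2_to_spans is an assumption.

```python
def iob2_to_spans(tokens, tags):
    """Group per-token IOB2 tags into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "I" and etype == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # "B-" opens a new entity; a stray "I-" is treated the same way.
            current_type, current_tokens = etype, [token]
        else:
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = "Peter Prettenhofer lives in Austria .".split()
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(iob2_to_spans(tokens, tags))
```

For the example sentence this yields one PER span ("Peter Prettenhofer") and one LOC span ("Austria").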

You can also use the convenience demo script ner_demo.py:

$ python ner_demo.py model_en_v1.bz2

The feature detector modules for the pre-trained models are en_best_v1.py and de_best_v1.py; both can be found in the package nut.ner.features. In addition to baseline features (word presence, shape, prefixes and suffixes) they use distributional features (Brown clusters), non-local features (extended prediction history), and gazetteers (see [Ratinov2009]). The models have been trained on CoNLL03 [4]. Neither model uses syntactic features (e.g. part-of-speech tags, chunks) or word lemmas, which minimizes the required pre-processing. Both models provide state-of-the-art performance on the CoNLL03 shared task benchmark for English [Ratinov2009]:

processed 46435 tokens with 4946 phrases; found: 4864 phrases; correct: 4455.
accuracy:  98.01%; precision:  91.59%; recall:  90.07%; FB1:  90.83
              LOC: precision:  91.69%; recall:  90.53%; FB1:  91.11  1648
              ORG: precision:  87.36%; recall:  85.73%; FB1:  86.54  1630
              PER: precision:  95.84%; recall:  94.06%; FB1:  94.94  1586

and German [Faruqui2010]:

processed 51943 tokens with 2845 phrases; found: 2438 phrases; correct: 2168.
accuracy:  97.92%; precision:  88.93%; recall:  76.20%; FB1:  82.07
              LOC: precision:  87.67%; recall:  79.83%; FB1:  83.57  957
              ORG: precision:  82.62%; recall:  65.92%; FB1:  73.33  466
              PER: precision:  93.00%; recall:  78.02%; FB1:  84.85  1015
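
The overall precision, recall and FB1 values in the summaries above follow directly from the phrase counts on the first line of each report ("with N phrases; found: M phrases; correct: C"). The sketch below recomputes them; the function name prf is chosen for this example only.

```python
def prf(correct, found, gold):
    """Return (precision, recall, FB1) in percent from phrase counts."""
    precision = 100.0 * correct / found   # share of predicted phrases that are right
    recall = 100.0 * correct / gold       # share of gold phrases that were found
    fb1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, fb1


# Counts taken from the English and German reports above.
for name, correct, found, gold in [("English", 4455, 4864, 4946),
                                   ("German", 2168, 2438, 2845)]:
    p, r, f = prf(correct, found, gold)
    print("%s: precision %.2f%%; recall %.2f%%; FB1 %.2f" % (name, p, r, f))
```

Rounded to two decimals, the results match the overall figures printed in the two reports.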

To evaluate the German model on the out-of-domain data provided by [Faruqui2010], use the raw flag (-r) to write raw predictions (without B- and I- prefixes):

./ner_predict -r model_de_v1.bz2 clner/de/europarl/test.conll - | clner/scripts/conlleval -r
loading tagger... [done]
use_eph:  True
use_aso:  False
processed input in 40.9214s sec.
processed 110405 tokens with 2112 phrases; found: 2930 phrases; correct: 1676.
accuracy:  98.50%; precision:  57.20%; recall:  79.36%; FB1:  66.48
              LOC: precision:  91.47%; recall:  71.13%; FB1:  80.03  563
              ORG: precision:  43.63%; recall:  83.52%; FB1:  57.32  1673
              PER: precision:  62.10%; recall:  83.85%; FB1:  71.36  694

Note that the above results cannot be compared directly to the results of [Faruqui2010] since they use a slightly different setting (including the MISC entity type).
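
For reference, "raw" predictions in the sense used above are tags with the B- and I- prefixes removed. A minimal sketch of that conversion follows; to_raw is a name made up for this example, and the exact output format of ner_predict -r is not shown here.

```python
def to_raw(tag):
    """Drop an IOB2 'B-' or 'I-' prefix, leaving the bare entity type."""
    return tag[2:] if tag[:2] in ("B-", "I-") else tag


print([to_raw(t) for t in ["B-PER", "I-PER", "O", "B-LOC"]])
# prints ['PER', 'PER', 'O', 'LOC']
```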

ner_train

Training script for NER. See ./ner_train --help for further details.

To train a conditional Markov model with a greedy left-to-right decoder, the feature templates of [Ratinov2009], and extended prediction history (see [Ratinov2009]), use:

./ner_train clner/en/conll03/train.iob2 model_rr09.bz2 -f rr09 -r 0.00001 -E 100 --shuffle --eph
________________________________________________________________________________
Feature extraction

min count:  1
use eph:  True
build_vocabulary took 24.662 sec
feature_extraction took 25.626 sec
creating training examples... build_examples took 42.998 sec
[done]
________________________________________________________________________________
Training

num examples: 203621
num features: 553249
num classes: 9
classes:  ['I-LOC', 'B-ORG', 'O', 'B-PER', 'I-PER', 'I-MISC', 'B-MISC', 'I-ORG', 'B-LOC']
reg: 0.00001000
epochs: 100
9 models trained in 239.28 seconds.
train took 282.374 sec

ner_predict

You can use the prediction script to tag new sentences in CoNLL format and write the output to a file or to stdout. You can pipe the output directly to conlleval to assess the model performance:

./ner_predict model_rr09.bz2 clner/en/conll03/test.iob2 - | clner/scripts/conlleval
loading tagger... [done]
use_eph:  True
use_aso:  False
processed input in 11.2883s sec.
processed 46435 tokens with 5648 phrases; found: 5605 phrases; correct: 4799.
accuracy:  96.78%; precision:  85.62%; recall:  84.97%; FB1:  85.29
              LOC: precision:  87.29%; recall:  88.91%; FB1:  88.09  1699
             MISC: precision:  79.85%; recall:  75.64%; FB1:  77.69  665
              ORG: precision:  82.90%; recall:  78.81%; FB1:  80.80  1579
              PER: precision:  88.81%; recall:  91.28%; FB1:  90.03  1662
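
If you want to inspect CoNLL-formatted data in Python before or after running the scripts, a small reader is enough. The sketch below assumes the common layout: one token per line with whitespace-separated columns, the token in the first column and the tag in the last, blank lines between sentences, and optional -DOCSTART- marker lines. The function read_conll is written for this example and is not nut's own reader.

```python
def read_conll(lines):
    """Yield sentences as lists of (token, tag) pairs.

    Assumed layout: token in the first column, tag in the last column,
    blank lines (or -DOCSTART- lines) between sentences.
    """
    sentence = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "-DOCSTART-":
            # Sentence boundary: emit what has been collected so far.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append((fields[0], fields[-1]))
    if sentence:
        yield sentence


sample = """-DOCSTART- -X- O O

Peter NNP B-NP B-PER
lives VBZ B-VP O

Austria NNP B-NP B-LOC
""".splitlines()

for sent in read_conll(sample):
    print(sent)
```

To read from disk, pass an open file object instead of the sample list, since iterating over a file also yields lines.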

References

[1] http://pypi.python.org/pypi/sparsesvd/0.1.4
[2] http://www.webis.de/research/corpora/corpus-webis-cls-10/cls-acl10-processed.tar.gz
[3] https://github.com/pprett/bolt/tree/feature-mask
[4] For German we use the updated version of CoNLL03 by Sven Hartrumpf.
[Prettenhofer2010] Prettenhofer, P. and Stein, B., Cross-language text classification using structural correspondence learning. In Proceedings of ACL '10.
[Prettenhofer2011] Prettenhofer, P. and Stein, B., Cross-lingual adaptation using structural correspondence learning. ACM TIST (to appear).
[Ratinov2009] Ratinov, L. and Roth, D., Design challenges and misconceptions in named entity recognition. In Proceedings of CoNLL '09.
[Faruqui2010] Faruqui, M. and Padó, S., Training and evaluating a German named entity recognizer with semantic generalization. In Proceedings of KONVENS '10.

Developer Notes

  • If you copy a new version of bolt into the externals directory, make sure to run cython on the *.pyx files. Otherwise you will get a PickleError in multiprocessing.
Owner
Peter Prettenhofer