Textpipe: clean and extract metadata from text

Last update: Nov 21, 2022

Overview

textpipe is a Python package for converting raw text into clean, readable text and extracting metadata from it. It cleans raw text by removing HTML tags and other unreadable constructs, and it extracts metadata such as the number of words and the named entities in the text.
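To make the idea of "clean, then extract metadata" concrete, here is a minimal standard-library sketch. It is not textpipe's implementation: the names `clean_text` and `metadata` are hypothetical, and the word count is a plain whitespace split rather than anything spaCy-based.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_text(raw):
    """Drop tags and collapse runs of whitespace (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    # Joining with a space keeps words from adjacent elements apart.
    return " ".join(" ".join(parser.parts).split())


def metadata(raw):
    """Return the cleaned text together with a naive word count."""
    clean = clean_text(raw)
    return {"clean": clean, "nwords": len(clean.split())}


result = metadata("<p>Hello <b>world</b></p>")
```

Joining text nodes with a space is a simplification: a word split across inline tags would be counted as two words.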

Vision: the zen of textpipe

Designed for use in production pipelines without adult supervision.
Rechargeable batteries included: provide sane defaults and clear examples to adapt.
A uniform interface with thin wrappers around state-of-the-art NLP packages.
As language-agnostic as possible.
Bring your own models.

Features

Clean raw text by removing HTML and other unreadable constructs
Identify the language of text
Extract the number of words, the number of sentences, and the named entities from a text
Calculate the complexity of a text
Obtain text metadata by specifying a pipeline containing all desired elements
Obtain sentiment (polarity and a subjectivity score)
Generate word counts
Compute minhash for cheap similarity estimation of documents
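The minhash feature in the list above can be illustrated with a small standard-library sketch. This is an assumption-laden illustration, not textpipe's own code (the changelog lists datasketch as a dependency): the names `shingles`, `minhash_signature`, and `estimate_jaccard` are hypothetical, and salted BLAKE2b hashes stand in for random permutations.

```python
import hashlib


def shingles(text, k=2):
    """Split text into a set of overlapping k-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_perm=64):
    """Keep, for each of num_perm salted hash functions, the smallest hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per "permutation"
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox leaps over the lazy dog"
doc_c = "an entirely unrelated sentence about text pipelines"
sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))

similar = estimate_jaccard(sig_a, sig_b)    # near-duplicates share most shingles
unrelated = estimate_jaccard(sig_a, sig_c)  # no shingles in common
```

Comparing two fixed-length signatures is cheap regardless of document length, which is what makes the estimate useful for large collections.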

Installation

It is recommended that you install textpipe using a virtual environment.

First, create your virtual environment using venv, virtualenv, or virtualenvwrapper.
Using venv, if your default interpreter is Python 3.6:

python3 -m venv .venv

Using virtualenv:

virtualenv venv -p python3.6

Using virtualenvwrapper:

mkvirtualenv textpipe -p python3.6

Install textpipe using pip.

pip install textpipe

Install the required packages using requirements.txt.

pip install -r requirements.txt

A note on spaCy download model requirement

The requirements.txt file that comes with the package calls for spaCy's en_core_web_sm model, but you can change this depending on the model and language your use case requires. See the spaCy documentation on available models for more information.

Usage example

>>> from textpipe import doc, pipeline
>>> sample_text = 'Sample text! <!DOCTYPE>'
>>> document = doc.Doc(sample_text)
>>> print(document.clean)
Sample text!
>>> print(document.language)
en
>>> print(document.nwords)
2

>>> pipe = pipeline.Pipeline(['CleanText', 'NWords'])
>>> print(pipe(sample_text))
{'CleanText': 'Sample text!', 'NWords': 2}

To extend the existing textpipe operations with your own custom operations:

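from textpipe import pipeline

test_pipe = pipeline.Pipeline(['CleanText', 'NWords'])

# A custom operation takes the document plus optional context and settings
def custom_op(doc, context=None, settings=None, **kwargs):
    return 1

custom_argument = {'argument': 1}
# Register the operation under a name, then append it as a pipeline step
test_pipe.register_operation('CUSTOM_STEP', custom_op)
test_pipe.steps.append(('CUSTOM_STEP', custom_argument))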
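The registration pattern above can be sketched without textpipe installed. The `MiniPipeline` class below is a hypothetical stand-in, not textpipe's `Pipeline`: it only assumes what this README states, namely a name-to-function registry, steps that may carry an arguments dict, and a shared `context` keyword.

```python
class MiniPipeline:
    """Toy pipeline: a dict registry of operations plus an ordered list of steps."""

    def __init__(self, steps):
        # A step is either an operation name or a (name, settings) tuple.
        self.steps = list(steps)
        self.context = {}  # data shared across operations
        self.registry = {
            "NWords": lambda doc, context=None, settings=None, **kwargs: len(doc.split()),
        }

    def register_operation(self, name, func):
        """Add a custom operation to the registry under the given name."""
        self.registry[name] = func

    def __call__(self, text):
        results = {}
        for step in self.steps:
            name, settings = step if isinstance(step, tuple) else (step, {})
            operation = self.registry[name]
            results[name] = operation(text, context=self.context, settings=settings)
        return results


def shout(doc, context=None, settings=None, **kwargs):
    """Hypothetical custom operation that reads its argument from settings."""
    return doc.upper() * settings.get("repeat", 1)


pipe = MiniPipeline(["NWords"])
pipe.register_operation("Shout", shout)
pipe.steps.append(("Shout", {"repeat": 2}))
result = pipe("hello world")
```

Keeping the registry as a dict means a custom operation is looked up by name in the same way as a built-in one, so a step list can mix both freely.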

Contributing

See CONTRIBUTING for guidelines for contributors.

Changes

0.12.1

Bumps redis, tqdm, pylint

0.12.0

Bumps versions of many dependencies including textacy. Results for keyterm extraction changed.

0.11.9

Exposes arbitrary spaCy ents properties

0.11.8

Exposes spaCy's cats attribute

0.11.7

Bumps spaCy and redis versions

0.11.6

Fixes bug where gensim model is not cached in pipeline

0.11.5

Raises TextpipeMissingModelException instead of KeyError

0.11.4

Bumps spaCy and datasketch dependencies

0.11.1

Replaces codacy with pylint on CI
Fixes pylint issues

0.11.0

Adds wrapper around Gensim keyed vectors to construct document embeddings from Redis cache

0.9.0

Adds functionality to compute document embeddings using a Gensim word2vec model

0.8.6

Removes non-standard UTF characters before detecting language

0.8.5

Bumps spaCy to 2.1.3

0.8.4

Fix broken install command

0.8.3

Fix broken install command

0.8.2

Fix copy-paste error in word vector aggregation (#118)

0.8.1

Fixes bugs in several operations that didn't accept kwargs

0.8.0

Bumps spaCy to 2.1

0.7.2

Pins spaCy and Pattern versions (with pinned lxml)

0.7.0

Changes the operations registry from a list to a dict
Makes global pipeline data available across operations via the context kwarg
Loads custom operations using register_operation in the pipeline
Supports custom steps (operations) with arguments
