ChirpText is a collection of text processing tools for Python 3.

Overview

ChirpText is a collection of text processing tools for Python 3.

Documentation Status Total alerts Language grade: Python

It is not meant to be a powerful tank like the popular NTLK but a small package which you can pip-install anywhere and write a few lines of code to process textual data.

Main features

  • Simple file data manipulation using an enhanced open() function (txt, gz, binary, etc.)
  • CSV helper functions
  • Parse Japanese text with mecab library (Does not require mecab-python3 package even on Windows, only a binary release (i.e. mecab.exe) is required)
  • Built-in "lite" text annotation formats (texttaglib TTL/CSV and TTL/JSON)
  • Helper functions and useful data for processing English, Japanese, Chinese and Vietnamese.
  • Application configuration files management which can make educated guess about config files' whereabouts
  • Quick text-based report generation

Installation

chirptext is available on PyPI and can be installed using pip

pip install chirptext

Parsing Japanese text

chirptext supports parsing Japanese text using different parsers (mecab, Janome, and igo-python)

>> doc = deko.parse_doc("猫が好きです。\n犬も好きです。") >>> for sent in doc: ... print(sent, sent.tokens.values()) ... 猫が好きです。 ['猫', 'が', '好き', 'です', '。'] 犬も好きです。 ['犬', 'も', '好き', 'です', '。'] ">
>>> from chirptext import deko
>>> sent = deko.parse('猫が好きです。')
>>> sent.tokens
['`猫`<0:1>', '`が`<1:2>', '`好き`<2:4>', '`です`<4:6>', '`。`<6:7>']
>>> sent.tokens.values()
['猫', 'が', '好き', 'です', '。']
>>> sent[0]
`猫`<0:1>
>>> sent[0].pos
'名詞'
>>> sent[1].lemma
'が'
>>> sent[2].reading
'スキ'

# tokenize
>>> deko.tokenize('猫が好きです。')
['猫', 'が', '好き', 'です', '。']

# split sentences
>>> deko.tokenize_sent("猫が好きです。\n犬も好きです。")
['猫が好きです。', '犬も好きです。']

# parse a document (i.e. multiple sentences)
>>> doc = deko.parse_doc("猫が好きです。\n犬も好きです。")
>>> for sent in doc:
...     print(sent, sent.tokens.values())
... 
猫が好きです。 ['猫', 'が', '好き', 'です', '。']
犬も好きです。 ['犬', 'も', '好き', 'です', '。']

Notes: At least one of the following tools must be installed to use chirptext Japanese parsing:

  1. mecab: http://taku910.github.io/mecab/#download
  2. Janome: available on PyPI, install with pip install Janome
  3. igo-python: available on PyPI, install with pip install igo-python

Convenient IO APIs

>>> from chirptext import chio
>>> chio.write_tsv('data/test.tsv', [['a', 'b'], ['c', 'd']])
>>> chio.read_tsv('data/tes.tsv')
[['a', 'b'], ['c', 'd']]

>>> chio.write_file('data/content.tar.gz', 'Support writing to .tar.gz file')
>>> chio.read_file('data/content.tar.gz')
'Support writing to .tar.gz file'

>>> for row in chio.read_tsv_iter('data/test.tsv'):
...     print(row)
... 
['a', 'b']
['c', 'd']

Sample TextReport

# a string report
rp = TextReport()  # by default, TextReport will write to standard output, i.e. terminal
rp = TextReport(TextReport.STDOUT)  # same as above
rp = TextReport('~/tmp/my-report.txt')  # output to a file
rp = TextReport.null()  # ouptut to /dev/null, i.e. nowhere
rp = TextReport.string()  # output to a string. Call rp.content() to get the string
rp = TextReport(TextReport.STRINGIO)  # same as above

# TextReport will close the output stream automatically by using the with statement
with TextReport.string() as rp:
    rp.header("Lorem Ipsum Analysis", level="h0")
    rp.header("Raw", level="h1")
    rp.print(LOREM_IPSUM)
    rp.header("Top 5 most common letters")
    ct.summarise(report=rp, limit=5)
    print(rp.content())

Output

+---------------------------------------------------------------------------------- 
| Lorem Ipsum Analysis 
+---------------------------------------------------------------------------------- 
 
Raw 
------------------------------------------------------------ 
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. 
 
Top 5 most common letters
------------------------------------------------------------ 
i: 42 
e: 37 
t: 32 
o: 29 
a: 29 

Useful links

You might also like...
Paranoid text spacing in Python

pangu.py Paranoid text spacing for good readability, to automatically insert whitespace between CJK (Chinese, Japanese, Korean) and half-width charact

py-trans is a Free Python library for translate text into different languages.

Free Python library to translate text into different languages.

a python package that lets you add custom colors and text formatting to your scripts in a very easy way!
a python package that lets you add custom colors and text formatting to your scripts in a very easy way!

colormate Python script text formatting package What is colormate? colormate is a python library that lets you add text formatting to your scripts, it

Text Summarizationcls app with python

Text Summarizationcls app This is the repo for the Text Summarization AI Project. It makes use of pre-trained Hugging Face models Packages Used The pa

This is a text summarizing tool written in Python
This is a text summarizing tool written in Python

Summarize Written by: Ling Li Ya This is a text summarizing tool written in Python. User Guide Some things to note: The application is accessible here

Simple python program to auto credit your code, text, book, whatever!

Credit Simple python program to auto credit your code, text, book, whatever! Setup First change credit_text to whatever text you would like to credit

Parse Any Text With Python

ParseAnyText A small package to parse strings. What is the work of it? Well It's a module to creates parser that helps to parse a text easily with les

Adventura is an open source Python Text Adventure Engine

Adventura Adventura is an open source Python Text Adventure Engine, Not yet uplo

Skype export archive to text converter for python

Skype export archive to text converter This software utility extracts chat logs

Comments
  • Asking for a new release on PyPi

    Asking for a new release on PyPi

    Hi,

    Version 0.1a18 is a bit outdated, could you update a newer version to PyPi?

    I only need this commit, but since long time is passed i think most of the master change are stable.

    opened by matteofumagalli1275 1
  • Add CodeQL workflow for GitHub code scanning

    Add CodeQL workflow for GitHub code scanning

    Hi letuananh/chirptext!

    This is a one-off automatically generated pull request from LGTM.com :robot:. You might have heard that we’ve integrated LGTM’s underlying CodeQL analysis engine natively into GitHub. The result is GitHub code scanning!

    With LGTM fully integrated into code scanning, we are focused on improving CodeQL within the native GitHub code scanning experience. In order to take advantage of current and future improvements to our analysis capabilities, we suggest you enable code scanning on your repository. Please take a look at our blog post for more information.

    This pull request enables code scanning by adding an auto-generated codeql.yml workflow file for GitHub Actions to your repository — take a look! We tested it before opening this pull request, so all should be working :heavy_check_mark:. In fact, you might already have seen some alerts appear on this pull request!

    Where needed and if possible, we’ve adjusted the configuration to the needs of your particular repository. But of course, you should feel free to tweak it further! Check this page for detailed documentation.

    Questions? Check out the FAQ below!

    FAQ

    Click here to expand the FAQ section

    How often will the code scanning analysis run?

    By default, code scanning will trigger a scan with the CodeQL engine on the following events:

    • On every pull request — to flag up potential security problems for you to investigate before merging a PR.
    • On every push to your default branch and other protected branches — this keeps the analysis results on your repository’s Security tab up to date.
    • Once a week at a fixed time — to make sure you benefit from the latest updated security analysis even when no code was committed or PRs were opened.

    What will this cost?

    Nothing! The CodeQL engine will run inside GitHub Actions, making use of your unlimited free compute minutes for public repositories.

    What types of problems does CodeQL find?

    The CodeQL engine that powers GitHub code scanning is the exact same engine that powers LGTM.com. The exact set of rules has been tweaked slightly, but you should see almost exactly the same types of alerts as you were used to on LGTM.com: we’ve enabled the security-and-quality query suite for you.

    How do I upgrade my CodeQL engine?

    No need! New versions of the CodeQL analysis are constantly deployed on GitHub.com; your repository will automatically benefit from the most recently released version.

    The analysis doesn’t seem to be working

    If you get an error in GitHub Actions that indicates that CodeQL wasn’t able to analyze your code, please follow the instructions here to debug the analysis.

    How do I disable LGTM.com?

    If you have LGTM’s automatic pull request analysis enabled, then you can follow these steps to disable the LGTM pull request analysis. You don’t actually need to remove your repository from LGTM.com; it will automatically be removed in the next few months as part of the deprecation of LGTM.com (more info here).

    Which source code hosting platforms does code scanning support?

    GitHub code scanning is deeply integrated within GitHub itself. If you’d like to scan source code that is hosted elsewhere, we suggest that you create a mirror of that code on GitHub.

    How do I know this PR is legitimate?

    This PR is filed by the official LGTM.com GitHub App, in line with the deprecation timeline that was announced on the official GitHub Blog. The proposed GitHub Action workflow uses the official open source GitHub CodeQL Action. If you have any other questions or concerns, please join the discussion here in the official GitHub community!

    I have another question / how do I get in touch?

    Please join the discussion here to ask further questions and send us suggestions!

    opened by lgtm-com[bot] 0
  • Revamp TTL APIs for more complex usecases

    Revamp TTL APIs for more complex usecases

    • simplify multi-tag handling (i.e. sense candidates, chunk languages, annotators, etc.)
    • Built-in support for CoNLL
    • use first tag slot for scalar tags (i.e. POS, lemma, surface, languages)
    • Re-design TTL JSON
    opened by letuananh 3
  • Add support for Leipzig and Penn Treebank tagset

    Add support for Leipzig and Penn Treebank tagset

    • Leipzig
      • Reference: https://www.eva.mpg.de/lingua/resources/glossing-rules.php
    • Penn Treebank tag set
      • Version 1: https://www.sketchengine.eu/penn-treebank-tagset/
      • Version 2: https://www.sketchengine.eu/english-treetagger-pipeline-2
    enhancement 
    opened by letuananh 0
Releases(0.2a2)
  • 0.2a2(May 20, 2021)

    Changes

    • Added missing keyword arguments newline and encoding to chio.write_csv and chio.write_tsv
    • Updated test cases

    PyPI link: https://pypi.org/project/chirptext/0.2.a2/

    Source code(tar.gz)
    Source code(zip)
  • chirptext-0.1.2(May 20, 2021)

    chirptext 0.1.2 stable maintenance release for supporting texttaglib legacy APIs

    Changes:

    • [v0.1.2] Added missing keyword arguments newline and encoding to chio.write_csv and chio.write_tsv
    • [v0.1.2] Updated test cases

    PyPI link: https://pypi.org/project/chirptext/0.1.2/

    To use Japanese parsing with chirptext, see chirptext 0.1.1 stable release

    Source code(tar.gz)
    Source code(zip)
  • 0.2a1(May 17, 2021)

  • chirptext-0.1.1(May 17, 2021)

  • chirptext-0.1(May 13, 2021)

  • 0.1rc1(May 2, 2021)

  • 0.1a21(Apr 23, 2021)

  • 0.1a19(Jun 1, 2020)

    • Improved texttaglib (lite) module
      • Better TTL-JSON support
      • Standardized TTL access methods (find(), find_all() to get_tag(), get_tags())
    • Improved chirptext.sino module (Kangxi radical information)
    • Rename TextReport.file to TextReport.stream (more intuitive)
    • Show fewer mecab related warnings
    • Use Markdown for PyPI project README file
    Source code(tar.gz)
    Source code(zip)
  • 0.1a18(Jul 18, 2018)

  • 0.1a14(Apr 11, 2018)

    Deko can be used without mecab-python3 with this release.

    from chirptext import deko
    deko.set_mecab_bin("C:\\mecab\\bin\\mecab.exe")
    # Now we can use deko as usual
    sent = deko.txt2mecab("雨が降る。")
    print(sent.words)
    print(sent[0].pos)
    
    Source code(tar.gz)
    Source code(zip)
  • 0.1a11(Apr 2, 2018)

    • Added TxtWriter and TxtReader to texttaglib module for faster reading
    • Added DataObject to anhxa
    • Deko documents and sentences can be exported to TTL format
    • etc.
    Source code(tar.gz)
    Source code(zip)
  • 0.1a4(Feb 5, 2018)

    • Made WebHelper accept string as path to cache DB
    • Added WebHelper.fetch_json() method
    • Added some bug fixes
    • Added README file with some code samples
    Source code(tar.gz)
    Source code(zip)
  • 0.1a2(Jan 24, 2018)

Owner
Le Tuan Anh
computational linguist, semanticist, deeply interested in well-being, languages, and free software
Le Tuan Anh
A Python package to facilitate research on building and evaluating automated scoring models.

Rater Scoring Modeling Tool Introduction Automated scoring of written and spoken test responses is a growing field in educational natural language pro

ETS 59 Oct 10, 2022
Redlines produces a Markdown text showing the differences between two strings/text

Redlines Redlines produces a Markdown text showing the differences between two strings/text. The changes are represented with strike-throughs and unde

Houfu Ang 2 Apr 08, 2022
The project is investigating methods to extract human-marked data from document forms such as surveys and tests.

The project is investigating methods to extract human-marked data from document forms such as surveys and tests. They can read questions, multiple-choice exam papers, and grade.

Harry 5 Mar 27, 2022
基于Pytex的数学建模工具,实现将md文件转换成pdf/tex文档的前后端

Pytex-for-MCM 基于Pytex的数学建模工具,实现将md文件转换成pdf/tex文档的前后端。

3 May 17, 2021
🚩 A simple and clean python banner generator - Banners

🚩 A simple and clean python banner generator - Banners

Kumar Vicku 12 Oct 09, 2022
Shows twitch pay for any streamer from Twitch leaked CSV files.

twitch_leak_csv_reader Shows twitch pay for any streamer from Twitch leaked CSV files. Requirements: You need python3 (you can install python 3 from o

5 Nov 11, 2022
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.

MUSE stands for Multilingual Universal Sentence Encoder - multilingual extension (supports 16 languages) of Universal Sentence Encoder (USE).

Dani El-Ayyass 47 Sep 05, 2022
A slugifier that works in unicode

Unicode Slugify Unicode Slugify is a slugifier that generates unicode slugs. It was originally used in the Firefox Add-ons web site to generate slugs

Mozilla 315 Nov 21, 2022
This is an AI that is supposed to say you if your text is formal or not

This is an AI that is supposed to say you if your text is formal or not. It's written in Python 3 and has some german examples (because I'm german yk) in the text.json file. This file contains the te

1 Jan 12, 2022
The Levenshtein Python C extension module contains functions for fast computation of Levenshtein distance and string similarity

Contents Maintainer wanted Introduction Installation Documentation License History Source code Authors Maintainer wanted I am looking for a new mainta

Antti Haapala 1.2k Dec 16, 2022
一款高性能敏感词(非法词/脏字)检测过滤组件,附带繁体简体互换,支持全角半角互换,汉字转拼音,模糊搜索等功能。

一款高性能非法词(敏感词)检测组件,附带繁体简体互换,支持全角半角互换,获取拼音首字母,获取拼音字母,拼音模糊搜索等功能。

ToolGood 3.6k Jan 07, 2023
utoken is a multilingual tokenizer that divides text into words, punctuation and special tokens such as numbers, URLs, XML tags, email-addresses and hashtags.

utoken utoken is a multilingual tokenizer that divides text into words, punctuation and special tokens such as numbers, URLs, XML tags, email-addresse

Ulf Hermjakob 11 Jan 05, 2023
汉字转拼音(pypinyin)

汉字拼音转换工具(Python 版) 将汉字转为拼音。可以用于汉字注音、排序、检索(Russian translation) 。 基于 hotoo/pinyin 开发。 Documentation: http://pypinyin.rtfd.io/ GitHub: https://github.co

Huang Huang 4.2k Jan 03, 2023
Making simplex testing clean and simple

Making Simplex Project Testing - Clean and Simple What does this repo do? It organizes the python stack for the coding project What do I need to do in

Mohit Mahajan 1 Jan 30, 2022
Search for terms(word / table / field name or any) under Snowflake schema names

snowflake-search-terms-in-ddl-views Search for terms(word / table / field name or any) under Snowflake schema names Version : 1.0v How to use ? Run th

Igal Emona 1 Dec 15, 2021
🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️

🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️

Brandon 5.6k Jan 03, 2023
Wordle strategy: Find frequency of letters appearing in 5-letter words in the English language

Find frequency of letters appearing in 5-letter words in the English language In

Gabriel Apolinário 1 Jan 17, 2022
Map Reduce Wordcount in Python using gRPC

This project is implemented in Python using gRPC. The input files are given in .txt format and the word count operation is performed.

Divija 4 Dec 05, 2022
A minimal python script for generating multiple onetime use bip39 seed phrases

seed_signer_ontimes WARNING This project has mainly been used for local development, and creation should be ran on a air-gapped machine. A minimal pyt

CypherToad 4 Sep 12, 2022
Make writing easier!

Handwriter Make writing easier! How to Download and install a handwriting font, or create a font from your handwriting. Use a word processor like Micr

64 Dec 25, 2022