The Classical Language Toolkit

Last update: Jan 09, 2023

Overview

Notice: This Git branch (dev) contains the CLTK's upcoming major release (v. 1.0.0). See https://github.com/cltk/cltk/tree/master and https://docs.cltk.org/ for the legacy code and docs.

The Classical Language Toolkit (CLTK) is a Python library offering natural language processing (NLP) for the languages of pre-modern Eurasia.

Installation

For the CLTK's latest pre-release version:

$ pip install --pre cltk

Requirements:

Python version 3.7 or above
A Unix-like OS
To install from source, see Development in the docs.

Documentation

Documentation at https://dev.cltk.org.

Citation

@Misc{johnsonetal2014,
 author = {Johnson, Kyle P. and Patrick Burns and John Stewart and Todd Cook},
 title = {CLTK: The Classical Language Toolkit},
 url = {https://github.com/cltk/cltk},
 year = {2014--2020},
}

License

Comments

Add Sanskrit stopwords

For @Akhilesh28. (Please assign this to yourself.)

In Sanskrit, a stopword list would include, at least: pronouns and determiners (source), and upasarga (verbal prefix / "preverb" / "preposition") and nipāta (particle) (which I read about here). Also add anything like conjunctions, particles, and interjections.

Just putting together this list shouldn't take more than one week. Let us know if you're having problems. You can post your stopwords here, first, as a "gist": https://gist.github.com/

opened by kylepjohnson 55
Add IPA Phonetic Transcription for Greek
This ticket is for Jack Duff, with @jtauber generously assisting.

The basic idea is to make a map of Greek letters and their IPA equivalents, something like:

{'α': 'a', 'αι': 'ai', 'ζ': 'zd', 'θ': 'tʰ'}

Obviously, it won't all be so easy, due to proximal characters changing pronunciation (for example, "γ" being IPA "ɡ" but before ["κ", "χ", "γ", "μ"] becoming "ŋ").

If you can get this down for Attic, then consider moving on to other dialects, like Ionic or Koine.

Within the CLTK's architecture, the transliteration maps and logic should go into something like cltk/phonetics/greek/transcription.py. Or consider making a general transcription entry point at cltk/phonetics/transcription.py and then declaring the language and dialect. I'll leave the implementation details to you two, though.
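A rough sketch of the map-plus-context-rule idea described in this ticket follows. It is not CLTK code: the function name transcribe, the NASALIZING set, and the extra map entries for γ, κ, χ, and μ are assumptions added only so the example runs.

```python
import unicodedata

# Illustrative map only. The first four entries come from the ticket;
# the entries for γ, κ, χ and μ are assumptions added for this sketch.
IPA_MAP = {
    "αι": "ai", "α": "a", "ζ": "zd", "θ": "tʰ",
    "γ": "ɡ", "κ": "k", "χ": "kʰ", "μ": "m",
}

# Letters before which "γ" is written as a velar nasal, per the ticket text.
NASALIZING = {"κ", "χ", "γ", "μ"}


def transcribe(word: str) -> str:
    """Hypothetical helper: map Greek letters to IPA with one context rule."""
    word = unicodedata.normalize("NFC", word.lower())
    out = []
    i = 0
    while i < len(word):
        # Context rule: gamma before κ, χ, γ or μ becomes "ŋ".
        if word[i] == "γ" and i + 1 < len(word) and word[i + 1] in NASALIZING:
            out.append("ŋ")
            i += 1
            continue
        # Digraphs such as "αι" are tried before single letters.
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in IPA_MAP:
            out.append(IPA_MAP[pair])
            i += 2
            continue
        # Unknown characters pass through unchanged.
        out.append(IPA_MAP.get(word[i], word[i]))
        i += 1
    return "".join(out)
```

A dialect argument could later select a different map and rule set, which is one way to get the general entry point mentioned above.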
enhancement
opened by kylepjohnson 51
Words to be added in Sanskrit's Stop Word Collection
~सः (He)~

~स्वयम्(himself)~

तदीय(theres) -आसम्(be)

ज्ञा (have) -परि (with) -शक्नोति(can(verb)) -यद्(if) -कतम(which)

Add all the words in all their different cases, genders, and all three numbers (singular, dual, plural). If you are doing it right, there must be exactly 72 words for each entity (including a few repetitions). Be careful with verb forms; they have entirely different structures.
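One way to read the figure of 72 is three genders times three numbers times the eight Sanskrit cases. The sketch below only enumerates those slots. It does not generate any Sanskrit forms, and the names are hypothetical.

```python
from itertools import product

# Assumption: 72 = 3 genders x 3 numbers x 8 cases.
GENDERS = ("masculine", "feminine", "neuter")
NUMBERS = ("singular", "dual", "plural")
CASES = (
    "nominative", "accusative", "instrumental", "dative",
    "ablative", "genitive", "locative", "vocative",
)


def paradigm_slots():
    """Return every (gender, number, case) slot a declined stopword may fill."""
    return list(product(GENDERS, NUMBERS, CASES))


slots = paradigm_slots()
print(len(slots))  # 72
```

A checklist like this could be used to confirm that no slot was skipped when filling in the forms by hand.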

File at: https://github.com/cltk/cltk/blob/master/cltk/stop/sanskrit/stops.py
opened by nikheelpandey 42
Scraping srimad-bhagavadgita and valmiki ramayana.
I am scraping Sanskrit-English data from

Srimad-bhagavadgita : http://www.gitasupersite.iitk.ac.in/srimad

Valmiki Ramayana : http://www.valmiki.iitk.ac.in/

Ping @kylepjohnson
new corpus
opened by ghost 36
Add corpus for classical telugu

https://te.wikisource.org/wiki contains the Classical Telugu ithihasas, puranas, vedas, stothras, etc., so I would like to scrape them and add them as a new corpus.

Thank you.
new corpus

opened by ghost 31
Make stopwords list for Old English

To generalize, I observe that there are different approaches to making stopword lists, based either on statistics (most common words, variously calculated) or grammar (definite and indefinite articles, pronouns, etc.) (or some combination).

In doing this ticket, I would like you to do a little research on whether there exist any good lists for OE. If there is one, let's just take it. If not, we can do a little more research about what's right.
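As a small sketch of the statistical approach mentioned above, the following counts word frequencies with the standard library and keeps the most common words. The function names and the sample lines are made up for illustration; a real list would need a proper Old English corpus and tokenizer.

```python
import re
from collections import Counter


def frequency_stopwords(texts, top_n=10):
    """Return the top_n most frequent lowercase word forms in texts."""
    counts = Counter()
    for text in texts:
        # \w+ matches Unicode letters, so þ, ð and æ are kept inside words.
        counts.update(re.findall(r"\w+", text.lower()))
    return [word for word, _ in counts.most_common(top_n)]


def combine(statistical, grammatical):
    """Merge a frequency-based list with a hand-made grammatical list."""
    return sorted(set(statistical) | set(grammatical))


sample = ["se cyning and se biscop", "and se here"]
print(frequency_stopwords(sample, top_n=2))  # ['se', 'and']
```

The combine step reflects the "some combination" option: frequency picks up common function words, and a hand-made list covers rarer pronouns or articles.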
enhancement easy

opened by kylepjohnson 29
Scraping Raw Classical Hindi Data

I am scraping Raw Classical Hindi Data from http://ltrc.iiit.ac.in/showfile.php?filename=downloads/Classical_Hindi_Literature/SHUSHA/index.html @kylepjohnson
new corpus

opened by Akirato 29
Add declining tool based on Collatinus and Eulexis?
Hi there. I have been thinking about this for months, and I do not think the CLTK contains anything like it. Collatinus and Eulexis are two open-source lemmatizers and decliners (their data is either open or easy to reconstruct, and they are a nice bunch of people).

Collatinus is in C

https://github.com/biblissima/collatinus is the most up to date source code for the flexer / lemmatizer

https://github.com/ycollatin/Collatinus-data is the repo for their data (but not up to date I guess ). It seems this is more up to date.

Eulexis is in php

https://github.com/biblissima/eulexis/blob/master/traitement.php For the whole code

I'd be happy to convert the Collatinus flexer for the CLTK in the long run (give or take a few months), but I think Eulexis and the lemmatizer part are out of my scope right now.

What's your opinion on this? It would help search APIs a lot for texts that are not lemmatized.
opened by PonteIneptique 28
Normalize Unicode throughout CLTK
I've been reading about normalize() and hope it will prevent normalization problems in the future. This function, from the standard library's unicodedata module, solves the problem of accented characters made with combining diacritics not comparing equal to precomposed characters. Examples of this appear in the testing library, where I have struggled to make two strings of accented Greek equal one another.

Example of normalize() from Fluent Python by Luciano Ramalho (117-118):

>>> from unicodedata import normalize
>>> s1 = 'café'  # composed "e" with acute accent
>>> s2 = 'cafe\u0301'  # decomposed "e" and acute accent
>>> len(s1), len(s2)
(4, 5)
>>> len(normalize('NFC', s1)), len(normalize('NFC', s2))
(4, 4)
>>> len(normalize('NFD', s1)), len(normalize('NFD', s2))
(5, 5)
>>> normalize('NFC', s1) == normalize('NFC', s2)
True
>>> normalize('NFD', s1) == normalize('NFD', s2)
True

Solutions

In core, use normalize with the argument 'NFC', as Fluent Python recommends. Not all Greek combining forms may reduce into precomposed … will need to be tested out.

In tests, especially for assertEqual(), check that more complicated strings equal one another. Use normalize('NFC', <text>) on the comparison strings, too, if necessary.

Use this to strip out accented characters coming from the PHI, which I don't do very gracefully here: https://github.com/kylepjohnson/cltk/blob/master/cltk/corpus/utils/formatter.py#L94

Docs: https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
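The solutions listed above could be sketched as two small helpers. This is only an illustration: nfc_equal and strip_diacritics are hypothetical names, not existing CLTK functions, and whether stripping every combining mark suits the PHI texts is an assumption that would need testing.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def strip_diacritics(text: str) -> str:
    """Remove combining marks (accents, breathings) from text."""
    # NFD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


print(nfc_equal("caf\u00e9", "cafe\u0301"))  # True
print(strip_diacritics("caf\u00e9"))         # cafe
```

In tests, nfc_equal could stand in for a direct comparison inside assertTrue(), or both sides could be normalized before assertEqual().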
enhancement
opened by kylepjohnson 25
add Latin WordNet API

The Latin WordNet API mimics the NLTK Princeton WordNet API in all major respects; however, because the data is sourced from latinwordnet.exeter.ac.uk (rather than locally), a number of under-the-hood changes were made. Many access methods now return generators rather than lists, and in general the API is now 'lazy' where multiple HTTP requests would cause a bottleneck. The Resnik, Jiang-Conrath, and Lin similarity scoring functions work, but require availability of a corpus-based information content file (forthcoming).

opened by wmshort 24
Write syllabifiers for Indian languages
This ticket is for @soumyag213

As discussed by email, you'll port this and related modules, to the CLTK, from the Indic NLP Library.

For a first step, I'd like to see this working in your own repo, which you have started at: https://github.com/soumyag213/cltk-beginning-indo. In the README for this, I would like to see an example of its API. For example, I imagine you showing something like this in the Python shell (BTW, I like IPython):

In [1]: from indic_syllabifier import orthographic_syllabify
In [2]: orthographic_syllabify('supercalifragilisticexpialidocious', 'tamil')
Out[2]: 'su-per-cal-i-fra-gil-ist-ic-ex-pi-al-i-doc-ious'
enhancement
opened by kylepjohnson 24

Processing text with square brackets using the Latin NLP pipeline

I noticed an anomaly processing Latin text with the default pipeline. The tokenizer fails to separate square brackets from the words they enclose.

text = 'Benedictus XVI [Iosephus Aloisius Ratzinger] fuit papa et episcopus Romanus.'

from cltk import NLP

cltk_nlp = NLP('lat')
cltk_nlp.analyze(text).tokens

Result:

['Benedictus', 'XVI', '[Iosephus', 'Aloisius', 'Ratzinger]', 'fuit', 'papa', 'et', 'episcopus', 'Romanus', '.']

The problem does not occur when the LatinWordTokenizer is used.

from cltk.tokenizers.lat.lat import LatinWordTokenizer

tokenizer = LatinWordTokenizer()
tokenizer.tokenize(text)

Result:

['Benedictus', 'XVI', '[', 'Iosephus', 'Aloisius', 'Ratzinger', ']', 'fuit', 'papa', 'et', 'episcopus', 'Romanus', '.']

Environment: Windows 10 + python 3.9.13 + cltk 1.1.6.
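Until the pipeline tokenizer handles this case, one possible workaround is to pad square brackets with spaces before analysis. The sketch below uses only the standard library; pad_brackets is a hypothetical helper, and it assumes that collapsing runs of whitespace (including newlines) into single spaces is acceptable for the text at hand.

```python
import re

# Matches either an opening or a closing square bracket.
BRACKETS = re.compile(r"([\[\]])")


def pad_brackets(text: str) -> str:
    """Surround square brackets with spaces so they become separate tokens."""
    padded = BRACKETS.sub(r" \1 ", text)
    # Collapse the extra whitespace introduced by the padding.
    return re.sub(r"\s+", " ", padded).strip()


text = "Benedictus XVI [Iosephus Aloisius Ratzinger] fuit papa et episcopus Romanus."
print(pad_brackets(text))
# Benedictus XVI [ Iosephus Aloisius Ratzinger ] fuit papa et episcopus Romanus.
```

The padded string can then be passed to the pipeline in place of the original text.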

bug

opened by DavideMassidda 0

SpaCy process

I added the spaCy process with a custom wrapper to translate Token from spacy to Word in cltk. The aim is to be able to use trained models provided by spaCy with CLTK.

opened by clemsciences 0
A way to tell what tokens `LatinBackOffLemmatizer()` has failed to lemmatize

In LatinBackOffLemmatizer() and the lemmatizers in its chain I can't seem to find an option to return an empty value (such as in OldEnglishDictionaryLemmatizer()'s best_guess=False option), instead of returning the input value, when the lemmatizer fails to assign a lemma.

Without such an option, it doesn't seem possible to tell successful from unsuccessful lemmatization attempts programmatically, severely limiting the range of the lemmatizer's applications.
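As a stopgap, a post-processing step could flag tokens whose lemma is identical to the input. The sketch below assumes the lemmatizer output is a list of (token, lemma) pairs. It is only a heuristic, since a lemma can legitimately equal its token, and both mark_unlemmatized and the known_identical allow-list are hypothetical.

```python
def mark_unlemmatized(pairs, known_identical=frozenset()):
    """Replace the lemma with None when it merely echoes the token.

    pairs: iterable of (token, lemma) tuples (assumed output shape).
    known_identical: tokens whose lemma really is the same form.
    """
    result = []
    for token, lemma in pairs:
        if lemma == token and token not in known_identical:
            # Probably a fall-through: the lemmatizer returned its input.
            result.append((token, None))
        else:
            result.append((token, lemma))
    return result


pairs = [("et", "et"), ("canit", "cano"), ("xyzzy", "xyzzy")]
print(mark_unlemmatized(pairs, known_identical={"et"}))
# [('et', 'et'), ('canit', 'cano'), ('xyzzy', None)]
```

This cannot distinguish every case, which is why an option inside the lemmatizer itself, as requested above, would be the more reliable fix.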
question acknowledged feature-request

opened by langeslag 6
Bump certifi from 2022.5.18.1 to 2022.12.7
Bumps certifi from 2022.5.18.1 to 2022.12.7.

Commits

9e9e840 2022.12.07

b81bdb2 2022.09.24

939a28f 2022.09.14

aca828a 2022.06.15.2

de0eae1 Only use importlib.resources's new files() / Traversable API on Python ≥3.11 ...

b8eb5e9 2022.06.15.1

47fb7ab Fix deprecation warning on Python 3.11 (#199)

b0b48e0 fixes #198 -- update link in license

9d514b4 2022.06.15

4151e88 Add py.typed to MANIFEST.in to package in sdist (#196)


dependencies
opened by dependabot[bot] 0
Unicode issue with Greek accented vowels in prosody
Unicode has two code points for each acute-accented vowel, one in the Greek and Coptic block and one in the Greek Extended block (for omicron they are U+03CC and U+1F79). The list of accented vowels only takes into account the acute accents in the Greek and Coptic block, so some vowels are not scanned properly.

>>> from cltk.prosody.grc import Scansion
>>> text_string = "πότνια, θῦμον"
>>> Scansion()._make_syllables(text_string)
[[['πότνι', 'α'], ['θῦ', 'μον']]]

Expected behavior

>>> from cltk.prosody.grc import Scansion
>>> text_string = "πότνια, θῦμον"
>>> Scansion()._make_syllables(text_string)
[[['πο', 'τνι', 'α'], ['θῦ', 'μον']]]

Desktop

MacOS 13.0
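One possible fix, sketched here under the assumption that normalizing the input is acceptable to the scansion code, is to run the text through NFC first. Unicode defines the Greek Extended "oxia" vowels as canonically equivalent to the Greek and Coptic "tonos" vowels, so NFC maps U+1F79 to U+03CC. The name unify_acute is hypothetical.

```python
import unicodedata

OXIA_OMICRON = "\u1f79"   # omicron with oxia, Greek Extended block
TONOS_OMICRON = "\u03cc"  # omicron with tonos, Greek and Coptic block


def unify_acute(text: str) -> str:
    """Map Greek Extended acute vowels onto their Greek and Coptic twins."""
    return unicodedata.normalize("NFC", text)


print(unify_acute(OXIA_OMICRON) == TONOS_OMICRON)  # True
```

With this step applied before syllabification, a vowel list that only covers the Greek and Coptic block would match both encodings.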

bug
opened by JoshuaCCampbell 1
Latin enclitic tokenizer broken?

The Latin tokenizer does not separate the enclitics -que, -ne, and -ve. In line 147 of tokenizers/lat/lat.py I suggest:

specific_tokens += [token[: -len(enclitic)]] + ["-" + enclitic]

This fixed it for me.

Mac OS 15.7 Python 3.9
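The suggested line can be tried out in isolation with a standalone sketch like the one below. It is not the code from tokenizers/lat/lat.py: split_enclitic and the tiny EXCEPTIONS set are assumptions for illustration, and a real tokenizer needs a much longer exception list (a word such as "bene" would otherwise be split).

```python
ENCLITICS = ("que", "ne", "ve")
# Illustrative only: words that end in an enclitic-like string but are not split.
EXCEPTIONS = frozenset({"atque", "neque", "quoque"})


def split_enclitic(token: str):
    """Split a trailing enclitic off token, marking it with a hyphen."""
    lowered = token.lower()
    if lowered in EXCEPTIONS:
        return [token]
    for enclitic in ENCLITICS:
        # Require some stem to remain, so a bare "que" is left alone.
        if lowered.endswith(enclitic) and len(token) > len(enclitic):
            return [token[: -len(enclitic)], "-" + enclitic]
    return [token]


print(split_enclitic("virumque"))  # ['virum', '-que']
print(split_enclitic("atque"))     # ['atque']
```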
bug

opened by polycrates 3

Releases(1.0.15)

1.0.15(Jun 10, 2021)

CLTK release version 1.0.15 triggered on 10/06/2021 at 16:34:40.
1.0.14(May 21, 2021)

CLTK release version 1.0.14 triggered on 21/05/2021 at 17:15:17.
1.0.13(May 21, 2021)

CLTK release version 1.0.13 triggered on 21/05/2021 at 16:27:34.
1.0.12(Apr 30, 2021)

CLTK release version 1.0.12 triggered on 30/04/2021 at 15:16:53.
1.0.11(Apr 13, 2021)

CLTK release version 1.0.11 triggered on 13/04/2021 at 02:45:22.
1.0.10(Mar 30, 2021)

CLTK release version 1.0.10 triggered on 30/03/2021 at 16:18:04.
1.0.9(Mar 28, 2021)

CLTK release version 1.0.9 triggered on 28/03/2021 at 15:47:52.
1.0.8(Mar 26, 2021)

CLTK release version 1.0.8 triggered on 26/03/2021 at 02:35:17.
1.0.7(Mar 22, 2021)

CLTK release version 1.0.7 triggered on 22/03/2021 at 01:32:18.
1.0.6(Mar 19, 2021)

CLTK release version 1.0.6 triggered on 19/03/2021 at 05:19:56.
1.0.5(Mar 6, 2021)

CLTK release version 1.0.5 triggered on 06/03/2021 at 18:28:23.
1.0.4(Mar 4, 2021)

CLTK release version 1.0.4 triggered on 04/03/2021 at 22:42:07.
1.0.3(Mar 4, 2021)

CLTK release version 1.0.3 triggered on 04/03/2021 at 22:13:35.
1.0.1(Mar 4, 2021)

CLTK release version 1.0.1 triggered on 04/03/2021 at 22:00:06.
1.0.0b9(Feb 27, 2021)

CLTK release version 1.0.0b9 triggered on 27/02/2021 at 02:49:38.
1.0.0b10(Feb 27, 2021)

CLTK release version 1.0.0b10 triggered on 27/02/2021 at 06:10:40.
1.0.0b8(Feb 24, 2021)

CLTK release version 1.0.0b8 triggered on 24/02/2021 at 16:38:15.
1.0.0b7(Feb 21, 2021)

CLTK release version 1.0.0b7 triggered on 21/02/2021 at 08:42:14.
1.0.0b6(Feb 14, 2021)

CLTK release version 1.0.0b6 triggered on 14/02/2021 at 22:47:49.
1.0.0b5(Feb 14, 2021)

CLTK release version 1.0.0b5 triggered on 14/02/2021 at 19:35:48.
1.0.0b4(Feb 14, 2021)

CLTK release version 1.0.0b4 triggered on 14/02/2021 at 06:25:47.
1.0.0b3(Feb 8, 2021)

CLTK release version 1.0.0b3 triggered on 08/02/2021 at 03:08:24.
1.0.1b2(Feb 7, 2021)

CLTK release version 1.0.1b2 triggered on 07/02/2021 at 20:32:48.
v0.1.111(Sep 19, 2019)

The main purpose of this release is to mark the inclusion of Tyler Kirby ( @TylerKirby ) and Tom Keeline's code commits for their research into prose style ( #927 ; https://github.com/cltk/cltk/pull/927 ).

Other code commits since the previous release are here: https://github.com/cltk/cltk/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aclosed+created%3A2018-10-28..2019-09-18+
v0.1.99(Nov 29, 2018)

Triggering a new DOI release version, since it has been over a year.

There have been many updates since the previous "release" (in GitHub's terminology), so it includes a large amount of new functionality.
untagged-b4997d46ce5d4c3d468c(Oct 18, 2017)

untagged-7cf6c55ce9051e332613(Oct 18, 2017)

untagged-5ecbb01a1f820e2b1739(Oct 18, 2017)

v0.1.64(Sep 1, 2017)

This release contains lots of support for Old and Middle French from @nat1881 , sponsored by the Google Summer of Code and mentored by @diyclassics and @mlj

From PR #571 #574 #575
v0.1.63(Aug 29, 2017)

Adding new Latin prosody functionality from @todd-cook 's PR #573

Owner

Classical Language Toolkit

Natural language processing for Classical languages

GitHub Repository http://cltk.org

