🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out-of-the-box.

Last update: Jan 06, 2023

Overview

pySBD: Python Sentence Boundary Disambiguation (SBD)

pySBD - python Sentence Boundary Disambiguation (SBD) - is a rule-based sentence boundary detection module that works out-of-the-box.

This project is a direct port of the Ruby gem Pragmatic Segmenter, which provides rule-based sentence boundary detection.

Highlights

The short research paper 'PySBD: Pragmatic Sentence Boundary Disambiguation' was accepted to the 2nd Workshop for Natural Language Processing Open Source Software (NLP-OSS) at EMNLP 2020.

Research Paper:

https://arxiv.org/abs/2010.09657

Install

Python

pip install pysbd

Usage

Currently pySBD supports 22 languages.

import pysbd
text = "My name is Jonas E. Smith. Please turn to p. 55."
seg = pysbd.Segmenter(language="en", clean=False)
print(seg.segment(text))
# ['My name is Jonas E. Smith.', 'Please turn to p. 55.']

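pySBD can also be used as a spaCy pipeline component (recommended); see the example script pysbd_as_spacy_component.py. Alternatively, use pysbd through entry points:

# Note: this snippet uses the spaCy 2.x API. spaCy 3 expects nlp.add_pipe to
# receive the string name of a registered component instead of a callable.
import spacy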
from pysbd.utils import PySBDFactory

nlp = spacy.blank('en')

# explicitly adding component to pipeline
# (recommended - makes it more readable to tell what's going on)
nlp.add_pipe(PySBDFactory(nlp))

# or you can use it implicitly with keyword
# pysbd = nlp.create_pipe('pysbd')
# nlp.add_pipe(pysbd)

doc = nlp('My name is Jonas E. Smith. Please turn to p. 55.')
print(list(doc.sents))
# [My name is Jonas E. Smith., Please turn to p. 55.]

Contributing

If you want to contribute a new feature or language support, or you have found text that pySBD segments incorrectly, please read CONTRIBUTING.md and then follow these steps:

Fork it ( https://github.com/nipunsadvilkar/pySBD/fork )
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create a new Pull Request

Citation

If you use the pysbd package in your projects or research, please cite PySBD: Pragmatic Sentence Boundary Disambiguation.

@inproceedings{sadvilkar-neumann-2020-pysbd,
    title = "{P}y{SBD}: Pragmatic Sentence Boundary Disambiguation",
    author = "Sadvilkar, Nipun  and
      Neumann, Mark",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.15",
    pages = "110--114",
    abstract = "We present a rule-based sentence boundary disambiguation Python package that works out-of-the-box for 22 languages. We aim to provide a realistic segmenter which can provide logical sentences even when the format and domain of the input text is unknown. In our work, we adapt the Golden Rules Set (a language specific set of sentence boundary exemplars) originally implemented as a ruby gem pragmatic segmenter which we ported to Python with additional improvements and functionality. PySBD passes 97.92{\%} of the Golden Rule Set examplars for English, an improvement of 25{\%} over the next best open source Python tool.",
}

Credit

This project wouldn't be possible without the great work done by the Pragmatic Segmenter team.

Comments

Question marks at the end swallowed
It looks like the example with just question marks is fine now:

>>> segmenter.segment("??")
['??']

but the example with double question marks as a token at the end of a sentence still loses the question marks:

>>> segmenter.segment("T stands for the vector transposition. As shown in Fig. ??")
['T stands for the vector transposition.', 'As shown in Fig.']

This looks like the minimal repro:

>>> segmenter.segment("Fig. ??")
['Fig.']
bug edge-cases
opened by dakinggg 11
Pysbd just hangs🐛

Describe the bug
The process hangs.

To Reproduce
Input text: f.302205302116302416302500302513915bd

import pysbd

flat = "f.302205302116302416302500302513915bd"
print(flat)
seg = pysbd.Segmenter(language="en", clean=True, char_span=False)
for sentence in seg.segment(flat):
    print(sentence)

Expected behavior
Return f.302205302116302416302500302513915

Example: ['f.302205302116302416302500302513915bd']
help wanted

opened by kariato 8

Incorrect text span start and end returned

Something odd happens in this case. Note that the indices of the second text span are incorrect:

>>> seg = pysbd.Segmenter(language='en', clean=False, char_span=True)
>>> seg.segment("1) The first item. 2) The second item.")                                                                                
[TextSpan(sent='1) The first item.', start=0, end=18), TextSpan(sent='2) The second item.', start=0, end=19)]
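
Offsets like these can be cross-checked independently of pySBD. The sketch below is not pySBD's implementation: it assumes the sentences occur verbatim and in order in the original text, and the names `Span` and `locate_spans` are hypothetical. Each sentence is searched for starting at the end of the previous match, so a later sentence cannot be mapped back to offset 0.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for a (sent, start, end) text span."""
    sent: str
    start: int
    end: int


def locate_spans(text: str, sentences: List[str]) -> List[Span]:
    """Map each sentence back to its (start, end) offsets in `text`.

    Assumes the sentences occur verbatim and in order in `text`.
    """
    spans = []
    cursor = 0  # never search before the end of the previous sentence
    for sent in sentences:
        start = text.find(sent, cursor)
        if start == -1:
            raise ValueError(f"sentence not found after offset {cursor}: {sent!r}")
        end = start + len(sent)
        spans.append(Span(sent, start, end))
        cursor = end
    return spans


text = "1) The first item. 2) The second item."
for span in locate_spans(text, ["1) The first item.", "2) The second item."]):
    print(span)
# Span(sent='1) The first item.', start=0, end=18)
# Span(sent='2) The second item.', start=19, end=38)
```

Slicing the original text with each returned pair (`text[start:end]`) gives back the sentence, which is a quick sanity check for any span-producing segmenter.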

bug

opened by dakinggg 7

Performance improvement?

I am not certain of this, but I suspect there is room for a performance improvement by using re.compile to precompile all of the needed regexes. Otherwise they have to be recompiled regularly (once the re cache of 100 has been exceeded).
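
As a rough illustration of the idea (not pySBD's actual rule set), the sketch below compiles two hypothetical substitution rules once at import time and reuses the compiled objects on every call. The patterns and the `apply_rules` name are assumptions made for the example.

```python
import re

# Hypothetical rules, compiled once at import time instead of on each call.
RULES = [
    (re.compile(r"\s{2,}"), " "),    # collapse runs of whitespace
    (re.compile(r"\.{4,}"), "..."),  # shorten long runs of periods
]


def apply_rules(text: str) -> str:
    """Apply every precompiled rule to `text`, in order."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(apply_rules("Wait.....   what?"))  # Wait... what?
```

Whether this helps in practice depends on how many distinct patterns are in play; it is worth measuring before and after.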
question

opened by dakinggg 7
Slovak lang support
We've added support for SBD in Slovak language text.

Language specific improvements:

list of common slovak abbreviations

list of prepositive abbreviations

list of number abbreviations

handling of roman numerals

handling of „ text “ quotes, that are common in Slovak language

handling of ordinal numerals in dates, such as 17. Apríl 2020

modified the replacement of periods in abbreviations, so it can consistently handle common Slovak abbreviations such as Company Name s. r. o.

disabled processing of alphabetical lists, because of conflicts with some common abbreviations

The code has been tested for stability on a very large corpus of web text. There has been no rigorous testing of segmentation quality, but the subjective feeling in the team is very positive.
language
opened by misotrnka 6
Different segmentation with Spacy and when using pySBD directly
Firstly thank you for this project - I was lucky to find it and it is really useful

I seem to have found a case where the segmentation is behaving differently when run within the Spacy pipeline and when run using pySBD directly. I stumbled on it with my own text where a sentence after a previous sentence that was in quotes was being lumped together. I looked through the Golden Rules and found this wasn't expected and then noticed that even with the text in one of your tests it acts differently in Spacy.

To reproduce run these two bits of code:

import spacy
from pysbd.utils import PySBDFactory

nlp = spacy.blank('en')
nlp.add_pipe(PySBDFactory(nlp))
doc = nlp("She turned to him, \"This is great.\" She held the book out to show him.")
for sent in doc.sents:
    print(str(sent).strip() + '\n')

She turned to him, "This is great." She held the book out to show him.

import pysbd

text = "She turned to him, \"This is great.\" She held the book out to show him."
seg = pysbd.Segmenter(language="en", clean=False)
# print(seg.segment(text))
for sent in seg.segment(text):
    print(str(sent).strip() + '\n')

She turned to him, "This is great."

She held the book out to show him.

The second way is the desired output (based on the rules at least)
bug help wanted
opened by nmstoker 6

destructive behaviour in edge-cases

As of v0.3.3, pySBD shows destructive behavior in some edge-cases even when setting the option clean to False. When dealing with OCR text, pySBD removes whitespace after multiple periods.

To reproduce

import pysbd

splitter = pysbd.Segmenter(language="fr", clean=False)

text = "Maissen se chargea du reste .. Logiquement,"
print(splitter.segment(text))

text = "Maissen se chargea du reste ... Logiquement,"
print(splitter.segment(text))

text = "Maissen se chargea du reste .... Logiquement,"
print(splitter.segment(text))

Actual output

Note the missing whitespace after the final period in the examples with .. and ....

['Maissen se chargea du reste .', '.', 'Logiquement,']
['Maissen se chargea du reste ... ', 'Logiquement,']
['Maissen se chargea du reste .', '...', 'Logiquement,']

Expected output

['Maissen se chargea du reste .', '. ', 'Logiquement,']
['Maissen se chargea du reste ... ', 'Logiquement,']
['Maissen se chargea du reste .', '... ', 'Logiquement,']

In general, pySBD works well. Many thanks @nipunsadvilkar. I can also look into this as soon as I find some time and open a pull request.
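
With clean=False, one way to catch this kind of regression is to check that the segments concatenate back to the input. The sketch below hard-codes the outputs quoted in this report rather than calling pySBD, and `is_lossless` is a hypothetical helper name.

```python
from typing import List


def is_lossless(text: str, segments: List[str]) -> bool:
    """Return True if joining the segments reproduces `text` exactly."""
    return "".join(segments) == text


text = "Maissen se chargea du reste .. Logiquement,"

# Output reported above for v0.3.3: the space after the second period is gone.
actual = ["Maissen se chargea du reste .", ".", "Logiquement,"]
# Output the reporter expected.
expected = ["Maissen se chargea du reste .", ". ", "Logiquement,"]

print(is_lossless(text, actual))    # False
print(is_lossless(text, expected))  # True
```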

bug edge-cases

opened by aflueckiger 5

🏎 ⚡️ 💯 [Rough] Benchmark across Segmentation Tools, Libraries and Algorithms
Segmentation Tools, Libraries and Algorithms:

[x] Stanza

[x] syntok

[x] NLTK

[x] spaCy

[x] blingfire

| Tool      | Accuracy | Speed (ms) |
|-----------|----------|------------|
| blingfire | 75.00%   | 49.91      |
| pySBD     | 97.92%   | 2449.18    |
| syntok    | 68.75%   | 783.73     |
| spaCy     | 52.08%   | 473.96     |
| stanza    | 72.92%   | 120803.37  |
| NLTK      | 56.25%   | 342.98     |
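
The accuracy column is the share of Golden Rule examples that a tool segments exactly as expected. A minimal sketch of that calculation follows. The two test cases and the naive regex splitter are illustrative stand-ins (not the real Golden Rules set or any of the benchmarked tools), and `golden_rule_accuracy` is a hypothetical name.

```python
import re
from typing import Callable, List, Tuple

Case = Tuple[str, List[str]]  # (input text, expected sentences)


def golden_rule_accuracy(segment: Callable[[str], List[str]], cases: List[Case]) -> float:
    """Percentage of cases whose segmentation matches the expected list exactly."""
    passed = sum(
        1 for text, expected in cases
        if [s.strip() for s in segment(text)] == expected
    )
    return 100.0 * passed / len(cases)


def naive_split(text: str) -> List[str]:
    """Stand-in segmenter: split after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


cases = [
    ("Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."]),
    ("My name is Jonas E. Smith. Please turn to p. 55.",
     ["My name is Jonas E. Smith.", "Please turn to p. 55."]),
]
print(f"{golden_rule_accuracy(naive_split, cases):.2f}%")  # 50.00%
```

The naive splitter fails the second case because it also breaks after "E." and "p.", which is the kind of abbreviation handling that rule-based segmenters target.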
opened by nipunsadvilkar 5
✨ 💫 Support Multiple languages
Languages to be supported:

[x] English

[x] Bulgarian

[x] Spanish

[x] Russian

[x] Arabic

[x] Amharic

[x] Marathi

[x] Hindi

[x] Armenian

[x] Persian

[x] Urdu

[x] Polish

[x] Chinese

[x] Dutch

[x] Danish

[x] French

[x] Italian

[x] Greek

[x] Burmese

[x] Japanese

[x] German

[x] Kazakh

enhancement
opened by nipunsadvilkar 4
Regexp issues

I'm getting errors because the regexp engine interprets parentheses: "unterminated subpattern" and "unbalanced parenthesis".

I'm analysing very large amounts of text, so not sure how these were triggered.
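
Errors of this kind typically appear when raw text is interpolated into a pattern, so that a stray `(` or `)` is read as regex syntax. The sketch below reproduces the error with the standard library and shows `re.escape` as one way to avoid it. The helper `count_occurrences` is hypothetical and is not pySBD code.

```python
import re


def count_occurrences(fragment: str, text: str, escape: bool = True) -> int:
    """Count occurrences of `fragment` in `text` using a regex search."""
    pattern = re.escape(fragment) if escape else fragment
    return len(re.findall(pattern, text))


text = "See item 1) and item 2) below."

try:
    count_occurrences("1)", text, escape=False)
except re.error as err:
    # The ")" is parsed as regex syntax, not as a literal character.
    print("unescaped:", err)

print("escaped:", count_occurrences("1)", text))  # escaped: 1
```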

opened by mollerhoj 4
Reduce some calls to re.sub

So calls to re.compile are not a problem. The main thing slowing it down is lots of calls to re.sub in abbreviation_replacer.py. I reduced some of these calls which speeds it up by a factor of ~3-3.5x on my machine, for the specific (longish) document that I tested with. I also included the script I used to test timing. Given that you are much more familiar with the codebase, see if my changes look reasonable, but all the tests do still pass. There are probably some more ways to speed up the calls in that file.
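
One common way to cut the number of re.sub calls (not necessarily what this pull request does) is to fold many per-abbreviation substitutions into a single alternation pattern. In the sketch below, the abbreviation list, the placeholder character, and the `protect_abbreviations` name are all assumptions made for illustration.

```python
import re

ABBREVIATIONS = ["dr", "mr", "fig", "p"]  # hypothetical sample list
PLACEHOLDER = "∯"  # hypothetical marker standing in for a protected period

# One compiled alternation instead of one re.sub call per abbreviation.
ABBR_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\.",
    flags=re.IGNORECASE,
)


def protect_abbreviations(text: str) -> str:
    """Replace the period after each known abbreviation with PLACEHOLDER."""
    return ABBR_RE.sub(lambda m: m.group(1) + PLACEHOLDER, text)


print(protect_abbreviations("As shown in Fig. 2 by Dr. Smith on p. 55."))
# As shown in Fig∯ 2 by Dr∯ Smith on p∯ 55.
```

The text is scanned once regardless of how many abbreviations are in the list, which is where the saving would come from.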
enhancement

opened by dakinggg 4

How is accuracy on OPUS-100 computed?

Hi! Thanks for this library.

Since there is no notion of documents in the OPUS-100 dataset it is not clear to me how accuracy is computed. I tried a naive approach using pairwise joining of sentences:

from datasets import load_dataset
import pysbd

if __name__ == "__main__":
    sentences = [
        sample["de"].strip()
        for sample in load_dataset("opus100", "de-en", split="test")["translation"]
    ]

    correct = 0
    total = 0

    segmenter = pysbd.Segmenter(language="de")

    for sent1, sent2 in zip(sentences, sentences[1:]):
        out = tuple(
            s.strip() for s in segmenter.segment(sent1 + " " + sent2)
        )

        total += 1

        if out == (sent1, sent2):
            correct += 1

    print(f"{correct}/{total} = {correct / total}")

But I get 1011/1999 = 50.6% Accuracy which is not close to the 80.95% Accuracy reported in the paper.

Thanks for any help!

opened by bminixhofer 1

Added decorator as required by latest SpaCy

Hello!

In using pySBD, I've noticed that the current example script no longer works with the latest version of SpaCy (3.3.0). This is the traceback I get:

Traceback (most recent call last):
  File "/Users/lucas/Code/significant-statements-extraction/scripts/test_pysbd.py", line 27, in <module>
    nlp.add_pipe(pysbd_sentence_boundaries)
  File "/Users/lucas/miniforge3/envs/pytorch_p39/lib/python3.9/site-packages/spacy/language.py", line 773, in add_pipe
    raise ValueError(err)
ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <function pysbd_sentence_boundaries at 0x11ffa9160> (name: 'None').

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

This pull request adds a @Language.component decorator to make pySBD usable with spaCy again.

opened by soldni 0

Arabic sentence split on the Arabic comma
Describe the bug
Arabic sentences are split on the Arabic comma.

To Reproduce

import pysbd

text = "هذه تجربة، للغة العربية"
seg = pysbd.Segmenter(language="ar", clean=True)
print(seg.segment(text))

Output: ['هذه تجربة،', 'للغة العربية']

Expected behavior The text should not be split on the Arabic comma. Expected output: ['هذه تجربة، للغة العربية']

Additional context I locally fixed it by modifying the file: pysbd/lang/arabic.py, deleting ، from SENTENCE_BOUNDARY_REGEX.
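
The effect of that change can be illustrated with simplified patterns. These are not the actual SENTENCE_BOUNDARY_REGEX from pysbd/lang/arabic.py: both patterns and the `split_sentences` helper are assumptions made for the example. Including the Arabic comma (،) among the sentence-final characters splits the text, while leaving it out keeps the sentence whole.

```python
import re

# Simplified, hypothetical boundary patterns (not pySBD's real regex).
WITH_COMMA = re.compile(r"[^.!?؟،]+[.!?؟،]?")
WITHOUT_COMMA = re.compile(r"[^.!?؟]+[.!?؟]?")


def split_sentences(pattern, text):
    """Return the non-empty, stripped matches of `pattern` in `text`."""
    return [m.group().strip() for m in pattern.finditer(text) if m.group().strip()]


text = "هذه تجربة، للغة العربية"
print(split_sentences(WITH_COMMA, text))     # two segments, split at the comma
print(split_sentences(WITHOUT_COMMA, text))  # one segment
```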
opened by ymoslem 0
Does pysbd delete sentences after detection ?

Hey there. I've been using pysbd to detect sentence boundaries in Hindi and Marathi text and then saving the same data, rearranged from paragraphs to one sentence per sample. Unfortunately, the storage size has gone down from 22 GB to 14.5 GB after just detecting boundaries and saving the text per sentence. And yes, I did turn off the clean argument.

opened by StephennFernandes 0
Update pysbd_as_spacy_component.py

Thanks for a great sentence-splitting package. This is a small contribution after troubleshooting why the code was not working out of the box. spaCy v3 requires a string in the add_pipe() call, and the component needs to be declared using the Language decorator. See also https://spacy.io/usage/processing-pipelines#custom-components. Hope it helps other users.

opened by guebeln0 0

Releases(v0.3.4)

v0.3.4(Feb 11, 2021)
🐛 Fix trailing period/ellipses with spaces - #83

🐛 Regex escape for parenthesis - #87

v0.3.3(Oct 8, 2020)
🐛 Better handling consecutive periods and reserved special symbols - allenai/scholarphi#114

Add CONTRIBUTING.md

v0.3.2(Sep 11, 2020)
🐛 ✅ Enforce clean=True when doc_type="pdf" - #75

v0.3.1(Aug 11, 2020)

🚑 ✅ Handle Newline character & update tests

v0.3.0(Aug 11, 2020)

✨ 💫 Support Multiple languages - #2

🏎⚡️💯 Benchmark across Segmentation Tools, Libraries and Algorithms

🎨 ♻️ Update sentence char_span logic

⚡️ Performance improvements - #41

♻️🐛 Refactor AbbreviationReplacer

v0.3.0rc(Jun 9, 2020)
✨ 💫 sent char_span through with spaCy & regex approach - #63

♻️ Refactoring to support multiple languages

✨ 💫Initial language support for - Hindi, Marathi, Chinese, Spanish

✅ Updated tests - more coverage & regression tests for issues

👷👷🏻‍♀️ GitHub actions for CI-CD

💚☂️ Add code coverage - coverage.py Add Codecov

🐛 Fix incorrect text span & vanilla pysbd vs spacy output discrepancy - #49, #53, #55 , #59

🐛 Fix NUMBERED_REFERENCE_REGEX for zero or one time - #58

🔐Fix security vulnerability bleach - #62

v0.2.3(Nov 13, 2019)

🐛 Performance improvement in abbreviation_replacer by reducing re.sub calls - @danielkingai2 #50
v0.2.2(Nov 1, 2019)
🐛 Fix unbalanced parenthesis - #47

v0.2.1(Oct 30, 2019)
✨ pysbd as a spacy component through entrypoints

v0.2.0(Oct 25, 2019)
✨Add char_span parameter (optional) to get sentence & its (start, end) char offsets from original text

✨pySBD as a spaCy component example

🐛 Fix double question mark swallow bug - #39

v0.1.5(Oct 24, 2019)
🐛 Handle text with only punctuations - #36

🐛 Handle exclamation marks at EOL- #37

v0.1.4(Oct 20, 2019)
✨ ✅ Handle intermittent punctuation: added the special case r"[。．.！!?].*" to handle intermittent dots, exclamation marks, etc. The special-cases group can be updated as per developer needs - #34

v0.1.3(Oct 19, 2019)
🐛 Fix lists_item_replacer - #29

🐛 Fix & ♻️ refactor replace_multi_period_abbreviations - #30

🐛 Fix abbreviation_replacer - #31

✅ Add regression tests for issues

v0.1.2(Oct 18, 2019)

Fixed #27 through #28
v0.1.1(Oct 9, 2019)

Supports English only; other languages are a work in progress.

Owner

Nipun Sadvilkar

I like to explore Jungle of Data with Python as my swiss knife with pandas, numpy, matplotlib and scikit-learn as its multi-tools😅

