📝An easy-to-use package to restore punctuation of the text.

Related tags

Text Data & NLPrpunct
Overview

✏️ rpunct - Restore Punctuation

forthebadge

This repo contains code for Punctuation restoration.

This package is intended for direct use as a punctuation restoration model for the general English language. Alternatively, you can use this for further fine-tuning on domain-specific texts for punctuation restoration tasks. It uses HuggingFace's bert-base-uncased model weights that have been fine-tuned for Punctuation restoration.

Punctuation restoration works on arbitrarily large text. And uses GPU if it's available otherwise will default to CPU.

List of punctuations we restore:

  • Upper-casing
  • Period: .
  • Exclamation: !
  • Question Mark: ?
  • Comma: ,
  • Colon: :
  • Semi-colon: ;
  • Apostrophe: '
  • Dash: -

🚀 Usage

Below is a quick way to get up and running with the model.

  1. First, install the package.
pip install rpunct
  1. Sample python code.
from rpunct import RestorePuncts
# The default language is 'english'
rpunct = RestorePuncts()
rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
# Outputs the following:
# In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
# resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
# thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B. 
# Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
# sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.

🎯 Accuracy

Here is the number of product reviews we used for finetuning the model:

Language Number of text samples
English 560,000

We found the best convergence around 3 epochs, which is what presented here and available via a download.


The fine-tuned model obtained the following accuracy on 45,990 held-out text samples:

Accuracy Overall F1 Eval Support
91% 90% 45,990

💻 🎯 Further Fine-Tuning

To start fine-tuning or training please look into training/train.py file. Running python training/train.py will replicate the results of this model.


Contact

Contact Daulet Nurmanbetov for questions, feedback and/or requests for similar models.


Comments
  • Update requirements.txt

    Update requirements.txt

    ERROR: Could not find a version that satisfies the requirement torch==1.8.1 (from rpunct) (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0) ERROR: No matching distribution found for torch==1.8.1

    opened by Rukaya-lab 0
  • Forked repo with fixes

    Forked repo with fixes

    I forked this repository (link here) to fix the outdated dependencies and incompatibility with non-CUDA machines. If anyone needs these fixes, feel free to install from the fork:

    pip install git+https://github.com/samwaterbury/rpunct.git
    

    Hopefully this repository is updated or another maintainer is assigned. And thanks to the creator @Felflare, this is a useful tool!

    opened by samwaterbury 2
  • Requirements shouldn't ask for such specific versions

    Requirements shouldn't ask for such specific versions

    First, thanks a lot for providing this package :)

    Currently, the requirements.txt, and thus the dependencies in the setup.py are for very specific versions of Pytorch etc. This shouldn't be the case if you want this package to be used as a general library (think of a second package that would do the same but ask for an incompatible version of PyTorch and would prevent any possible installation of the two together). The end user might also be needing a more recent version of PyTorch. Given that PyTorch is almost always backward compatible, and quite stable, I think the requirements for it could be changed from ==1.8.1 to >=1.8.1. I believe the same would be true for the other packages.

    opened by adefossez 2
  • Added ability to pass additional parameters to simpletransformer ner in RestorePuncts class.

    Added ability to pass additional parameters to simpletransformer ner in RestorePuncts class.

    Thanks for the great library! When running this without a GPU I had problems. I think there is a simple fix. The simple transformer NER model defaults to enabling cuda. This PR allows the user to pass a dictionary of arguments specifically for the simpletransformers NER model. So you can now run the code on a CPU by initializing rpunct like so

    rpunct = RestorePuncts(ner_args={"use_cuda": False})
    

    Before this change, when running rpunct examples on the CPU the following error occurs:

    from rpunct import RestorePuncts
    # The default language is 'english'
    rpunct = RestorePuncts()
    rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
    by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
    a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
    professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
    3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
    
    

    ValueError Traceback (most recent call last) /var/folders/hx/dhzhl_x51118fm5cd13vzh2h0000gn/T/ipykernel_10548/194907560.py in 1 from rpunct import RestorePuncts 2 # The default language is 'english' ----> 3 rpunct = RestorePuncts() 4 rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record 5 by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were

    ~/repos/rpunct/rpunct/punctuate.py in init(self, wrds_per_pred, ner_args) 19 if ner_args is None: 20 ner_args = {} ---> 21 self.model = NERModel("bert", "felflare/bert-restore-punctuation", labels=self.valid_labels, 22 args={"silent": True, "max_seq_length": 512}, **ner_args) 23

    ~/repos/transformers/transformer-env/lib/python3.8/site-packages/simpletransformers/ner/ner_model.py in init(self, model_type, model_name, labels, args, use_cuda, cuda_device, onnx_execution_provider, **kwargs) 209 self.device = torch.device(f"cuda:{cuda_device}") 210 else: --> 211 raise ValueError( 212 "'use_cuda' set to True when cuda is unavailable." 213 "Make sure CUDA is available or set use_cuda=False."

    ValueError: 'use_cuda' set to True when cuda is unavailable.Make sure CUDA is available or set use_cuda=False.

    opened by nbertagnolli 1
  • add use_cuda parameter

    add use_cuda parameter

    using the package in an environment without cuda support causes it to fail. Adding the parameter to shut it off if necessary allows it to function normall.

    opened by mjfox3 1
Releases(1.0.1)
Owner
Daulet Nurmanbetov
Deep Learning, AI and Finance
Daulet Nurmanbetov
Gathers machine learning and Tensorflow deep learning models for NLP problems, 1.13 < Tensorflow < 2.0

NLP-Models-Tensorflow, Gathers machine learning and tensorflow deep learning models for NLP problems, code simplify inside Jupyter Notebooks 100%. Tab

HUSEIN ZOLKEPLI 1.7k Dec 30, 2022
ConvBERT-Prod

ConvBERT 目录 0. 仓库结构 1. 简介 2. 数据集和复现精度 3. 准备数据与环境 3.1 准备环境 3.2 准备数据 3.3 准备模型 4. 开始使用 4.1 模型训练 4.2 模型评估 4.3 模型预测 5. 模型推理部署 5.1 基于Inference的推理 5.2 基于Serv

yujun 7 Apr 08, 2022
Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch

COCO LM Pretraining (wip) Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch. They were a

Phil Wang 44 Jul 28, 2022
BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions

BERTopic BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable

Maarten Grootendorst 3.6k Jan 07, 2023
Just Another Telegram Ai Chat Bot Written In Python With Pyrogram.

OkaeriChatBot Just another Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher.

Wahyusaputra 2 Dec 23, 2021
Minimal GUI for accessing the Watson Text to Speech service.

Description Minimal graphical application for accessing the Watson Text to Speech service. Requirements Python 3 plus all dependencies listed in requi

Moritz Maxeiner 1 Oct 22, 2021
Yet Another Compiler Visualizer

yacv: Yet Another Compiler Visualizer yacv is a tool for visualizing various aspects of typical LL(1) and LR parsers. Check out demo on YouTube to see

Ashutosh Sathe 129 Dec 17, 2022
This repository contains helper functions which can help you generate additional data points depending on your NLP task.

NLP Albumentations For Data Augmentation This repository contains helper functions which can help you generate additional data points depending on you

Aflah 6 May 22, 2022
Korean stereoypte detector with TUNiB-Electra and K-StereoSet

Korean Stereotype Detector Korean stereotype sentence classifier using K-StereoSet with TUNiB-Electra Web demo you can test this model easily in demo

Sae_Chan_Oh 11 Feb 18, 2022
TensorFlow code and pre-trained models for BERT

BERT ***** New March 11th, 2020: Smaller BERT Models ***** This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece

Google Research 32.9k Jan 08, 2023
Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP

Pretrain and Fine-tune a T5 model with Flax on GCP This tutorial details how pretrain and fine-tune a FlaxT5 model from HuggingFace using a TPU VM ava

Gabriele Sarti 41 Nov 18, 2022
Deep learning for NLP crash course at ABBYY.

Deep NLP Course at ABBYY Deep learning for NLP crash course at ABBYY. Suggested textbook: Neural Network Methods in Natural Language Processing by Yoa

Dan Anastasyev 597 Dec 18, 2022
Various Algorithms for Short Text Mining

Short Text Mining in Python Introduction This package shorttext is a Python package that facilitates supervised and unsupervised learning for short te

Kwan-Yuet 466 Dec 06, 2022
A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis

WaveGlow A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis Quick Start: Install requirements: pip install

Yuchao Zhang 204 Jul 14, 2022
An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.

GPT Neo 🎉 1T or bust my dudes 🎉 An implementation of model & data parallel GPT3-like models using the mesh-tensorflow library. If you're just here t

EleutherAI 6.7k Dec 28, 2022
Telegram bot to auto post messages of one channel in another channel as soon as it is posted, without the forwarded tag.

Channel Auto-Post Bot This bot can send all new messages from one channel, directly to another channel (or group, just in case), without the forwarded

Aditya 128 Dec 29, 2022
Stand-alone language identification system

langid.py readme Introduction langid.py is a standalone Language Identification (LangID) tool. The design principles are as follows: Fast Pre-trained

2k Jan 04, 2023
Contains descriptions and code of the mini-projects developed in various programming languages

TexttoSpeechAndLanguageTranslator-project introduction A pleasant application where the client will be given buttons like play,reset and exit. The cli

Adarsh Reddy 1 Dec 22, 2021
Interpretable Models for NLP using PyTorch

This repo is deprecated. Please find the updated package here. https://github.com/EdGENetworks/anuvada Anuvada: Interpretable Models for NLP using PyT

Sandeep Tammu 19 Dec 17, 2022
Python package for performing Entity and Text Matching using Deep Learning.

DeepMatcher DeepMatcher is a Python package for performing entity and text matching using deep learning. It provides built-in neural networks and util

461 Dec 28, 2022