Model for recasing and repunctuating ASR transcripts

Last update: Dec 29, 2022

Overview

Recasing and punctuation model based on BERT

Benoit Favre 2021

This system converts a sequence of lowercase tokens without punctuation to a sequence of cased tokens with punctuation.

It is trained to predict both aspects at the token level in a multitask fashion, from fine-tuned BERT representations.

The model predicts the following recasing labels:

lower: keep lowercase
upper: convert to upper case
capitalize: set first letter as upper case
other: left as is

And the following punctuation labels:

o: no punctuation
period: .
comma: ,
question: ?
exclamation: !
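
As a rough illustration of how the two label sets combine, the sketch below applies one case label and one punctuation label to each token. The helper names (`apply_case`, `apply_labels`) and the choice to attach punctuation directly after the token are assumptions made for this example; they are not taken from recasepunc.py.

```python
# Punctuation label -> character to append (label names as listed above).
PUNCTUATION = {"o": "", "period": ".", "comma": ",", "question": "?", "exclamation": "!"}


def apply_case(token, label):
    """Apply one recasing label to a lowercase token."""
    if label == "lower":
        return token.lower()
    if label == "upper":
        return token.upper()
    if label == "capitalize":
        return token[:1].upper() + token[1:]
    return token  # "other": leave the token as is


def apply_labels(tokens, case_labels, punc_labels):
    """Rebuild cased, punctuated text from tokens and per-token labels."""
    pieces = []
    for token, case, punc in zip(tokens, case_labels, punc_labels):
        pieces.append(apply_case(token, case) + PUNCTUATION[punc])
    return " ".join(pieces)


tokens = ["hello", "world", "how", "are", "you"]
case_labels = ["capitalize", "lower", "capitalize", "lower", "lower"]
punc_labels = ["o", "period", "o", "o", "question"]
print(apply_labels(tokens, case_labels, punc_labels))
```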

Input tokens are batched as sequences of length 256 that are processed independently without overlap.
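
The windowing described above can be sketched as follows. This is a minimal illustration, not the script's own code: the function name `split_into_windows`, the padding id `0`, and the decision to pad the last window rather than drop it are all assumptions.

```python
MAX_LENGTH = 256
PAD_ID = 0  # assumed padding id, for illustration only


def split_into_windows(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Cut a flat list of token ids into non-overlapping rows of max_length.

    The last row is padded so that every row has the same length.
    """
    rows = []
    for start in range(0, len(token_ids), max_length):
        row = token_ids[start:start + max_length]
        rows.append(row + [pad_id] * (max_length - len(row)))
    return rows


rows = split_into_windows(list(range(1, 601)))  # 600 dummy token ids
print(len(rows), [len(row) for row in rows])
```

Because each row is processed independently, a sentence that straddles a window boundary is seen without its context on the other side.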

In training, batches containing fewer than 256 tokens are simulated by drawing a length uniformly at random and replacing all tokens and labels after that point with padding (a technique called Cut-drop).
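
A minimal sketch of Cut-drop on a single sequence, using plain Python lists instead of tensors. The function name, the padding id and padding label (`0`), and the exact range of the drawn length are assumptions; `probability` stands for a Cut-drop probability such as the one listed with the model parameters.

```python
import random

PAD_ID = 0     # assumed padding token id
PAD_LABEL = 0  # assumed padding label


def cut_drop(x_row, y_row, probability, rng):
    """With the given probability, keep a random-length prefix and pad the rest."""
    if rng.random() >= probability:
        return x_row, y_row  # sequence left untouched
    cut = rng.randint(1, len(x_row) - 1)  # uniform length, strictly shorter than the row
    padding = len(x_row) - cut
    return (x_row[:cut] + [PAD_ID] * padding,
            y_row[:cut] + [PAD_LABEL] * padding)


rng = random.Random(871253)
x = list(range(1, 257))  # 256 dummy token ids
y = [1] * 256            # 256 dummy labels
cut_x, cut_y = cut_drop(x, y, probability=1.0, rng=rng)
print(len(cut_x), cut_x.index(PAD_ID))  # total length is unchanged; index of the first pad
```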

Changelog:

Fix generation when input is smaller than max length

Installation

Use your favourite method for installing Python requirements. For example:

python -m venv env
. env/bin/activate
pip3 install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

Prediction

Predict from raw text:

python recasepunc.py predict checkpoint/path.iteration < input.txt > output.txt

Models

French: fr-txt.large.19000 trained on 160M tokens from Common Crawl
- Iterations: 19000
- Batch size: 16
- Max length: 256
- Seed: 871253
- Cut-drop probability: 0.1
- Train loss: 0.021128975618630648
- Valid loss: 0.015684964135289192
- Recasing accuracy: 96.73
- Punctuation accuracy: 95.02
  - All punctuation F-score: 67.79
  - Comma F-score: 67.94
  - Period F-score: 72.91
  - Question F-score: 57.57
  - Exclamation mark F-score: 15.78
- Training data: First 100M words from Common Crawl
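
The gap between punctuation accuracy and the F-scores above is what one would expect when most tokens carry no punctuation: accuracy rewards every correct `o`, while an F-score only considers punctuation marks. The sketch below illustrates this on toy data. How recasepunc.py aggregates its scores is not stated here, so the micro-averaged definition used in `f_score` is an assumption.

```python
def accuracy(gold, pred):
    """Fraction of tokens whose predicted label matches the reference."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f_score(gold, pred, labels):
    """Micro-averaged F-score restricted to the given punctuation labels."""
    pairs = list(zip(gold, pred))
    true_pos = sum(g == p and g in labels for g, p in pairs)
    false_pos = sum(p in labels and g != p for g, p in pairs)
    false_neg = sum(g in labels and g != p for g, p in pairs)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy data: 20 tokens, only two of which carry punctuation.
gold = ["o"] * 18 + ["period", "comma"]
pred = ["o"] * 18 + ["period", "o"]  # the comma is missed

marks = {"period", "comma", "question", "exclamation"}
print(accuracy(gold, pred), f_score(gold, pred, marks))
```

Missing one of two punctuation marks barely moves the accuracy but costs a third of the F-score.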

Training

Notes: Adapt the file names in the commands below to your own data. Training tensors are precomputed and loaded in CPU memory.

Stage 0: download text data

Stage 1: tokenize and normalize text with Moses tokenizer, and extract recasing and repunctuation labels

python recasepunc.py preprocess < input.txt > input.case+punc

Stage 2: sub-tokenize with the FlauBERT tokenizer, and generate PyTorch tensors

python recasepunc.py tensorize input.case+punc input.case+punc.x input.case+punc.y

Stage 3: train model

python recasepunc.py train train.x train.y valid.x valid.y checkpoint/path

Stage 4: evaluate performance on a test set

python recasepunc.py eval checkpoint/path.iteration test.x test.y
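
Stages 1 to 4 can be chained from a small driver script. The sketch below only reuses the sub-commands shown above; the split names (`train`, `valid`, `test`), the `<split>.txt` input files and the `iteration` placeholder are assumptions to adapt, and `run` is not called here because it needs recasepunc.py and the data to be present.

```python
import subprocess

SCRIPT = "recasepunc.py"


def plan(checkpoint="checkpoint/path", iteration="iteration"):
    """Return (argv, stdin_path, stdout_path) steps for stages 1 to 4."""
    steps = []
    for split in ("train", "valid", "test"):
        labeled = f"{split}.case+punc"
        # Stage 1: raw text on stdin, labeled text on stdout.
        steps.append((["python", SCRIPT, "preprocess"], f"{split}.txt", labeled))
        # Stage 2: labeled text -> .x (inputs) and .y (labels) tensors.
        steps.append((["python", SCRIPT, "tensorize", labeled,
                       f"{labeled}.x", f"{labeled}.y"], None, None))
    # Stage 3: train on the train split, validate on the valid split.
    steps.append((["python", SCRIPT, "train",
                   "train.case+punc.x", "train.case+punc.y",
                   "valid.case+punc.x", "valid.case+punc.y", checkpoint], None, None))
    # Stage 4: evaluate a saved checkpoint on the test split.
    steps.append((["python", SCRIPT, "eval", f"{checkpoint}.{iteration}",
                   "test.case+punc.x", "test.case+punc.y"], None, None))
    return steps


def run(steps):
    """Execute the steps in order, stopping at the first failure."""
    for argv, stdin_path, stdout_path in steps:
        stdin = open(stdin_path, "rb") if stdin_path else None
        stdout = open(stdout_path, "wb") if stdout_path else None
        try:
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
        finally:
            for handle in (stdin, stdout):
                if handle:
                    handle.close()


for argv, stdin_path, stdout_path in plan():
    print(" ".join(argv), "<", stdin_path, ">", stdout_path)
```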

Comments

Is it possible to customize for new language?

Dear Benoit Favre,

Your project is really important! Is it possible to adapt it to a new language? If so, could you give a few short hints on how to do it?

Thank you in advance!

opened by ican24 5

Can't get attribute 'WordpieceTokenizer'

Hi thanks for your effort on developing recasepunc! I know that you can't provide help for models not trained by you, but maybe you have an idea what's going wrong here:

I'm loading the model vosk-recasepunc-de-0.21 from https://alphacephei.com/vosk/models. When I do so, torch tells me that it can't find WordpieceTokenizer. Do you know why? Is the model incompatible?

Punc predict path: C:\Users\admin\meety\vosk-recasepunc-de-0.21\checkpoint
Traceback (most recent call last):
  File "main2.py", line 120, in <module>
    t = transcriber()
  File "main2.py", line 32, in __init__
    self.casePuncPredictor = CasePuncPredictor(punc_predict_path, lang="de")
  File "C:\Users\admin\meety\recasepunc.py", line 273, in __init__
    loaded = torch.load(checkpoint_path, map_location=device if torch.cuda.is_available() else 'cpu')
  File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 882, in _load
    result = unpickler.load()
  File "C:\Users\admin\Anaconda3\envs\meety\lib\site-packages\torch\serialization.py", line 875, in find_class
    return super().find_class(mod_name, name)
AttributeError: Can't get attribute 'WordpieceTokenizer' on <module '__main__' from 'main2.py'>

opened by padmalcom 4

Can't do inference

Hello, I'm trying to use example.py on a French model (fr.22000 or fr-txt.large.19000), but I get this error:

raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Model:
    Unexpected key(s) in state_dict: "bert.position_ids".

I also tried the following command, with the same error in the output:

python recasepunc.py predict fr.22000 < toto.txt > output.txt

Do you have any advice? Thanks

opened by MatFrancois 3

Memory usage

Hi, at startup the punctuation app uses about 9 GB of RAM, but only for a moment (while loading the model). After that it needs about 1.5 GB. Can we reduce the 9 GB peak at startup? Maybe some check performed on the model at startup is responsible, and that feature could be turned off?

opened by gubri 1

Russian model doesn't work, while English does

When I use the Russian model, it gives me this error:

WARNING: reverting to cpu as cuda is not available
Some weights of the model checkpoint at DeepPavlov/rubert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight']

  File "C:\pypy\rus\recasepunc.py", line 741, in <module>
    main(config, config.action, config.action_args)
  File "C:\pypy\rus\recasepunc.py", line 715, in main
    generate_predictions(config, *args)
  File "C:\pypy\rus\recasepunc.py", line 349, in generate_predictions
    for line in sys.stdin:
  File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 0: invalid continuation byte

  File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\site-packages\flask\app.py", line 1796, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "C:\pypy\app.py", line 32, in process_audio
    cased = subprocess.check_output('python rus/recasepunc.py predict rus/checkpoint', shell=True, text=True, input=text)
  File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\Users\Xenia\AppData\Local\Programs\Python\Python39\lib\subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python rus/recasepunc.py predict rus/checkpoint' returned non-zero exit status 1.

Sorry for the long message; I'm not sure which of these messages are the most important. Should I use another version of transformers? I use transformers==4.16.2 and it works fine with the English model.
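
One possible reading of the UnicodeDecodeError above, offered as an assumption rather than a diagnosis: with `text=True`, `subprocess.check_output` encodes the input with the locale encoding, which on a Russian Windows system is often cp1251, while the child process decodes stdin as UTF-8. The sketch below reproduces that kind of mismatch in isolation; the sample word and the `recase` helper are hypothetical, and `recase` is not called here.

```python
import subprocess

text = "привет"

# Bytes as a cp1251 pipe would send them (assumed locale encoding).
raw = text.encode("cp1251")
caught = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as error:
    caught = error  # keep a reference; `error` is cleared after the block
    print(error)


def recase(text, checkpoint="rus/checkpoint"):
    """Hypothetical wrapper: force UTF-8 on the pipe instead of the locale encoding."""
    return subprocess.check_output(
        ["python", "rus/recasepunc.py", "predict", checkpoint],
        input=text,
        encoding="utf-8",  # implies text mode with an explicit encoding
    )
```

English text is unaffected by such a mismatch because ASCII bytes are identical in cp1251 and UTF-8, which would be consistent with the English model working.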

opened by xenia19 0

Export model to be used in C++

Is it possible to export the model to something that can be used in C++ with libtorch?

- export an existing model (a checkpoint provided in this repo)
- export a model after I train it with my own data

Which of the options above is possible, or both?

opened by leohuang2013 0

While running pretrained German model: AttributeError: Can't get attribute 'Trie' on

I am trying to use the pretrained German model:

https://alphacephei.com/vosk/models/vosk-recasepunc-de-0.21.zip

As mentioned in the readme file, I run:

python example.py de-test.txt

but I keep getting the following error:

AttributeError: Can't get attribute 'Trie' on <module 'transformers.tokenization_utils' from '/home/ali/ali_initos_work/internal/data_science/speech_to_text/vosk/vosk_env/lib/python3.7/site-packages/transformers/tokenization_utils.py'>

Any idea if the model itself is wrong?

opened by alihashaam 2

RuntimeError when predicting with the French models

I tried to use the French models (both fr.22000 and fr-txt.large.19000) on a very simple text:

j'aime les fleurs les olives et la raclette

When running python3 recasepunc.py predict fr.22000 < input.txt > output.txt (or with the other model), I get the following RuntimeError:

Traceback (most recent call last):
  File "/home/mael/charly/recasepunc/recasepunc.py", line 733, in <module>
    main(config, config.action, config.action_args)
  File "/home/mael/charly/recasepunc/recasepunc.py", line 707, in main
    generate_predictions(config, *args)
  File "/home/mael/charly/recasepunc/recasepunc.py", line 336, in generate_predictions
    model.load_state_dict(loaded['model_state_dict'])
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1497, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Model:
    Unexpected key(s) in state_dict: "bert.position_ids".

I tried the same with the English model, and it worked perfectly. Looks like something is broken with the French ones?
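
The error message says the checkpoint contains a key, `bert.position_ids`, that the freshly built model does not expect. A likely cause, stated here as an assumption, is that the installed transformers version differs from the one used to save the checkpoint. Installing the versions pinned in requirements.txt is the safest route; a possible workaround is to drop the extra key before loading, as in the sketch below. The helper `drop_keys` and the second key in the toy dictionary are hypothetical, and the dictionary stands in for `loaded['model_state_dict']`, whose values would be tensors.

```python
def drop_keys(state_dict, unexpected=("bert.position_ids",)):
    """Return a copy of a checkpoint state dict without the listed keys."""
    return {key: value for key, value in state_dict.items() if key not in unexpected}


# Toy stand-in for loaded['model_state_dict'].
state = {
    "bert.position_ids": [0, 1, 2],
    "bert.embeddings.word_embeddings.weight": [[0.1, 0.2]],  # illustrative key
}
cleaned = drop_keys(state)
print(sorted(cleaned))

# With a real model, the call would then look like:
#     model.load_state_dict(drop_keys(loaded['model_state_dict']))
```

`torch.nn.Module.load_state_dict` also accepts `strict=False`, which ignores unexpected and missing keys alike; filtering only the known key is narrower and keeps other mismatches visible.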

opened by maelchiotti 2

Parameters like --dab_rate can't be set from the command line because they are bool

Look at the parameters below. They really become bool; I found this bug while debugging.

'''
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("action", help="train|eval|predict|tensorize|preprocess", type=str)
    ...
    parser.add_argument("--updates", help="number of training updates to perform", default=default_config.updates, type=bool)
    parser.add_argument("--period", help="validation period in updates", default=default_config.period, type=bool)
    parser.add_argument("--lr", help="learning rate", default=default_config.lr, type=bool)
    parser.add_argument("--dab-rate", help="drop at boundaries rate", default=default_config.dab_rate, type=bool)
    config = Config(**parser.parse_args().__dict__)

    main(config, config.action, config.action_args)
'''
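
The effect of `type=bool` can be reproduced in isolation: argparse calls `bool()` on the raw string, and any non-empty string is truthy, so the numeric value is lost. The sketch below contrasts this with `type=int` and `type=float`. The default values and the choice of types are assumptions made for the example, not the project's configuration.

```python
import argparse

# type=bool: every non-empty command-line string becomes True.
buggy = argparse.ArgumentParser()
buggy.add_argument("--lr", help="learning rate", default=2e-5, type=bool)
print(buggy.parse_args(["--lr", "0.001"]).lr)

# Numeric types keep the value that was passed.
fixed = argparse.ArgumentParser()
fixed.add_argument("--updates", help="number of training updates", default=24000, type=int)
fixed.add_argument("--lr", help="learning rate", default=2e-5, type=float)
fixed.add_argument("--dab-rate", help="drop at boundaries rate", default=0.1, type=float)
args = fixed.parse_args(["--updates", "1000", "--lr", "0.001", "--dab-rate", "0.2"])
print(args.updates, args.lr, args.dab_rate)
```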
opened by al-zatv 0

Cannot use trained model for validation or prediction

Hi, thank you for this repo! I'm trying to reproduce the results for a different language, so I'm using multilingual BERT fine-tuned on a dataset in my language. Everything goes well during preprocessing and training, and the results are comparable with those for English and French (97-99% for case and punctuation).

But when I try to use the trained model, it gives very poor results, even for sentences from the training dataset. It runs, and it sometimes adds capital letters or periods, but rarely; mostly the model can't handle the input. Also, when I try to evaluate the model with the command from the README (I also tried it on the validation sets already used in training, for instance with python recasepunc.py eval bertugan_casepunc.24000 valid.case+punc.x valid.case+punc.y), it gives this error:

  File "recasepunc.py", line 220, in batchify
    x = x[:(len(x) // max_length) * max_length].reshape(-1, max_length)
TypeError: unhashable type: 'slice'

Sorry for reporting two different problems in one issue, but I thought there might be one common mistake behind both.

opened by khusainovaidar 5

Releases (0.3)

0.3 (Feb 3, 2022)

Checkpoint release
Source code (tar.gz)
Source code (zip)
en.23000 (1249.49 MB)
fr-txt.large.19000 (523.93 MB)
fr.22000 (1575.50 MB)
zh.24000 (1166.63 MB)

0.2 (Sep 26, 2021)

Fix predictions when input is shorter than max length
Source code (tar.gz)
Source code (zip)

0.1 (Sep 20, 2021)

First French model trained on 160M tokens from Common Crawl.
Source code (tar.gz)
Source code (zip)
fr-txt.large.19000 (1571.78 MB)

Owner

Benoit Favre
