Translate .sbv subtitle files

Overview

deepl4subtitle

Deeplを使って字幕ファイル(.sbv)を翻訳します。タイムスタンプも含めて出力しますが、翻訳時はタイムスタンプは文の一部とは切り離されるので、.sbvファイルをそのまま翻訳機に突っ込むよりも高精度な翻訳ができるはずです。

つかいかた

入力する.sbvファイルの前処理として、文の終わりにピリオド(.)を打っていく。これで、Deeplが文の区切りを正しく認識してくれる。

# install deepl 
# https://pypi.org/project/deepl/
pip3 install deepl
python3 deepl4subtitle.py -i sample.sbv -o output.sbv -k YOUR_DEEPL_API_KEY

サンプル

sample video: https://www.youtube.com/watch?v=CL7HuMLIPO0

  • sample.xbv: Youtubeが自動で生成した字幕を若干手直ししたもの
  • sample_deepl4subtitle.sbv: deepl4subtitleを使って翻訳したもの
  • sample_raw_deepl.sbv: sample.xbvの中身をそのままDeeplにコピペして翻訳したもの

sample_raw_deeplだと、タイムスタンプが文章の一部であることが原因であちこちで怪しい翻訳が発生していたのが、sample_deepl4subtitleでは概ね解消されている。

中でやってること

original

(文末のピリオドは手作業で加える必要がある)

0:00:01.340,0:00:04.780
クラウドコンピューティングという言葉を
知っているだろうか.

0:00:04.780,0:00:08.110
クラウドコンピューティングとは
インターネットの先にあるデータセンター

0:00:08.110,0:00:12.420
のサーバーに処理してもらうシステム形態
を指す言葉である.

↓ move timestamp within XML tag, remove newlines

クラウドコンピューティングという言葉を知っているだろうか.クラウドコンピューティングとはインターネットの先にあるデータセンターのサーバーに処理してもらうシステム形態を指す言葉である. ">
<timestamp ts="0:00:01.340,0:00:04.780"/>クラウドコンピューティングという言葉を知っているだろうか.<timestamp ts="0:00:04.780,0:00:08.110"/>クラウドコンピューティングとはインターネットの先にあるデータセンター<timestamp ts="0:00:08.110,0:00:12.420"/>のサーバーに処理してもらうシステム形態を指す言葉である.

↓ translate with Deepl through API, ignoring XML tags

Do you know the term "cloud computing"? Cloud computing is a term that refers to a form of system that is processed by servers in a data center located beyond the Internet. ">
<timestamp ts="0:00:01.340,0:00:04.780"/>Do you know the term "cloud computing"? <timestamp ts="0:00:04.780,0:00:08.110"/> Cloud computing is a term that refers to a form of system that is processed by servers in a data center <timestamp ts="0:00:08.110,0:00:12.420"/>located beyond the Internet. 

↓ put back timestamp and newlines

0:00:01.340,0:00:04.780
Do you know the term "cloud computing"? 

0:00:04.780,0:00:08.110
 Cloud computing is a term that refers to a form of system that is processed by servers in a data center 

0:00:08.110,0:00:12.420
located beyond the Internet. 
Owner
Yasunori Toshimitsu
Yasunori Toshimitsu
Map Reduce Wordcount in Python using gRPC

This project is implemented in Python using gRPC. The input files are given in .txt format and the word count operation is performed.

Divija 4 Dec 05, 2022
一款高性能敏感词(非法词/脏字)检测过滤组件,附带繁体简体互换,支持全角半角互换,汉字转拼音,模糊搜索等功能。

一款高性能非法词(敏感词)检测组件,附带繁体简体互换,支持全角半角互换,获取拼音首字母,获取拼音字母,拼音模糊搜索等功能。

ToolGood 3.6k Jan 07, 2023
StealBit1.1 and earlier strings and config extraction scripts

StealBit1.1 and earlier scripts Use strings_decryptor.py to extract RC4 encrypted strings from a StealBit1.1 sample(s). Use config_extractor.py to ext

Soolidsnake 5 Dec 29, 2022
Word-Generator - Generates meaningful words from dictionary with given no. of letters and words.

Meaningful Word Generator Generates meaningful words from dictionary with given no. of letters and words. This might be useful for generating short li

Mohammed Rabil 1 Jan 01, 2022
RSS Reader application for the Emacs Application Framework.

EAF RSS Reader RSS Reader application for the Emacs Application Framework. Load application (add-to-list 'load-path "~/.emacs.d/site-lisp/eaf-rss-read

EAF 15 Dec 07, 2022
Word and phrase lists in CSV

Word Lists Word and phrase lists in CSV, collected from different sources. Oxford Word Lists: oxford-5k.csv - Oxford 3000 and 5000 oxford-opal.csv - O

Anton Zhiyanov 14 Oct 14, 2022
知乎评论区词云分析

zhihu-comment-wordcloud 知乎评论区词云分析 起源于:如何看待知乎问题“男生真的很不能接受彩礼吗?”的一个回答下评论数超8万条,创单个回答下评论数新记录? 项目代码说明 2.download_comment.py 下载全量评论 2.word_cloud_by_dt 生成词云 2

李国宝 10 Sep 26, 2022
ChirpText is a collection of text processing tools for Python 3.

ChirpText is a collection of text processing tools for Python 3. It is not meant to be a powerful tank like the popular NTLK but a small package which

Le Tuan Anh 5 Nov 30, 2022
A pipeline for making highlighted text stand-alone.

title emoji colorFrom colorTo sdk app_file pinned decontextualizer 📤 green gray streamlit main.py false Decontextualizer As a second step in improvin

Paul Bricman 26 Dec 17, 2022
Parse Any Text With Python

ParseAnyText A small package to parse strings. What is the work of it? Well It's a module to creates parser that helps to parse a text easily with les

Sayam Goswami 1 Jan 11, 2022
Um simulador de caixa registradora com database usando arquivos .txt

🛒 Caixa Registradora V2 ❓ - Como usar? Execute o caixa-registradora.py, nele vai ter um menu interativo, você pode cadastrar diversos produtos em um

Gabriel 0 Sep 25, 2022
Add your new words to a text file and get them randomly.

Memorize-New-Words In this very very very little project, I've wrote a code to memorize new english words. Therefore you can add the words and their m

Mostafa 2 Jul 04, 2022
A username generator made from French Canadian most common names.

This script is used to generate a username list using the most common first and last names in Quebec in different formats. It can generate some passwords using specific patterns such as Tremblay2020.

5 Nov 26, 2022
Format Covid values to ASCII-Table (Only for Germany and Austria)

Covid-19-Formatter (Only for Germany and Austria) Dieses Script speichert die gemeldeten Daten des RKIs / BMSGPK und formatiert diese zu einer Asci Ta

56 Jan 22, 2022
Migrates translations to the REDCap native Multi-Language Management system

Automates much of the process of moving translations from the old Multilingual external module to the newer built-in Multi-Language Management (MLM) page.

UCI MIND 3 Sep 27, 2022
Maiden & Spell community player ranking based on tournament data.

MnSRank Maiden & Spell community player ranking based on tournament data. Why? 2021 just ended and this seemed like a cool idea. Elo doesn't work well

Jonathan Lee 1 Apr 20, 2022
Translate .sbv subtitle files

deepl4subtitle Deeplを使って字幕ファイル(.sbv)を翻訳します。タイムスタンプも含めて出力しますが、翻訳時はタイムスタンプは文の一部とは切り離されるので、.sbvファイルをそのまま翻訳機に突っ込むよりも高精度な翻訳ができるはずです。 つかいかた 入力する.sbvファイルの前処理

Yasunori Toshimitsu 1 Oct 20, 2021
Split large XML files into smaller ones for easy upload

Split large XML files into smaller ones for easy upload. Works for WordPress Posts Import and other XML files.

Joseph Adediji 1 Jan 30, 2022
pydantic-i18n is an extension to support an i18n for the pydantic error messages.

pydantic-i18n is an extension to support an i18n for the pydantic error messages

Boardpack 48 Dec 21, 2022
The project is investigating methods to extract human-marked data from document forms such as surveys and tests.

The project is investigating methods to extract human-marked data from document forms such as surveys and tests. They can read questions, multiple-choice exam papers, and grade.

Harry 5 Mar 27, 2022