This repository provides implementations of data augmentation techniques for Japanese NLP.

Last update: Nov 11, 2022

daaja

This repository provides implementations of the following data augmentation techniques for Japanese NLP:

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
An Analysis of Simple Data Augmentation for Named Entity Recognition

Install

pip install daaja

How to use

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

Command

python -m daaja.eda.run --input input.tsv --output data_augmentor.tsv

The format of input.tsv is as follows:

1	この映画はとてもおもしろい
0	つまらない映画だった
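The two sample lines above suggest one example per line, with the label and the text separated by a tab. A minimal sketch for producing and reading such a file with the standard library follows; the helper names write_labeled_tsv and read_labeled_tsv are hypothetical and are not part of daaja.

```python
import csv
import tempfile
from pathlib import Path


def write_labeled_tsv(path, rows):
    """Write (label, text) pairs as one tab-separated line per example."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


def read_labeled_tsv(path):
    """Read the file back as a list of (label, text) tuples."""
    with open(path, encoding="utf-8", newline="") as f:
        return [(label, text) for label, text in csv.reader(f, delimiter="\t")]


rows = [("1", "この映画はとてもおもしろい"), ("0", "つまらない映画だった")]

# A temporary directory keeps the demonstration from touching real files.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "input.tsv"
    write_labeled_tsv(path, rows)
    print(read_labeled_tsv(path))
```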

In Python

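from daaja.eda import EasyDataAugmentor

# alpha_sr, alpha_ri, alpha_rs: strength of synonym replacement, random insertion and random swap
# p_rd: probability of random deletion; num_aug: number of augmented sentences to generate
augmentor = EasyDataAugmentor(alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=4)
text = "日本語でデータ拡張を行う"
aug_texts = augmentor.augments(text)
print(aug_texts)
# ['日本語でを拡張データ行う', '日本語でデータ押広げるを行う', '日本語でデータ拡張を行う', '日本語で智見拡張を行う', '日本語でデータ拡張を行う']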
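To make the parameters more concrete, here is a rough sketch of two of the four EDA operations, random swap and random deletion, as described in the EDA paper. It is an illustration only and not daaja's implementation: the input is assumed to be already tokenized, the function names are hypothetical, and synonym replacement and random insertion are omitted because they need a thesaurus.

```python
import random


def random_swap(tokens, alpha_rs, rng):
    """Swap two randomly chosen positions, repeated max(1, int(alpha_rs * len)) times."""
    tokens = list(tokens)  # work on a copy so the caller's list is untouched
    if len(tokens) < 2:
        return tokens
    n_swaps = max(1, int(alpha_rs * len(tokens)))
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p_rd, rng):
    """Drop each token independently with probability p_rd, never returning an empty list."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p_rd]
    # If every token was dropped, fall back to a single random token.
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)  # fixed seed so repeated runs give the same result
tokens = ["日本語", "で", "データ", "拡張", "を", "行う"]  # assumed tokenization
print("".join(random_swap(tokens, 0.1, rng)))
print("".join(random_deletion(tokens, 0.1, rng)))
```

Japanese text has no spaces between words, so the augmented tokens are joined with an empty string.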

An Analysis of Simple Data Augmentation for Named Entity Recognition

Command

python -m daaja.ner_sda.run --input input.tsv --output data_augmentor.tsv

The format of input.tsv is as follows:

私	O
は	O
田中	B-PER
と	O
いい	O
ます	O
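The sample above shows one token and its BIO label per line, separated by a tab. A small sketch for turning such lines into parallel token and label lists follows. It assumes that sentences are separated by blank lines, as in CoNLL-style files; the sample shows only a single sentence, so that separator is an assumption, and read_token_label_lines is a hypothetical helper rather than a daaja function.

```python
def read_token_label_lines(lines):
    """Group "token<TAB>label" lines into per-sentence token and label lists."""
    tokens_list, labels_list = [], []
    tokens, labels = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: close the current sentence (assumed separator).
            if tokens:
                tokens_list.append(tokens)
                labels_list.append(labels)
                tokens, labels = [], []
            continue
        token, label = line.split("\t")
        tokens.append(token)
        labels.append(label)
    if tokens:  # the last sentence may not be followed by a blank line
        tokens_list.append(tokens)
        labels_list.append(labels)
    return tokens_list, labels_list


sample = ["私\tO", "は\tO", "田中\tB-PER", "と\tO", "いい\tO", "ます\tO"]
tokens_list, labels_list = read_token_label_lines(sample)
print(tokens_list)
print(labels_list)
```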

In Python

from daaja.ner_sda import SimpleDataAugmentationforNER
tokens_list = [
    ["私", "は", "田中", "と", "いい", "ます"],
    ["筑波", "大学", "に", "所属", "して", "ます"],
    ["今日", "から", "筑波", "大学", "に", "通う"],
    ["茨城", "大学"],
]
labels_list = [
    ["O", "O", "B-PER", "O", "O", "O"],
    ["B-ORG", "I-ORG", "O", "O", "O", "O"],
    ["B-DATE", "O", "B-ORG", "I-ORG", "O", "O"],
    ["B-ORG", "I-ORG"],
]
augmentor = SimpleDataAugmentationforNER(tokens_list=tokens_list, labels_list=labels_list,
                                            p_power=1, p_lwtr=1, p_mr=1, p_sis=1, p_sr=1, num_aug=4)
tokens = ["吉田", "さん", "は", "株式", "会社", "A", "に", "出張", "予定", "だ"]
labels = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]
augmented_tokens_list, augmented_labels_list = augmentor.augments(tokens, labels)
print(augmented_tokens_list)
# [['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '志す', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '大学', '大学', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '筑波', '大学', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '予定', 'だ']]
print(augmented_labels_list)
# [['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O']]
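In the fourth augmented sentence above, the mention 株式 会社 A is replaced by 筑波 大学, which is what mention replacement would produce. The sketch below illustrates that one operation under simple assumptions: labels follow the BIO scheme, and replacement mentions are drawn from the training sentences by entity type. The function names are hypothetical and the code is not taken from daaja.

```python
import random
from collections import defaultdict


def extract_mentions(labels):
    """Return (start, end, entity_type) spans for BIO-labeled mentions."""
    spans, start = [], None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes a trailing mention
        if start is not None and not label.startswith("I-"):
            spans.append((start, i, labels[start][2:]))
            start = None
        if label.startswith("B-"):
            start = i
    return spans


def build_mention_pool(tokens_list, labels_list):
    """Collect every mention seen in the training data, grouped by entity type."""
    pool = defaultdict(list)
    for tokens, labels in zip(tokens_list, labels_list):
        for start, end, ent_type in extract_mentions(labels):
            pool[ent_type].append(tokens[start:end])
    return pool


def mention_replacement(tokens, labels, pool, p_mr, rng):
    """Replace each mention with probability p_mr by a random mention of the same type."""
    new_tokens, new_labels, cursor = [], [], 0
    for start, end, ent_type in extract_mentions(labels):
        new_tokens += tokens[cursor:start]
        new_labels += labels[cursor:start]
        mention = tokens[start:end]
        if pool.get(ent_type) and rng.random() < p_mr:
            mention = rng.choice(pool[ent_type])
        new_tokens += mention
        # Rebuild the labels because the new mention may have a different length.
        new_labels += ["B-" + ent_type] + ["I-" + ent_type] * (len(mention) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_labels += labels[cursor:]
    return new_tokens, new_labels


pool = build_mention_pool(
    [["筑波", "大学", "に", "所属", "して", "ます"]],
    [["B-ORG", "I-ORG", "O", "O", "O", "O"]],
)
tokens = ["吉田", "さん", "は", "株式", "会社", "A", "に", "出張", "予定", "だ"]
labels = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]
print(mention_replacement(tokens, labels, pool, p_mr=1.0, rng=random.Random(0)))
```

The other three operations in the paper (label-wise token replacement, synonym replacement and shuffle within segments) work on the same token and label lists.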

Comments

Too many progress bars

When I use EasyDataAugmentor during training, the console fills up with progress bars.

Could the tqdm call on line 19 of sequential_flow.py be made switchable on or off when EasyDataAugmentor is defined? https://github.com/kajyuuen/daaja/blob/12835943868d43f5c248cf1ea87ab60f67a6e03d/daaja/flows/sequential_flow.py#L19

opened by Yongtae723 6
Error on from daaja.methods.eda.easy_data_augmentor import EasyDataAugmentor

After installing daaja with pip, running from daaja.methods.eda.easy_data_augmentor import EasyDataAugmentor raises the following error: ConnectionError: HTTPConnectionPool(host='compling.hss.ntu.edu.sg', port=80): Max retries exceeded with url: /wnja/data/1.1/wnjpn.db.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3b6a6cced0>: Failed to establish a new connection: [Errno 110] Connection timed out'))

opened by naoki1213mj 5
Is it possible to run this on a GPU device?

Hi!

Thank you for the great library. When I train with this augmentation, it takes much more time than the forward and backward passes.

Could the augmentation run on a GPU to save time?

Thank you.

opened by Yongtae723 3
Bump joblib from 1.1.0 to 1.2.0
Bumps joblib from 1.1.0 to 1.2.0.

Changelog

Sourced from joblib's changelog.

Release 1.2.0

Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported. joblib/joblib#1327

Make sure that joblib works even when multiprocessing is not available, for instance with Pyodide joblib/joblib#1256

Avoid unnecessary warnings when workers and main process delete the temporary memmap folder contents concurrently. joblib/joblib#1263

Fix memory alignment bug for pickles containing numpy arrays. This is especially important when loading the pickle with mmap_mode != None as the resulting numpy.memmap object would not be able to correct the misalignment without performing a memory copy. This bug would cause invalid computation and segmentation faults with native code that would directly access the underlying data buffer of a numpy array, for instance C/C++/Cython code compiled with older GCC versions or some old OpenBLAS written in platform specific assembly. joblib/joblib#1254

Vendor cloudpickle 2.2.0 which adds support for PyPy 3.8+.

Vendor loky 3.3.0 which fixes several bugs including:

robustly forcibly terminating worker processes in case of a crash (joblib/joblib#1269);

avoiding leaking worker processes in case of nested loky parallel calls;

reliably spawn the correct number of reusable workers.

Release 1.1.1

Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported. joblib/joblib#1327

Commits

5991350 Release 1.2.0

3fa2188 MAINT cleanup numpy warnings related to np.matrix in tests (#1340)

cea26ff CI test the future loky-3.3.0 branch (#1338)

8aca6f4 MAINT: remove pytest.warns(None) warnings in pytest 7 (#1264)

067ed4f XFAIL test_child_raises_parent_exits_cleanly with multiprocessing (#1339)

ac4ebd5 MAINT add back pytest warnings plugin (#1337)

a23427d Test child raises parent exits cleanly more reliable on macos (#1335)

ac09691 [MAINT] various test updates (#1334)

4a314b1 Vendor loky 3.2.0 (#1333)

bdf47e9 Make test_parallel_with_interactively_defined_functions_default_backend timeo...


dependencies
opened by dependabot[bot] 0
Implement Data Augmentation using Pre-trained Transformer Models
paper

Data Augmentation using Pre-trained Transformer Models

code

https://github.com/varunkumar-dev/TransformersDataAugmentation

ref

https://www.ai-shift.co.jp/techblog/1939

add-new-technique
opened by kajyuuen 0
Implement Contextual Augmentation
Paper

Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations

Code

https://github.com/pfnet-research/contextual_augmentation

add-new-technique
opened by kajyuuen 0
Implement MixText
Paper

MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

Code

https://github.com/GT-SALT/MixText

add-new-technique
opened by kajyuuen 0

Releases(v0.0.7)

v0.0.7(Oct 24, 2022)
Changes

Change pytest @kajyuuen (#35 #37 #38)

Change WORDNER_URL @kajyuuen (#34)

daaja-0.0.7-py3-none-any.whl(18.19 KB)
v0.0.6(Mar 3, 2022)
Changes

Update version @kajyuuen (#27)

Add verbose option @kajyuuen (#25)

📖 Documentation

Add README_ja.md and Update README.md @kajyuuen (#26)

v0.0.5(Feb 27, 2022)
Changes

💪 Enhancement

Add ContextualAugmentor @kajyuuen (#23)

Add BackTranslationAugmentor @kajyuuen (#21 , #22)

📖 Documentation

Add quick_example @kajyuuen (#17)

v0.0.4(Feb 21, 2022)
Changes

Release v0.0.4 @kajyuuen (#16)

Chore add release drafter @kajyuuen (#6)

💪 Enhancement

Add tqdm @kajyuuen (#8)

📖 Documentation

Refactoring @kajyuuen (#15)

Add SDA example @kajyuuen (#9)

Add EDA example @kajyuuen (#7)

v0.0.3(Feb 13, 2022)

daaja-0.0.3-py3-none-any.whl(14.80 KB)
v0.0.2(Feb 13, 2022)

daaja-0.0.2-py3-none-any.whl(14.97 KB)

Owner

Koga Kobayashi

