A Lightweight NLP Data Loader for All Deep Learning Frameworks in Python

Overview

LineFlow: Framework-Agnostic NLP Data Loader in Python

CI codecov

LineFlow is a simple text dataset loader for NLP deep learning tasks.

  • LineFlow was designed to use in all deep learning frameworks.
  • LineFlow enables you to build pipelines via functional APIs (.map, .filter, .flat_map).
  • LineFlow provides common NLP datasets.

LineFlow is heavily inspired by tensorflow.data.Dataset and chainer.dataset.

Basic Usage

lineflow.TextDataset expects line-oriented text files:

import lineflow as lf


'''/path/to/text will be expected as follows:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')

ds.first()  # "i 'm a line 1 ."
ds.all() # ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
len(ds)  # 3
ds.map(lambda x: x.split()).first()  # ["i", "'m", "a", "line", "1", "."]

Example

  • Please check out the examples to see how to use LineFlow, especially for tokenization, building vocabulary, and indexing.

Loads Penn Treebank:

>>> import lineflow.datasets as lfds
>>> train = lfds.PennTreebank('train')
>>> train.first()
' aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter '

Splits the sentence to the words:

>>> # continuing from above
>>> train = train.map(str.split)
>>> train.first()
['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter']

Obtains words in dataset:

>>> # continuing from above
>>> words = train.flat_map(lambda x: x)
>>> words.take(5) # This is useful to build vocabulary.
['aer', 'banknote', 'berlitz', 'calloway', 'centrust']

Further more:

Requirements

  • Python3.6+

Installation

To install LineFlow:

pip install lineflow

Datasets

Is the dataset you want to use not supported? Suggest a new dataset 🎉

Commonsense Reasoning

CommonsenseQA

Loads the CommonsenseQA dataset:

>> dev = lfds.CommonsenseQA("dev") >>> test = lfds.CommonsenseQA("test")">
>>> import lineflow.datasets as lfds

>>> train = lfds.CommonsenseQA("train")
>>> dev = lfds.CommonsenseQA("dev")
>>> test = lfds.CommonsenseQA("test")

The items in this datset as follows:

>> train.first() {"id": "075e483d21c29a511267ef62bedc0461", "answer_key": "A", "options": {"A": "ignore", "B": "enforce", "C": "authoritarian", "D": "yell at", "E": "avoid"}, "stem": "The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?"} }">
>>> import lineflow.datasets as lfds

>>> train = lfds.CommonsenseQA("train")
>>> train.first()
{"id": "075e483d21c29a511267ef62bedc0461",
 "answer_key": "A",
 "options": {"A": "ignore",
 "B": "enforce",
 "C": "authoritarian",
 "D": "yell at",
 "E": "avoid"},
 "stem": "The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?"}
}

Language Modeling

Penn Treebank

Loads the Penn Treebank dataset:

import lineflow.datasets as lfds

train = lfds.PennTreebank('train')
dev = lfds.PennTreebank('dev')
test = lfds.PennTreebank('test')

WikiText-103

Loads the WikiText-103 dataset:

import lineflow.datasets as lfds

train = lfds.WikiText103('train')
dev = lfds.WikiText103('dev')
test = lfds.WikiText103('test')

This dataset is preprossed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.WikiText103('train').flat_map(lambda x: x.split() + ['
   
    '
   ])
>>> train.take(5)
['
   
    '
   , '=', 'Valkyria', 'Chronicles', 'III']

WikiText-2 (Added by @sobamchan, thanks.)

Loads the WikiText-2 dataset:

import lineflow.datasets as lfds

train = lfds.WikiText2('train')
dev = lfds.WikiText2('dev')
test = lfds.WikiText2('test')

This dataset is preprossed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.WikiText2('train').flat_map(lambda x: x.split() + ['
   
    '
   ])
>>> train.take(5)
['
   
    '
   , '=', 'Valkyria', 'Chronicles', 'III']

Machine Translation

small_parallel_enja:

Loads the small_parallel_enja dataset which is small English-Japanese parallel corpus:

import lineflow.datasets as lfds

train = lfds.SmallParallelEnJa('train')
dev = lfds.SmallParallelEnJa('dev')
test = lfd.SmallParallelEnJa('test')

This dataset is preprossed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.SmallParallelEnJa('train').map(lambda x: (x[0].split(), x[1].split()))
>>> train.first()
(['i', 'can', "'t", 'tell', 'who', 'will', 'arrive', 'first', '.'], ['誰', 'が', '一番', 'に', '着', 'く', 'か', '私', 'に', 'は', '分か', 'り', 'ま', 'せ', 'ん', '。']

Paraphrase

Microsoft Research Paraphrase Corpus:

Loads the Miscrosoft Research Paraphrase Corpus:

import lineflow.datasets as lfds

train = lfds.MsrParaphrase('train')
test = lfds.MsrParaphrase('test')

The item in this dataset as follows:

>>> import lineflow.datasets as lfds
>>> train = lfds.MsrParaphrase('train')
>>> train.first()
{'quality': '1',
 'id1': '702876',
 'id2': '702977',
 'string1': 'Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.',
 'string2': 'Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.'
}

Question Answering

SQuAD:

Loads the SQuAD dataset:

import lineflow.datasets as lfds

train = lfds.Squad('train')
dev = lfds.Squad('dev')

The item in this dataset as follows:

>>> import lineflow.datasets as lfds
>>> train = lfds.Squad('train')
>>> train.first()
{'answers': [{'answer_start': 515, 'text': 'Saint Bernadette Soubirous'}],
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'}

Sentiment Analysis

IMDB:

Loads the IMDB dataset:

import lineflow.datasets as lfds

train = lfds.Imdb('train')
test = lfds.Imdb('test')

The item in this dataset as follows:

>>> import lineflow.datasets as lfds
>>> train = lfds.Imdb('train')
>>> train.first()
('For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.', 0)

Sequence Tagging

CoNLL2000

Loads the CoNLL2000 dataset:

import lineflow.datasets as lfds

train = lfds.Conll2000('train')
test = lfds.Conll2000('test')

Text Summarization

CNN / Daily Mail:

Loads the CNN / Daily Mail dataset:

import lineflow.datasets as lfds

train = lfds.CnnDailymail('train')
dev = lfds.CnnDailymail('dev')
test = lfds.CnnDailymail('test')

This dataset is preprossed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.CnnDailymail('train').map(lambda x: (x[0].split(), x[1].split()))
>>> train.first()
... # the output is omitted because it's too long to display here.

SciTLDR

Loads the TLDR dataset:

import lineflow.datasets as lfds

train = lfds.SciTLDR('train')
dev = lfds.SciTLDR('dev')
test = lfds.SciTLDR('test')
Comments
  • Revert

    Revert "Added CommonsenseQA dataset."

    I'm sorry to mention after merging but I'd like you to fix these below:

    • Add CommonsenseQA to README.md
    • Add lineflow.commonsenseqa.get_commonsenseqa to lineflow/datasets/__init__.py
    opened by yasufumy 3
  • Should slice of IterableDataset return IterableDataset not List?

    Should slice of IterableDataset return IterableDataset not List?

    Is your feature request related to a problem? Please describe.

    train = lfds.SciTLDR(split="train")  # IterableDataset
    train_mini = train[:10]  # Now this is just a python list (List[Any])
    

    If I make a subset of a dataset, it loses all the features such as .map.

    Describe the solution you'd like Return IterableDataset in stead of List.

    opened by sobamchan 2
  • Bump pytest-cov from 2.10.1 to 2.11.0

    Bump pytest-cov from 2.10.1 to 2.11.0

    Bumps pytest-cov from 2.10.1 to 2.11.0.

    Changelog

    Sourced from pytest-cov's changelog.

    2.11.0 (2021-01-18)

    • Bumped minimum coverage requirement to 5.2.1. This prevents reporting issues. Contributed by Mateus Berardo de Souza Terra in #433.
    • Improved sample projects (from the examples directory) to support running tox -e pyXY. Now the example configures a suffixed coverage data file, and that makes the cleanup environment unnecessary. Contributed by Ganden Schaffner in #435.
    • Removed the empty console_scripts entrypoint that confused some Gentoo build script. I didn't ask why it was so broken cause I didn't want to ruin my day. Contributed by Michał Górny in #434.
    • Fixed the missing coverage context when using subprocesses. Contributed by Bernát Gábor in #443.
    • Updated the config section in the docs. Contributed by Pamela McA'Nulty in #429.
    • Migrated CI to travis-ci.com (from .org).
    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
    • @dependabot badge me will comment on this PR with code to add a "Dependabot enabled" badge to your readme

    Additionally, you can set the following in your Dependabot dashboard:

    • Update frequency (including time of day and day of week)
    • Pull request limits (per update run and/or open at any time)
    • Out-of-range updates (receive only lockfile updates, if desired)
    • Security updates (receive only security updates, if desired)
    dependencies 
    opened by dependabot-preview[bot] 2
  • Bump flake8 from 3.7.9 to 3.8.1

    Bump flake8 from 3.7.9 to 3.8.1

    Bumps flake8 from 3.7.9 to 3.8.1.

    Commits
    • f94e009 Release 3.8.1
    • 00985a6 Merge branch 'issue638-ouput-file' into 'master'
    • e6d8a90 options: Forward --output-file to be reparsed for BaseFormatter
    • b4d2850 Release 3.8.0
    • 03c7dd3 Merge branch 'exclude_dotfiles' into 'master'
    • 9e67511 Fix using --exclude=.* to not match . and ..
    • 6c4b5c8 Merge branch 'linters_py3' into 'master'
    • 309db63 switch dogfood to use python3
    • 8905a7a Merge branch 'logical_position_out_of_bounds' into 'master'
    • 609010c Fix logical checks which report position out of bounds
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
    • @dependabot badge me will comment on this PR with code to add a "Dependabot enabled" badge to your readme

    Additionally, you can set the following in your Dependabot dashboard:

    • Update frequency (including time of day and day of week)
    • Pull request limits (per update run and/or open at any time)
    • Out-of-range updates (receive only lockfile updates, if desired)
    • Security updates (receive only security updates, if desired)
    dependencies 
    opened by dependabot-preview[bot] 2
  • Bump ipython from 7.22.0 to 7.23.0

    Bump ipython from 7.22.0 to 7.23.0

    Bumps ipython from 7.22.0 to 7.23.0.

    Commits
    • a0c0411 release 7.23.0
    • d1b43f2 Merge pull request #12936 from meeseeksmachine/auto-backport-of-pr-12934-on-7.x
    • 5fee80a Merge pull request #12935 from Carreau/auto-backport-of-pr-12932-on-7.x
    • 9f04101 Backport PR #12934: 7.23 release notes
    • 994fcbe Backport PR #12932: remove use of deprecated pipes module
    • a8955db Merge pull request #12925 from Carreau/auto-backport-of-pr-12817-on-7.x
    • feeb4ea Backport PR #12817: Use matplotlib-inline instead of ipykernel.pylab
    • 288ca33 Merge pull request #12919 from Carreau/auto-backport-of-pr-12823-on-7.x
    • 0f52b53 Backport PR #12823: Added clear kwarg to display()
    • 197b993 Merge pull request #12911 from meeseeksmachine/auto-backport-of-pr-12758-on-7.x
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
    • @dependabot badge me will comment on this PR with code to add a "Dependabot enabled" badge to your readme

    Additionally, you can set the following in your Dependabot dashboard:

    • Update frequency (including time of day and day of week)
    • Pull request limits (per update run and/or open at any time)
    • Out-of-range updates (receive only lockfile updates, if desired)
    • Security updates (receive only security updates, if desired)
    dependencies 
    opened by dependabot-preview[bot] 1
  • Bump autopep8 from 1.5.4 to 1.5.5

    Bump autopep8 from 1.5.4 to 1.5.5

    Bumps autopep8 from 1.5.4 to 1.5.5.

    Release notes

    Sourced from autopep8's releases.

    v1.5.5

    bug fix and minor improvements

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
    • @dependabot badge me will comment on this PR with code to add a "Dependabot enabled" badge to your readme

    Additionally, you can set the following in your Dependabot dashboard:

    • Update frequency (including time of day and day of week)
    • Pull request limits (per update run and/or open at any time)
    • Out-of-range updates (receive only lockfile updates, if desired)
    • Security updates (receive only security updates, if desired)
    dependencies 
    opened by dependabot-preview[bot] 1
  • Bump ipython from 7.19.0 to 7.20.0

    Bump ipython from 7.19.0 to 7.20.0

    Bumps ipython from 7.19.0 to 7.20.0.

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
    • @dependabot badge me will comment on this PR with code to add a "Dependabot enabled" badge to your readme

    Additionally, you can set the following in your Dependabot dashboard:

    • Update frequency (including time of day and day of week)
    • Pull request limits (per update run and/or open at any time)
    • Out-of-range updates (receive only lockfile updates, if desired)
    • Security updates (receive only security updates, if desired)
    dependencies 
    opened by dependabot-preview[bot] 1
  • Bump pytest from 6.2.1 to 6.2.2

    Bump pytest from 6.2.1 to 6.2.2

    Bumps pytest from 6.2.1 to 6.2.2.

    Release notes

    Sourced from pytest's releases.

    6.2.2

    pytest 6.2.2 (2021-01-25)

    Bug Fixes

    • #8152: Fixed "(<Skipped instance>)" being shown as a skip reason in the verbose test summary line when the reason is empty.
    • #8249: Fix the faulthandler plugin for occasions when running with twisted.logger and using pytest --capture=no.
    Changelog

    Sourced from pytest's changelog.

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
    • @dependabot badge me will comment on this PR with code to add a "Dependabot enabled" badge to your readme

    Additionally, you can set the following in your Dependabot dashboard:

    • Update frequency (including time of day and day of week)
    • Pull request limits (per update run and/or open at any time)
    • Out-of-range updates (receive only lockfile updates, if desired)
    • Security updates (receive only security updates, if desired)
    dependencies 
    opened by dependabot-preview[bot] 1
  • Bump pytest-cov from 2.10.1 to 2.11.1

    Bump pytest-cov from 2.10.1 to 2.11.1

    Bumps pytest-cov from 2.10.1 to 2.11.1.

    Changelog

    Sourced from pytest-cov's changelog.

    2.11.1 (2021-01-20)

    • Fixed support for newer setuptools (v42+). Contributed by Michał Górny in #451.

    2.11.0 (2021-01-18)

    • Bumped minimum coverage requirement to 5.2.1. This prevents reporting issues. Contributed by Mateus Berardo de Souza Terra in #433.
    • Improved sample projects (from the examples directory) to support running tox -e pyXY. Now the example configures a suffixed coverage data file, and that makes the cleanup environment unnecessary. Contributed by Ganden Schaffner in #435.
    • Removed the empty console_scripts entrypoint that confused some Gentoo build script. I didn't ask why it was so broken cause I didn't want to ruin my day. Contributed by Michał Górny in #434.
    • Fixed the missing coverage context when using subprocesses. Contributed by Bernát Gábor in #443.
    • Updated the config section in the docs. Contributed by Pamela McA'Nulty in #429.
    • Migrated CI to travis-ci.com (from .org).
    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
    • @dependabot badge me will comment on this PR with code to add a "Dependabot enabled" badge to your readme

    Additionally, you can set the following in your Dependabot dashboard:

    • Update frequency (including time of day and day of week)
    • Pull request limits (per update run and/or open at any time)
    • Out-of-range updates (receive only lockfile updates, if desired)
    • Security updates (receive only security updates, if desired)
    dependencies 
    opened by dependabot-preview[bot] 1
  • Bump isort from 5.6.4 to 5.7.0

    Bump isort from 5.6.4 to 5.7.0

    Bumps isort from 5.6.4 to 5.7.0.

    Release notes

    Sourced from isort's releases.

    5.7.0 December 30th 2020

    • Fixed #1612: In rare circumstances an extra comma is added after import and before comment.
    • Fixed #1593: isort encounters bug in Python 3.6.0.
    • Implemented #1596: Provide ways for extension formatting and file paths to be specified when using streaming input from CLI.
    • Implemented #1583: Ability to output and diff within a single API call to isort.file.
    • Implemented #1562, #1592 & #1593: Better more useful fatal error messages.
    • Implemented #1575: Support for automatically fixing mixed indentation of import sections.
    • Implemented #1582: Added a CLI option for skipping symlinks.
    • Implemented #1603: Support for disabling float_to_top from the command line.
    • Implemented #1604: Allow toggling section comments on and off for indented import sections.
    Changelog

    Sourced from isort's changelog.

    5.7.0 December 30th 2020

    • Fixed #1612: In rare circumstances an extra comma is added after import and before comment.
    • Fixed #1593: isort encounters bug in Python 3.6.0.
    • Implemented #1596: Provide ways for extension formatting and file paths to be specified when using streaming input from CLI.
    • Implemented #1583: Ability to output and diff within a single API call to isort.file.
    • Implemented #1562, #1592 & #1593: Better more useful fatal error messages.
    • Implemented #1575: Support for automatically fixing mixed indentation of import sections.
    • Implemented #1582: Added a CLI option for skipping symlinks.
    • Implemented #1603: Support for disabling float_to_top from the command line.
    • Implemented #1604: Allow toggling section comments on and off for indented import sections.
    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
    • @dependabot badge me will comment on this PR with code to add a "Dependabot enabled" badge to your readme

    Additionally, you can set the following in your Dependabot dashboard:

    • Update frequency (including time of day and day of week)
    • Pull request limits (per update run and/or open at any time)
    • Out-of-range updates (receive only lockfile updates, if desired)
    • Security updates (receive only security updates, if desired)
    dependencies 
    opened by dependabot-preview[bot] 1
  • Bump pytest from 6.1.2 to 6.2.0

    Bump pytest from 6.1.2 to 6.2.0

    Bumps pytest from 6.1.2 to 6.2.0.

    Release notes

    Sourced from pytest's releases.

    6.2.0

    pytest 6.2.0 (2020-12-12)

    Breaking Changes

    • #7808: pytest now supports python3.6+ only.

    Deprecations

    • #7469: Directly constructing/calling the following classes/functions is now deprecated:

      • _pytest.cacheprovider.Cache
      • _pytest.cacheprovider.Cache.for_config()
      • _pytest.cacheprovider.Cache.clear_cache()
      • _pytest.cacheprovider.Cache.cache_dir_from_config()
      • _pytest.capture.CaptureFixture
      • _pytest.fixtures.FixtureRequest
      • _pytest.fixtures.SubRequest
      • _pytest.logging.LogCaptureFixture
      • _pytest.pytester.Pytester
      • _pytest.pytester.Testdir
      • _pytest.recwarn.WarningsRecorder
      • _pytest.recwarn.WarningsChecker
      • _pytest.tmpdir.TempPathFactory
      • _pytest.tmpdir.TempdirFactory

      These have always been considered private, but now issue a deprecation warning, which may become a hard error in pytest 7.0.0.

    • #7530: The --strict command-line option has been deprecated, use --strict-markers instead.

      We have plans to maybe in the future to reintroduce --strict and make it an encompassing flag for all strictness related options (--strict-markers and --strict-config at the moment, more might be introduced in the future).

    • #7988: The @pytest.yield_fixture decorator/function is now deprecated. Use pytest.fixture instead.

      yield_fixture has been an alias for fixture for a very long time, so can be search/replaced safely.

    Features

    • #5299: pytest now warns about unraisable exceptions and unhandled thread exceptions that occur in tests on Python>=3.8. See unraisable for more information.

    • #7425: New pytester fixture, which is identical to testdir but its methods return pathlib.Path when appropriate instead of py.path.local.

      This is part of the movement to use pathlib.Path objects internally, in order to remove the dependency to py in the future.

      Internally, the old Testdir <_pytest.pytester.Testdir> is now a thin wrapper around Pytester <_pytest.pytester.Pytester>, preserving the old interface.

    Changelog

    Sourced from pytest's changelog.

    Commits
    • e7073af Prepare release version 6.2.0
    • 683f29f Merge pull request #8129 from bluetech/docs-pygments-workaround
    • 0feeddf doc: temporary workaround for pytest-pygments lexing error
    • b478275 Merge pull request #8128 from bluetech/skip-reason-empty
    • 3302ff9 terminal: when the skip/xfail is empty, don't show it as "()"
    • 59bd0f6 Merge pull request #8126 from bluetech/tox-regen-pretend-scm2
    • 6298ff1 tox: use pip legacy resolver for regen job
    • d51ecbd Merge pull request #8125 from bluetech/tox-rm-pip-req
    • f237b07 tox: remove requires: pip>=20.3.1
    • 95e0e19 Merge pull request #8124 from bluetech/s0undt3ch-feature/skip-context-hook
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language
    • @dependabot badge me will comment on this PR with code to add a "Dependabot enabled" badge to your readme

    Additionally, you can set the following in your Dependabot dashboard:

    • Update frequency (including time of day and day of week)
    • Pull request limits (per update run and/or open at any time)
    • Out-of-range updates (receive only lockfile updates, if desired)
    • Security updates (receive only security updates, if desired)
    dependencies 
    opened by dependabot-preview[bot] 1
  • CVE-2007-4559 Patch

    CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have began a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15 year old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsantized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions you may contact us through this projects lead researcher Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • wmt14 google drive link is dead now.

    wmt14 google drive link is dead now.

    Describe the bug the google drive link to download wmt14 dataset is now unavailable.

    To Reproduce

    import lineflow.datasets as lfds
    train_dataset = lfds.Wmt14("train")
    

    Expected behavior A clear and concise description of what you expected to happen.

    Screenshots If applicable, add screenshots to help explain your problem.

    Desktop (please complete the following information):

    • OS: [e.g. iOS]
    • Browser [e.g. chrome, safari]
    • Version [e.g. 22]

    Smartphone (please complete the following information):

    • Device: [e.g. iPhone6]
    • OS: [e.g. iOS8.1]
    • Browser [e.g. stock browser, safari]
    • Version [e.g. 22]

    Additional context I can try finding a working URL when I have some time.

    opened by sobamchan 0
  • Add support for <SNLI and MLNLI>

    Add support for

    Datasets

    *** SNLI dataset *** : https://nlp.stanford.edu/projects/snli/ *** MLNLI dataset *** : https://www.nyu.edu/projects/bowman/multinli/

    Please provide both of the datasets individually and the combined dataset as ALLNLI

    opened by ashutosh-dwivedi-e3502 0
Releases(v0.6.8)
Python library to make development of portfolio analysis faster and easier

Trafalgar Python library to make development of portfolio analysis faster and easier Installation 🔥 For the moment, Trafalgar is still in beta develo

Santosh Passoubady 641 Jan 01, 2023
Utilize Korean BERT model in sentence-transformers library

ko-sentence-transformers 이 프로젝트는 KoBERT 모델을 sentence-transformers 에서 보다 쉽게 사용하기 위해 만들어졌습니다. Ko-Sentence-BERT-SKTBERT 프로젝트에서는 KoBERT 모델을 sentence-trans

Junghyun 40 Dec 20, 2022
a test times augmentation toolkit based on paddle2.0.

Patta Image Test Time Augmentation with Paddle2.0! Input | # input batch of images / / /|\ \ \ # apply

AgentMaker 110 Dec 03, 2022
This repository contains data used in the NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

Proteno This is the data release associated with the corresponding NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deploymen

37 Dec 04, 2022
MASS: Masked Sequence to Sequence Pre-training for Language Generation

MASS: Masked Sequence to Sequence Pre-training for Language Generation

Microsoft 1.1k Dec 17, 2022
KoBART model on huggingface transformers

KoBART-Transformers SKT에서 공개한 KoBART를 편리하게 사용할 수 있게 transformers로 포팅하였습니다. Install (Optional) BartModel과 PreTrainedTokenizerFast를 이용하면 설치하실 필요 없습니다. p

Hyunwoong Ko 58 Dec 07, 2022
Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

STEMM: Self-learning with Speech-Text Manifold Mixup for Speech Translation This is a PyTorch implementation for the ACL 2022 main conference paper ST

ICTNLP 29 Oct 16, 2022
Ukrainian TTS (text-to-speech) using Coqui TTS

title emoji colorFrom colorTo sdk app_file pinned Ukrainian TTS 🐸 green green gradio app.py false Ukrainian TTS 📢 🤖 Ukrainian TTS (text-to-speech)

Yurii Paniv 85 Dec 26, 2022
A Practitioner's Guide to Natural Language Processing

Learn how to process, classify, cluster, summarize, understand syntax, semantics and sentiment of text data with the power of Python! This repository contains code and datasets used in my book, Text

Dipanjan (DJ) Sarkar 1.5k Jan 03, 2023
Unsupervised Language Model Pre-training for French

FlauBERT and FLUE FlauBERT is a French BERT trained on a very large and heterogeneous French corpus. Models of different sizes are trained using the n

GETALP 212 Dec 10, 2022
A python package to fine-tune transformer-based models for named entity recognition (NER).

nerblackbox A python package to fine-tune transformer-based language models for named entity recognition (NER). Resources Source Code: https://github.

Felix Stollenwerk 13 Jul 30, 2022
Code for the paper "Flexible Generation of Natural Language Deductions"

Code for the paper "Flexible Generation of Natural Language Deductions"

Kaj Bostrom 12 Nov 11, 2022
Repository for the paper "Optimal Subarchitecture Extraction for BERT"

Bort Companion code for the paper "Optimal Subarchitecture Extraction for BERT." Bort is an optimal subset of architectural parameters for the BERT ar

Alexa 461 Nov 21, 2022
Amazon Multilingual Counterfactual Dataset (AMCD)

Amazon Multilingual Counterfactual Dataset (AMCD)

35 Sep 20, 2022
2021搜狐校园文本匹配算法大赛baseline

sohu2021-baseline 2021搜狐校园文本匹配算法大赛baseline 简介 分享了一个搜狐文本匹配的baseline,主要是通过条件LayerNorm来增加模型的多样性,以实现同一模型处理不同类型的数据、形成不同输出的目的。 线下验证集F1约0.74,线上测试集F1约0.73。

苏剑林(Jianlin Su) 45 Sep 06, 2022
text to speech toolkit. 好用的中文语音合成工具箱,包含语音编码器、语音合成器、声码器和可视化模块。

ttskit Text To Speech Toolkit: 语音合成工具箱。 安装 pip install -U ttskit 注意 可能需另外安装的依赖包:torch,版本要求torch=1.6.0,=1.7.1,根据自己的实际环境安装合适cuda或cpu版本的torch。 ttskit的

KDD 483 Jan 04, 2023
Findings of ACL 2021

Assessing Dialogue Systems with Distribution Distances [arXiv][code] We propose to measure the performance of a dialogue system by computing the distr

Yahui Liu 16 Feb 24, 2022
The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank

Main Idea The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank Semantic Search Re

Sergio Arnaud Gomez 2 Jan 28, 2022
Proquabet - Convert your prose into proquints and then you essentially have Vogon poetry

Proquabet Turn your prose into a constant stream of encrypted and meaningless-so

Milo Fultz 2 Oct 10, 2022
Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors"

SWRM Code for Findings of ACL 2022 Paper "Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors" Clone Clone th

14 Jan 03, 2023