A Lightweight NLP Data Loader for All Deep Learning Frameworks in Python

Overview

LineFlow: Framework-Agnostic NLP Data Loader in Python

LineFlow is a simple text dataset loader for NLP deep learning tasks.

  • LineFlow is designed to be used with any deep learning framework.
  • LineFlow enables you to build pipelines via functional APIs (.map, .filter, .flat_map).
  • LineFlow provides common NLP datasets.

LineFlow is heavily inspired by tensorflow.data.Dataset and chainer.dataset.

Basic Usage

lineflow.TextDataset expects line-oriented text files:

import lineflow as lf


'''/path/to/text is expected to look like this:
i 'm a line 1 .
i 'm a line 2 .
i 'm a line 3 .
'''
ds = lf.TextDataset('/path/to/text')

ds.first()  # "i 'm a line 1 ."
ds.all() # ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]
len(ds)  # 3
ds.map(lambda x: x.split()).first()  # ["i", "'m", "a", "line", "1", "."]
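
For intuition, the three functional APIs behave much like the familiar list operations below. This is a plain-Python analogy over an in-memory list, not LineFlow's implementation; the `lines` list is a stand-in for the three-line file shown above.

```python
from itertools import chain

# Stand-in for the three lines of /path/to/text shown above.
lines = ["i 'm a line 1 .", "i 'm a line 2 .", "i 'm a line 3 ."]

# .map(f): apply f to every item.
tokenized = [line.split() for line in lines]

# .filter(pred): keep only the items for which pred is true.
without_line_2 = [tokens for tokens in tokenized if "2" not in tokens]

# .flat_map(f): apply f to every item, then flatten the results by one level.
all_tokens = list(chain.from_iterable(line.split() for line in lines))

print(tokenized[0])
print(len(without_line_2))
print(all_tokens[:6])
```

In LineFlow itself the same steps are chained as methods on the dataset, as in the `ds.map(...)` call above.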

Example

  • Please check out the examples to see how to use LineFlow, especially for tokenization, building vocabulary, and indexing.

Load the Penn Treebank dataset:

>>> import lineflow.datasets as lfds
>>> train = lfds.PennTreebank('train')
>>> train.first()
' aer banknote berlitz calloway centrust cluett fromstein gitano guterman hydro-quebec ipo kia memotec mlx nahb punts rake regatta rubens sim snack-food ssangyong swapo wachter '

Split each sentence into words:

>>> # continuing from above
>>> train = train.map(str.split)
>>> train.first()
['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim', 'snack-food', 'ssangyong', 'swapo', 'wachter']

Obtain the words in the dataset:

>>> # continuing from above
>>> words = train.flat_map(lambda x: x)
>>> words.take(5) # This is useful to build vocabulary.
['aer', 'banknote', 'berlitz', 'calloway', 'centrust']
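
Once the words are flattened, building a vocabulary and indexing tokens can be sketched as follows. This is a minimal illustration on a plain list rather than a LineFlow dataset; the helper names `build_vocab` and `to_ids`, the special tokens `<pad>` and `<unk>`, and the repeated `'aer'` in the sample are assumptions made for the example.

```python
from collections import Counter

# Stand-in for the flattened `words` above; the trailing 'aer' is added
# only so that one token has a higher count.
words = ["aer", "banknote", "berlitz", "calloway", "centrust", "aer"]


def build_vocab(tokens, specials=("<pad>", "<unk>")):
    """Map each token to an integer id: specials first, then by frequency."""
    counts = Counter(tokens)
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, _ in counts.most_common():
        vocab.setdefault(tok, len(vocab))
    return vocab


def to_ids(tokens, vocab, unk="<unk>"):
    """Index a token sequence, falling back to the unknown-token id."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]


vocab = build_vocab(words)
print(to_ids(["aer", "banknote", "unseen"], vocab))
```

With a real dataset, a function like `to_ids` could be passed to `.map` so that every tokenized sentence is indexed on access.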

Requirements

  • Python 3.6+

Installation

To install LineFlow:

pip install lineflow

Datasets

Is the dataset you want to use not supported? Suggest a new dataset 🎉

Commonsense Reasoning

CommonsenseQA

Loads the CommonsenseQA dataset:

>>> import lineflow.datasets as lfds

>>> train = lfds.CommonsenseQA("train")
>>> dev = lfds.CommonsenseQA("dev")
>>> test = lfds.CommonsenseQA("test")

The items in this dataset look as follows:

>>> import lineflow.datasets as lfds

>>> train = lfds.CommonsenseQA("train")
>>> train.first()
{"id": "075e483d21c29a511267ef62bedc0461",
 "answer_key": "A",
 "options": {"A": "ignore",
             "B": "enforce",
             "C": "authoritarian",
             "D": "yell at",
             "E": "avoid"},
 "stem": "The sanctions against the school were a punishing blow, and they seemed to what the efforts the school had made to change?"}
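
An item like this can be turned into a (question, choices, label) triple for a multiple-choice model. The sketch below works on the sample item shown above; the helper name `to_example` and the choice of an integer label are assumptions for illustration, not part of LineFlow.

```python
# Sample item copied from the output above.
item = {
    "id": "075e483d21c29a511267ef62bedc0461",
    "answer_key": "A",
    "options": {"A": "ignore", "B": "enforce", "C": "authoritarian",
                "D": "yell at", "E": "avoid"},
    "stem": "The sanctions against the school were a punishing blow, and they "
            "seemed to what the efforts the school had made to change?",
}


def to_example(item):
    """Return (question, choices, label), with label indexing into choices."""
    keys = sorted(item["options"])  # option keys in alphabetical order
    choices = [item["options"][k] for k in keys]
    label = keys.index(item["answer_key"])
    return item["stem"], choices, label


question, choices, label = to_example(item)
print(choices[label])
```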

Language Modeling

Penn Treebank

Loads the Penn Treebank dataset:

import lineflow.datasets as lfds

train = lfds.PennTreebank('train')
dev = lfds.PennTreebank('dev')
test = lfds.PennTreebank('test')

WikiText-103

Loads the WikiText-103 dataset:

import lineflow.datasets as lfds

train = lfds.WikiText103('train')
dev = lfds.WikiText103('dev')
test = lfds.WikiText103('test')

This dataset is preprocessed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.WikiText103('train').flat_map(lambda x: x.split() + ['<eos>'])
>>> train.take(5)
['<eos>', '=', 'Valkyria', 'Chronicles', 'III']

WikiText-2 (Added by @sobamchan, thanks.)

Loads the WikiText-2 dataset:

import lineflow.datasets as lfds

train = lfds.WikiText2('train')
dev = lfds.WikiText2('dev')
test = lfds.WikiText2('test')

This dataset is preprocessed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.WikiText2('train').flat_map(lambda x: x.split() + ['<eos>'])
>>> train.take(5)
['<eos>', '=', 'Valkyria', 'Chronicles', 'III']

Machine Translation

small_parallel_enja:

Loads the small_parallel_enja dataset, a small English-Japanese parallel corpus:

import lineflow.datasets as lfds

train = lfds.SmallParallelEnJa('train')
dev = lfds.SmallParallelEnJa('dev')
test = lfds.SmallParallelEnJa('test')

This dataset is preprocessed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.SmallParallelEnJa('train').map(lambda x: (x[0].split(), x[1].split()))
>>> train.first()
(['i', 'can', "'t", 'tell', 'who', 'will', 'arrive', 'first', '.'], ['誰', 'が', '一番', 'に', '着', 'く', 'か', '私', 'に', 'は', '分か', 'り', 'ま', 'せ', 'ん', '。'])

Paraphrase

Microsoft Research Paraphrase Corpus:

Loads the Microsoft Research Paraphrase Corpus:

import lineflow.datasets as lfds

train = lfds.MsrParaphrase('train')
test = lfds.MsrParaphrase('test')

An item in this dataset looks as follows:

>>> import lineflow.datasets as lfds
>>> train = lfds.MsrParaphrase('train')
>>> train.first()
{'quality': '1',
 'id1': '702876',
 'id2': '702977',
 'string1': 'Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.',
 'string2': 'Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.'
}
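
Note that every field, including 'quality', is a string. A small conversion step, sketched here on the sample item above, turns an item into a sentence pair with an integer label. The helper name `to_pair` is hypothetical.

```python
# Sample item copied from the output above.
item = {
    "quality": "1",
    "id1": "702876",
    "id2": "702977",
    "string1": 'Amrozi accused his brother, whom he called "the witness", '
               "of deliberately distorting his evidence.",
    "string2": 'Referring to him as only "the witness", Amrozi accused his '
               "brother of deliberately distorting his evidence.",
}


def to_pair(item):
    """Return (sentence1, sentence2, label) with the label cast to int."""
    return item["string1"], item["string2"], int(item["quality"])


sentence1, sentence2, label = to_pair(item)
print(label)
```

A function like this could be applied to the whole dataset with `.map`.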

Question Answering

SQuAD:

Loads the SQuAD dataset:

import lineflow.datasets as lfds

train = lfds.Squad('train')
dev = lfds.Squad('dev')

An item in this dataset looks as follows:

>>> import lineflow.datasets as lfds
>>> train = lfds.Squad('train')
>>> train.first()
{'answers': [{'answer_start': 515, 'text': 'Saint Bernadette Soubirous'}],
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'}
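
Each answer carries a character offset ('answer_start') into 'context', so the answer span can be recovered by slicing. The sketch below checks that offsets and text agree; the helper name `answer_spans` and the miniature item are hypothetical, and only the keys follow the output above.

```python
def answer_spans(item):
    """Return (start, end) character offsets for each answer in the item.

    Raises ValueError if an offset does not point at the answer text.
    """
    spans = []
    for answer in item["answers"]:
        start = answer["answer_start"]
        end = start + len(answer["text"])
        if item["context"][start:end] != answer["text"]:
            raise ValueError(f"answer offset mismatch in item {item['id']}")
        spans.append((start, end))
    return spans


# Hypothetical miniature item with the same keys as the output above.
context = "It is a replica of the grotto at Lourdes, France."
text = "Lourdes"
item = {
    "id": "example-0",
    "title": "Example",
    "question": "Where is the original grotto?",
    "context": context,
    "answers": [{"answer_start": context.index(text), "text": text}],
}

start, end = answer_spans(item)[0]
print(context[start:end])
```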

Sentiment Analysis

IMDB:

Loads the IMDB dataset:

import lineflow.datasets as lfds

train = lfds.Imdb('train')
test = lfds.Imdb('test')

An item in this dataset looks as follows:

>>> import lineflow.datasets as lfds
>>> train = lfds.Imdb('train')
>>> train.first()
('For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.', 0)
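
Since each item is a (text, label) tuple, texts and labels can be separated with `zip`, and the label balance checked with a counter. The `samples` list below is a hypothetical stand-in with the same shape as the item above; its texts and labels are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in: (text, label) tuples shaped like the item above.
samples = [("first review text", 0), ("second review text", 1), ("third review text", 0)]

# Split the pairs into two parallel tuples.
texts, labels = zip(*samples)

# Count how many items carry each label.
label_counts = Counter(labels)
print(label_counts)
```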

Sequence Tagging

CoNLL2000

Loads the CoNLL2000 dataset:

import lineflow.datasets as lfds

train = lfds.Conll2000('train')
test = lfds.Conll2000('test')

Text Summarization

CNN / Daily Mail:

Loads the CNN / Daily Mail dataset:

import lineflow.datasets as lfds

train = lfds.CnnDailymail('train')
dev = lfds.CnnDailymail('dev')
test = lfds.CnnDailymail('test')

This dataset is preprocessed, so you can tokenize each line with str.split:

>>> import lineflow.datasets as lfds
>>> train = lfds.CnnDailymail('train').map(lambda x: (x[0].split(), x[1].split()))
>>> train.first()
... # the output is omitted because it's too long to display here.

SciTLDR

Loads the SciTLDR dataset:

import lineflow.datasets as lfds

train = lfds.SciTLDR('train')
dev = lfds.SciTLDR('dev')
test = lfds.SciTLDR('test')
Comments
  • Revert

    Revert "Added CommonsenseQA dataset."

    I'm sorry to bring this up after merging, but I'd like you to fix the following:

    • Add CommonsenseQA to README.md
    • Add lineflow.commonsenseqa.get_commonsenseqa to lineflow/datasets/__init__.py
    opened by yasufumy 3
  • Should slice of IterableDataset return IterableDataset not List?

    Is your feature request related to a problem? Please describe.

    train = lfds.SciTLDR(split="train")  # IterableDataset
    train_mini = train[:10]  # Now this is just a python list (List[Any])
    

    If I make a subset of a dataset, it loses all the features such as .map.

    Describe the solution you'd like: Return an IterableDataset instead of a List.
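
The requested behaviour can be sketched with a small wrapper whose `__getitem__` returns a dataset for slices and a single item for integer indices. `SliceableDataset` is a hypothetical class written for this illustration; it is eager and in-memory, and it is not LineFlow's actual implementation.

```python
class SliceableDataset:
    """Hypothetical in-memory dataset whose slices stay datasets."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Wrap the sliced list so that .map is still available on it.
            return SliceableDataset(self._items[index])
        return self._items[index]

    def map(self, func):
        """Apply func to every item and return a new dataset."""
        return SliceableDataset(func(x) for x in self._items)

    def all(self):
        """Return all items as a plain list."""
        return list(self._items)


train = SliceableDataset(["a b", "c d", "e f"])
train_mini = train[:2]  # still a SliceableDataset, not a list
print(train_mini.map(str.split).all())
```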

    opened by sobamchan 2
  • Bump isort from 5.6.4 to 5.7.0

    Bump isort from 5.6.4 to 5.7.0

    Bumps isort from 5.6.4 to 5.7.0.

    Release notes

    Sourced from isort's releases.

    5.7.0 December 30th 2020

    • Fixed #1612: In rare circumstances an extra comma is added after import and before comment.
    • Fixed #1593: isort encounters bug in Python 3.6.0.
    • Implemented #1596: Provide ways for extension formatting and file paths to be specified when using streaming input from CLI.
    • Implemented #1583: Ability to output and diff within a single API call to isort.file.
    • Implemented #1562, #1592 & #1593: Better more useful fatal error messages.
    • Implemented #1575: Support for automatically fixing mixed indentation of import sections.
    • Implemented #1582: Added a CLI option for skipping symlinks.
    • Implemented #1603: Support for disabling float_to_top from the command line.
    • Implemented #1604: Allow toggling section comments on and off for indented import sections.
    dependencies 
    opened by dependabot-preview[bot] 1
  • Bump pytest from 6.1.2 to 6.2.0

    Bump pytest from 6.1.2 to 6.2.0

    Bumps pytest from 6.1.2 to 6.2.0.

    Release notes

    Sourced from pytest's releases.

    6.2.0

    pytest 6.2.0 (2020-12-12)

    Breaking Changes

    • #7808: pytest now supports python3.6+ only.

    Deprecations

    • #7469: Directly constructing/calling the following classes/functions is now deprecated:

      • _pytest.cacheprovider.Cache
      • _pytest.cacheprovider.Cache.for_config()
      • _pytest.cacheprovider.Cache.clear_cache()
      • _pytest.cacheprovider.Cache.cache_dir_from_config()
      • _pytest.capture.CaptureFixture
      • _pytest.fixtures.FixtureRequest
      • _pytest.fixtures.SubRequest
      • _pytest.logging.LogCaptureFixture
      • _pytest.pytester.Pytester
      • _pytest.pytester.Testdir
      • _pytest.recwarn.WarningsRecorder
      • _pytest.recwarn.WarningsChecker
      • _pytest.tmpdir.TempPathFactory
      • _pytest.tmpdir.TempdirFactory

      These have always been considered private, but now issue a deprecation warning, which may become a hard error in pytest 7.0.0.

    • #7530: The --strict command-line option has been deprecated, use --strict-markers instead.

      We may reintroduce --strict in the future and make it an encompassing flag for all strictness-related options (--strict-markers and --strict-config at the moment; more might be introduced in the future).

    • #7988: The @pytest.yield_fixture decorator/function is now deprecated. Use pytest.fixture instead.

      yield_fixture has been an alias for fixture for a very long time, so can be search/replaced safely.

    Features

    • #5299: pytest now warns about unraisable exceptions and unhandled thread exceptions that occur in tests on Python>=3.8. See unraisable for more information.

    • #7425: New pytester fixture, which is identical to testdir but its methods return pathlib.Path when appropriate instead of py.path.local.

      This is part of the movement to use pathlib.Path objects internally, in order to remove the dependency to py in the future.

      Internally, the old Testdir (_pytest.pytester.Testdir) is now a thin wrapper around Pytester (_pytest.pytester.Pytester), preserving the old interface.

    Commits
    • e7073af Prepare release version 6.2.0
    • 683f29f Merge pull request #8129 from bluetech/docs-pygments-workaround
    • 0feeddf doc: temporary workaround for pytest-pygments lexing error
    • b478275 Merge pull request #8128 from bluetech/skip-reason-empty
    • 3302ff9 terminal: when the skip/xfail is empty, don't show it as "()"
    • 59bd0f6 Merge pull request #8126 from bluetech/tox-regen-pretend-scm2
    • 6298ff1 tox: use pip legacy resolver for regen job
    • d51ecbd Merge pull request #8125 from bluetech/tox-rm-pip-req
    • f237b07 tox: remove requires: pip>=20.3.1
    • 95e0e19 Merge pull request #8124 from bluetech/s0undt3ch-feature/skip-context-hook
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot-preview[bot] 1
  • CVE-2007-4559 Patch

    CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.
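
The check described above can be sketched roughly as follows. This is a minimal illustration of the general idea, not the actual patch from the pull request: the helper names `is_within_directory` and `safe_extractall` are hypothetical, and the sketch only validates member paths (it does not inspect symlink or hardlink targets).

```python
import os
import tarfile


def is_within_directory(directory: str, target: str) -> bool:
    """Return True if `target` resolves to a path inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    # commonpath equals the directory itself only when target lies inside it.
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extractall(tar: tarfile.TarFile, path: str = ".") -> None:
    """Extract every member of `tar`, refusing archives that escape `path`."""
    # Validate all members first so nothing is written if any member is unsafe.
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(f"Attempted path traversal in tar file: {member.name}")
    tar.extractall(path)
```

A caller would replace a bare `tar.extractall(dest)` with `safe_extractall(tar, dest)`. Recent Python versions also offer extraction filters on `extractall()` (for example `filter="data"`), which may be preferable where available.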

    opened by TrellixVulnTeam 0
  • wmt14 google drive link is dead now.

    wmt14 google drive link is dead now.

    Describe the bug: The Google Drive link to download the wmt14 dataset is now unavailable.

    To Reproduce

    import lineflow.datasets as lfds
    train_dataset = lfds.Wmt14("train")
    

    Additional context: I can try finding a working URL when I have some time.

    opened by sobamchan 0
  • Add support for <SNLI and MLNLI>

    Add support for <SNLI and MLNLI>

    Datasets

    • SNLI dataset: https://nlp.stanford.edu/projects/snli/
    • MLNLI dataset: https://www.nyu.edu/projects/bowman/multinli/

    Please provide both datasets individually, and the combined dataset as ALLNLI.

    opened by ashutosh-dwivedi-e3502 0
Releases(v0.6.8)