Basic Utilities for PyTorch Natural Language Processing (NLP)

Overview

Basic Utilities for PyTorch Natural Language Processing (NLP)

PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. torchnlp extends PyTorch to provide you with basic text data processing functions.

PyPI - Python Version Codecov Downloads Documentation Status Build Status Twitter: PetrochukM

Logo by Chloe Yeo, Corporate Sponsorship by WellSaid Labs

Installation 🐾

Make sure you have Python 3.6+ and PyTorch 1.0+. You can then install pytorch-nlp using pip:

pip install pytorch-nlp

Or to install the latest code via:

pip install git+https://github.com/PetrochukM/PyTorch-NLP.git

Docs

The complete documentation for PyTorch-NLP is available via our ReadTheDocs website.

Get Started

Within an NLP data pipeline, you'll want to implement these basic steps:

1. Load your Data 🐿

Load the IMDB dataset, for example:

from torchnlp.datasets import imdb_dataset

# Load the imdb training dataset
train = imdb_dataset(train=True)
train[0]  # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}

Load a custom dataset, for example:

from pathlib import Path

from torchnlp.download import download_file_maybe_extract

directory_path = Path('data/')
train_file_path = Path('trees/train.txt')

download_file_maybe_extract(
    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
    directory=directory_path,
    check_files=[train_file_path])

open(directory_path / train_file_path)

Don't worry we'll handle caching for you!

2. Text to Tensor

Tokenize and encode your text as a tensor.

For example, a WhitespaceEncoder breaks text into tokens whenever it encounters a whitespace character.

from torchnlp.encoders.text import WhitespaceEncoder

loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = WhitespaceEncoder(loaded_data)
encoded_data = [encoder.encode(example) for example in loaded_data]

3. Tensor to Batch

With your loaded and encoded data in hand, you'll want to batch your dataset.

import torch
from torchnlp.samplers import BucketBatchSampler
from torchnlp.utils import collate_tensors
from torchnlp.encoders.text import stack_and_pad_tensors

encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]

train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
train_batch_sampler = BucketBatchSampler(
    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])

batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]

PyTorch-NLP builds on top of PyTorch's existing torch.utils.data.sampler, torch.stack and default_collate to support sequential inputs of varying lengths!

4. Training and Inference

With your batch in hand, you can use PyTorch to develop and train your model using gradient descent. For example, check out this example code for training on the Stanford Natural Language Inference (SNLI) Corpus.

Last But Not Least

PyTorch-NLP has a couple more NLP focused utility packages to support you! 🤗

Deterministic Functions

Now you've setup your pipeline, you may want to ensure that some functions run deterministically. Wrap any code that's random, with fork_rng and you'll be good to go, like so:

import random
import numpy
import torch

from torchnlp.random import fork_rng

with fork_rng(seed=123):  # Ensure determinism
    print('Random:', random.randint(1, 2**31))
    print('Numpy:', numpy.random.randint(1, 2**31))
    print('Torch:', int(torch.randint(1, 2**31, (1,))))

This will always print:

Random: 224899943
Numpy: 843828735
Torch: 843828736

Pre-Trained Word Vectors

Now that you've computed your vocabulary, you may want to make use of pre-trained word vectors to set your embeddings, like so:

import torch
from torchnlp.encoders.text import WhitespaceEncoder
from torchnlp.word_to_vector import GloVe

encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

vocab_set = set(encoder.vocab)
pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab_set)
embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
for i, token in enumerate(encoder.vocab):
    embedding_weights[i] = pretrained_embedding[token]

Neural Networks Layers

For example, from the neural network package, apply the state-of-the-art LockedDropout:

import torch
from torchnlp.nn import LockedDropout

input_ = torch.randn(6, 3, 10)
dropout = LockedDropout(0.5)

# Apply a LockedDropout to `input_`
dropout(input_) # RETURNS: torch.FloatTensor (6x3x10)

Metrics

Compute common NLP metrics such as the BLEU score.

from torchnlp.metrics import get_moses_multi_bleu

hypotheses = ["The brown fox jumps over the dog 笑"]
references = ["The quick brown fox jumps over the lazy dog 笑"]

# Compute BLEU score with the official BLEU perl script
get_moses_multi_bleu(hypotheses, references, lowercase=True)  # RETURNS: 47.9

Help

Maybe looking at longer examples may help you at examples/.

Need more help? We are happy to answer your questions via Gitter Chat

Contributing

We've released PyTorch-NLP because we found a lack of basic toolkits for NLP in PyTorch. We hope that other organizations can benefit from the project. We are thankful for any contributions from the community.

Contributing Guide

Read our contributing guide to learn about our development process, how to propose bugfixes and improvements, and how to build and test your changes to PyTorch-NLP.

Related Work

torchtext

torchtext and PyTorch-NLP differ in the architecture and feature set; otherwise, they are similar. torchtext and PyTorch-NLP provide pre-trained word vectors, datasets, iterators and text encoders. PyTorch-NLP also provides neural network modules and metrics. From an architecture standpoint, torchtext is object orientated with external coupling while PyTorch-NLP is object orientated with low coupling.

AllenNLP

AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to be a lightweight toolkit.

Authors

Citing

If you find PyTorch-NLP useful for an academic publication, then please use the following BibTeX to cite it:

@misc{pytorch-nlp,
  author = {Petrochuk, Michael},
  title = {PyTorch-NLP: Rapid Prototyping with PyTorch Natural Language Processing (NLP) Tools},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/PetrochukM/PyTorch-NLP}},
}
Comments
  • Aligned FastText embeddings

    Aligned FastText embeddings

    Adds a boolean aligned option to the FastText object's constructor.

    If set to True, the FastText embeddings will be initialized with the aligned MUSE embeddings (see details here).

    If not specified or set to False, the regular FastText embeddings are used. This way, the PR does not break any code written before the PR.

    Example usage:

    >>> from torchnlp.word_to_vector import FastText
    >>> from scipy.spatial.distance import euclidean as dist
    
    # Load aligned FastText embeddings for English and French
    en_vectors = FastText(aligned=True)
    fr_vectors = FastText(language='fr', aligned=True)
    
    # Compare the euclidean distances of semantically related vs unrelated words
    >>> dist(en_vectors['car'], fr_vectors['voiture'])
    0.61194908618927
    >>> dist(en_vectors['car'], fr_vectors['baguette'])
    1.2417925596237183
    
    opened by floscha 6
  • Special tokens should be properly encoded by text_encoders

    Special tokens should be properly encoded by text_encoders

    Expected Behavior

    encoder = MosesEncoder( ["<s> hello This ain't funny. </s>", "<s> Don't? </s>"]) print (encoder.encode("<s> hello </s>"))

    --CONSOLE--- tensor([3, 5, 2])

    Actual Behavior

    encoder = MosesEncoder( ["<s> hello This ain't funny. </s>", "<s> Don't? </s>"]) print (encoder.encode("<s> hello </s>"))

    --CONSOLE--- tensor([ 5, 6, 7, 8, 5, 14, 6, 7])

    Explanation

    Most if this tokenizers are not aware of this special tokens and end up splitting the special token into different tokens. For instance the '<s>' token becames '<', 's', '>'.

    My solution to this problem was to create a method for masking special tokens and another one to restore them in place.

       def _mask_reserved_tokens(self, sequence):
            reserved_tokens = re.findall(r'\<pad\>|\<unk\>|\</s\>|\<s\>|\<copy\>', sequence)
            sequence = re.sub(r'\<pad\>|\<unk\>|\</s\>|\<s\>|\<copy\>', "RESERVEDTOKENMASK", sequence)
            return reserved_tokens, sequence
    
        def _restore_reserved_tokens(self, reserved_tokens, sequence):
            sequence = _detokenize(sequence)
            for token in reserved_tokens:
                sequence = sequence.replace('RESERVEDTOKENMASK', token, 1)
            return _tokenize(sequence)
    

    Then the encode function becames:

    def encode(self, sequence):
            """ Encodes a ``sequence``.
            Args:
                sequence (str): String ``sequence`` to encode.
            Returns:
                torch.Tensor: Encoding of the ``sequence``.
            """
            sequence = super().encode(sequence)
            reserved_tokens, sequence = self._mask_reserved_tokens(sequence)
            sequence = self.tokenize(sequence)
            sequence = self._restore_reserved_tokens(reserved_tokens, sequence)
            vector = [self.stoi.get(token, self.unknown_index) for token in sequence]
            if self.append_eos:
                vector.append(self.eos_index)
            return torch.tensor(vector)
    

    I dont know if this is just a problem that I have but If not I believe that this should be handled natively.

    enhancement help wanted good first issue 
    opened by ricardorei 5
  • Add GLUE dataset (but still one issue with QQP and SNLI, see comment)

    Add GLUE dataset (but still one issue with QQP and SNLI, see comment)

    There is one remaining issue that lies with the dataset itself. The code works for all the datasets except for SNLI and QQP where there are some lines in the data file that contain too much or too little data fields. See my comment in the issue discussion for more explanations. I think that someone that is used to those datasets should be able to know what to do.

    opened by PattynR 5
  • remove lambdas for pickle

    remove lambdas for pickle

    lambda cannot be pickled, therefore it is better to not use it as attribute. Even though there are alternatives to pickle, some libraries use pickle internally, which is why it's better to support it.

    Tests were added to all samplers and encoders for whether objects can be pickled.

    opened by benjamin-work 5
  • RuntimeError: Vector for token darang has 230 dimensions, but previously read vectors have 300 dimensions. All vectors must have the same number of dimensions.

    RuntimeError: Vector for token darang has 230 dimensions, but previously read vectors have 300 dimensions. All vectors must have the same number of dimensions.

    Expected Behavior

    Load FastText vectors

    Environment: Ubuntu 16.04 Python 3.6.4 Pytorch 0.4.1

    Actual Behavior

    Throws the following error:

    File "", line 1, in File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/fast_text.py", line 83, in init super(FastText, self).init(name, url=url, **kwargs) File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 72, in init self.cache(name, cache, url=url) File "/home/zxi/.local/lib/python3.6/site-packages/torchnlp/word_to_vector/pretrained_word_vectors.py", line 153, in cache word, len(entries), dim)) RuntimeError: Vector for token darang has 230 dimensions, but previously read vectors have 300 dimensions. All vectors must have the same number of dimensions.

    Steps to Reproduce the Problem

    1. Open python console
    2. Write the following code:
          from torchnlp.word_to_vector import FastText
          vectors = FastText()
      
      
    3. Throws the error mentioned above.
    opened by aurooj 4
  • Apply SRU layer:  'Variable' object has no attribute 'new_zeros'

    Apply SRU layer: 'Variable' object has no attribute 'new_zeros'

    Unable to recreate the SRU example code provided in readme. Getting a Variable object has no attribute 'new_zeros'. Running pytorch .3.0.

    Expected Behavior

    RETURNS: ( output [torch.FloatTensor (6x3x20)], hidden_state [torch.FloatTensor (2x3x20)] )

    Actual Behavior

    `--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) in () ----> 1 sru(input_)

    /anaconda3/envs/nlp/lib/python3.6/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs) 355 result = self._slow_forward(*input, **kwargs) 356 else: --> 357 result = self.forward(*input, **kwargs) 358 for hook in self._forward_hooks.values(): 359 hook_result = hook(self, input, result)

    /anaconda3/envs/nlp/lib/python3.6/site-packages/torchnlp/nn/sru.py in forward(self, input_, c0) 509 dir_ = 2 if self.bidirectional else 1 510 if c0 is None: --> 511 zeros = input_.new_zeros(input_.size(1), self.hidden_size * dir_) 512 c0 = [zeros for i in range(self.num_layers)] 513 else:

    /anaconda3/envs/nlp/lib/python3.6/site-packages/torch/autograd/variable.py in getattr(self, name) 65 if name in self._fallthrough_methods: 66 return getattr(self.data, name) ---> 67 return object.getattribute(self, name) 68 69 def getitem(self, key):

    AttributeError: 'Variable' object has no attribute 'new_zeros' `

    Steps to Reproduce the Problem

    `from torchnlp.nn import SRU import torch

    input_ = torch.autograd.Variable(torch.randn(6, 3, 10)) sru = SRU(10, 20)

    sru(input_) `

    opened by dhairyadalal 4
  • Consider not using lambdas as default arguments to enable pickling

    Consider not using lambdas as default arguments to enable pickling

    Problem description

    Some classes have lambdas as default arguments (e.g. here). This prevents these objects from being pickled.

    Steps to Reproduce the Problem

    import pickle
    from torchnlp.text_encoders import StaticTokenizerEncoder
    
    encoder = StaticTokenizerEncoder(['hi', 'you'])
    pickle.dumps(encoder) 
    
    # raises error
    PicklingError: Can't pickle <function StaticTokenizerEncoder.<lambda> at 0x7fb77c042b70>: attribute lookup StaticTokenizerEncoder.<lambda> on torchnlp.text_encoders.static_tokenizer_encoder failed
    

    Solution

    Replace lambdas by actual functions.

    opened by benjamin-work 4
  • Inefficient embedding loading code in README.md

    Inefficient embedding loading code in README.md

    In the readme where loading the glove embeddings is demonstrated, the lambda function is extremely inefficient. The encoder vocab is converted in to a set every single time the function is called. Instead of:

    import torch
    from torchnlp.encoders.text import WhitespaceEncoder
    from torchnlp.word_to_vector import GloVe
    
    encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])
    
    pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in set(encoder.vocab))
    embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
    for i, token in enumerate(encoder.vocab):
        embedding_weights[i] = pretrained_embedding[token]
    

    it should be:

    import torch
    from torchnlp.encoders.text import WhitespaceEncoder
    from torchnlp.word_to_vector import GloVe
    
    encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])
    
    vocab_set = set(encoder.vocab)
    
    pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab_set)
    embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
    for i, token in enumerate(encoder.vocab):
        embedding_weights[i] = pretrained_embedding[token]
    

    For my personal code (where the vocab was much much larger than the example) this changed the time taken to load the embeddings from 2 hours to ~10 seconds.

    opened by JamieCT 3
  • Contains method for word vectors

    Contains method for word vectors

    This PR implements the common __contains__() method for the _PretrainedWordVectors class. This way, the Python in keyword can be used to easily find out if a vector representation of a token exists.

    Example usage:

    >>> from torchnlp.word_to_vector import FastText
    >>> vectors = FastText()
    
    >>> 'the' in vectors
    True
    >>> 'theqwe' in vectors
    False
    
    opened by floscha 3
  • Fix: BucketBatchSampler's lambda in snli example

    Fix: BucketBatchSampler's lambda in snli example

    • The BucketBatchSampler's sort_key lambda function in examples/snli/train.py used the row variable, which was loaded due to the for loop a few dozens line above. This also meant that the sort_key lambda function did not have too much of an effect, as it always returned the same number.

    • The same issue can be seen in case of the dev dataset.

    • This commit fixes the situation by making use of the passed index and the loaded train and dev datasets.

    opened by mrshu 2
  • Replaced .view with .reshape on Attention

    Replaced .view with .reshape on Attention

    Replaced .view with .reshape in Attention module in order to avoid RuntimeError: "view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.".

    opened by jmribeiro 2
  • docs: fix simple typo, experessed -> expressed

    docs: fix simple typo, experessed -> expressed

    There is a small typo in torchnlp/encoders/text/subword_text_tokenizer.py.

    Should read expressed rather than experessed.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 0
  • PackagesNotFoundError in anaconda

    PackagesNotFoundError in anaconda

    I wanted to install this package in anaconda with "conda install torchnlp", but it came out with the "PackagesNotFoundError" notion, how can I install it in anaconda?

    opened by SihanLiuEcho 0
  • Error in SpacyEncoder when language argument is passed

    Error in SpacyEncoder when language argument is passed

    Expected Behavior

    from torchnlp.encoders.text import SpacyEncoder encoder = SpacyEncoder(["This ain't funny.", "Don't?"], language='en

    Actual Behavior

    TypeError: init() got an unexpected keyword argument 'language'

    opened by enaserianhanzaei 0
  • wmt_dataset download failed

    wmt_dataset download failed

    Expected Behavior

    • I tried to follow example of pytorch nlp documentation with wmt14 dataset. (https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html)
    • download wmt dataset successfully

    Actual Behavior

    • wmt_dataset [DOWNLOAD_FAILED] occurs.

    Steps to Reproduce the Problem

    1. install pytorch-nlp 0.5.0
    2. from torchnlp.datasets import wmt_dataset
    3. train=wmt_dataset(train=True)
    >>> train = wmt_dataset(train=True)
    tar: Error opening archive: Unrecognized archive format
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.9/site-packages/torchnlp/datasets/wmt.py", line 63, in wmt_dataset
        download_file_maybe_extract(
      File "/usr/local/lib/python3.9/site-packages/torchnlp/download.py", line 170, in download_file_maybe_extract
        raise ValueError('[DOWNLOAD FAILED] `*check_files` not found')
    ValueError: [DOWNLOAD FAILED] `*check_files` not found
    
    opened by chloejiwon 2
  • Gating for inputs

    Gating for inputs

    Hello,

    I was hoping to get some pointers related to the below query. I want to apply gating to my inputs using attention such that only important ones go forward in the network. While searching for inbuilt attention libraries in PyTorch, I came across this. But I having problems understanding what the inputs are (i.e context and query). Would really appreciate any pointers.

    Thanks, Neha

    opened by NehaMotlani 0
Releases(0.5.0)
  • 0.5.0(Nov 4, 2019)

    Major Updates

    • Updated my README emoji game to be more ambiguous while maintaining fun and heartwarming vibe. 🐕
    • Support for Python 3.5
    • Extensive rewrite of README to focus on new users and building an NLP pipeline.
    • Support for Pytorch 1.2
    • Added torchnlp.random for finer grain control of random state building on PyTorch's fork_rng. This module controls the random state of torch, numpy and random.
    import random
    import numpy
    import torch
    
    from torchnlp.random import fork_rng
    
    with fork_rng(seed=123):  # Ensure determinism
        print('Random:', random.randint(1, 2**31))
        print('Numpy:', numpy.random.randint(1, 2**31))
        print('Torch:', int(torch.randint(1, 2**31, (1,))))
    
    • Refactored torchnlp.samplers enabling pipelining. For example:
    from torchnlp.samplers import DeterministicSampler
    from torchnlp.samplers import BalancedSampler
    
    data = ['a', 'b', 'c'] + ['c'] * 100
    sampler = BalancedSampler(data, num_samples=3)
    sampler = DeterministicSampler(sampler, random_seed=12)
    print([data[i] for i in sampler])  # ['c', 'b', 'a']
    
    • Added torchnlp.samplers.balanced_sampler for balanced sampling extending Pytorch's WeightedRandomSampler.
    • Added torchnlp.samplers.deterministic_sampler for deterministic sampling based on torchnlp.random.
    • Added torchnlp.samplers.distributed_batch_sampler for distributed batch sampling.
    • Added torchnlp.samplers.oom_batch_sampler to sample large batches first in order to force an out-of-memory error.
    • Added torchnlp.utils.lengths_to_mask to help create masks from a batch of sequences.
    • Added torchnlp.utils.get_total_parameters to measure the number of parameters in a model.
    • Added torchnlp.utils.get_tensors to measure the size of an object in number of tensor elements. This is useful for dynamic batch sizing and for torchnlp.samplers.oom_batch_sampler.
    from torchnlp.utils import get_tensors
    
    random_object_ = tuple([{'t': torch.tensor([1, 2])}, torch.tensor([2, 3])])
    tensors = get_tensors(random_object_)
    assert len(tensors) == 2
    
    • Added a corporate sponsor to the library: https://wellsaidlabs.com/

    Minor Updates

    • Fixed snli example (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
    • Updated .gitignore to support Python's virtual environments (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
    • Removed requests and pandas dependency. There are only two dependencies remaining. This is useful for production environments. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
    • Added LazyLoader to reduce dependency requirements. (https://github.com/PetrochukM/PyTorch-NLP/commit/4e84780a8a741d6a90f2752edc4502ab2cf89ecb)
    • Removed unused torchnlp.datasets.Dataset class in favor of basic Python dictionary lists and pandas. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
    • Support for downloading tar.gz files and unpacking them faster. (https://github.com/PetrochukM/PyTorch-NLP/commit/eb61fee854576c8a57fd9a20ee03b6fcb89c493a)
    • Rename itos and stoi to index_to_token and token_to_index respectively. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
    • Fixed batch_encode, batch_decode, and enforce_reversible for torchnlp.encoders.text (https://github.com/PetrochukM/PyTorch-NLP/pull/69)
    • Fix FastText vector downloads (https://github.com/PetrochukM/PyTorch-NLP/pull/72)
    • Fixed documentation for LockedDropout (https://github.com/PetrochukM/PyTorch-NLP/pull/73)
    • Fixed bug in weight_drop (https://github.com/PetrochukM/PyTorch-NLP/pull/76)
    • stack_and_pad_tensors now returns a named tuple for readability (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
    • Added torchnlp.utils.split_list in favor of torchnlp.utils.resplit_datasets. This is enabled by the modularity of torchnlp.random. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
    • Deprecated torchnlp.utils.datasets_iterator in favor of Pythons itertools.chain. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
    • Deprecated torchnlp.utils.shuffle in favor of torchnlp.random. (https://github.com/PetrochukM/PyTorch-NLP/pull/84)
    • Support encoding larger datasets following fixing this issue (https://github.com/PetrochukM/PyTorch-NLP/issues/85).
    • Added torchnlp.samplers.repeat_sampler following up on this issue: https://github.com/pytorch/pytorch/issues/15849
    Source code(tar.gz)
    Source code(zip)
  • 0.4.0(Apr 3, 2019)

    Major updates

    • Rewrote encoders to better support more generic encoders like a LabelEncoder. Furthermore, added broad support for batch_encode, batch_decode and enforce_reversible.
    • Rearchitected default reserved tokens to ensure configurability while still providing the convenience of good defaults.
    • Added support to collate sequences with torch.utils.data.dataloader.DataLoader. For example:
    from functools import partial
    from torchnlp.utils import collate_tensors
    from torchnlp.encoders.text import stack_and_pad_tensors
    
    collate_fn = partial(collate_tensors, stack_tensors=stack_and_pad_tensors)
    torch.utils.data.dataloader.DataLoader(*args, collate_fn=collate_fn, **kwargs)
    
    • Added doctest support ensuring the documented examples are tested.
    • Removed SRU support, it's too heavy of a module to support. Please use https://github.com/taolei87/sru instead. Happy to accept a PR with a better tested and documented SRU module!
    • Update version requirements to support Python 3.6 and 3.7, dropping support for Python 3.5.
    • Updated version requirements to support PyTorch 1.0+.
    • Merged https://github.com/PetrochukM/PyTorch-NLP/pull/66 reducing the memory requirements for pre-trained word vectors by 2x.

    Minor Updates

    • Formatted the code base with YAPF.
    • Fixed pandas and collections warnings.
    • Added invariant assertion to Encoder via enforce_reversible. For example:
      encoder = Encoder().enforce_reversible()
      

      Ensuring Encoder.decode(Encoder.encode(object)) == object

    • Fixed the accuracy metric for PyTorch 1.0.
    Source code(tar.gz)
    Source code(zip)
  • 0.3.7.post1(Dec 9, 2018)

  • 0.3.0(May 6, 2018)

    Release 0.3.0

    Major Features And Improvements

    • Upgraded to PyTorch 0.4.0
    • Added Byte-Pair Encoding (BPE) pre-trained subword embeddings in 275 languages
    • Refactored download scripts to torchnlp.downloads
    • Enable Spacy encoder to run in multiple languages.
    • Added a boolean aligned option to FastText supporting MUSE (Multilingual Unsupervised and Supervised Embeddings)

    Bug Fixes and Other Changes

    • Create non-existent cache dirs for torchnlp.word_to_vector.
    • Add set operation to torchnlp.datasets.Dataset with support for slices, columns and rows
    • Updated biggest_batches_first in torchnlp.samplers to be more efficient at approximating memory then Pickle
    • Enabled torch.utils.pad_tensor and torch.utils. pad_batch to support N dimensional tensors
    • Updated to sacremoses to fix NLTK moses dependancy for torch.text_encoders
    • Added __getitem()__ for _PretrainedWordVectors. For example:
    from torchnlp.word_to_vector import FastText
    vectors = FastText()
    tokenized_sentence = ['this', 'is', 'a', 'sentence']
    vectors[tokenized_sentence]
    
    • Added __contains__ for _PretrainedWordVectors. For example:
    >>> from torchnlp.word_to_vector import FastText
    >>> vectors = FastText()
    
    >>> 'the' in vectors
    True
    >>> 'theqwe' in vectors
    False
    
    Source code(tar.gz)
    Source code(zip)
Owner
Michael Petrochuk
World Record Holder • Deep Learning (DL) Engineer & Researcher • CTO @ https://wellsaidlabs.com
Michael Petrochuk
Code for "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022.

README Code for Two-stage Identifier: "Parallel Instance Query Network for Named Entity Recognition", accepted at ACL 2022. For details of the model a

Yongliang Shen 45 Nov 29, 2022
Python-zhuyin - An open source Python library that provides a unified interface for converting between Chinese pinyin and Zhuyin (bopomofo)

Python-zhuyin - An open source Python library that provides a unified interface for converting between Chinese pinyin and Zhuyin (bopomofo)

2 Dec 29, 2022
A Plover python dictionary allowing for consistent symbol input with specification of attachment and capitalisation in one stroke.

Emily's Symbol Dictionary Design This dictionary was created with the following goals in mind: Have a consistent method to type (pretty much) every sy

Emily 68 Jan 07, 2023
CATs: Semantic Correspondence with Transformers

CATs: Semantic Correspondence with Transformers For more information, check out the paper on [arXiv]. Training with different backbones and evaluation

74 Dec 10, 2021
It analyze the sentiment of the user, whether it is postive or negative.

Sentiment-Analyzer-Tool It analyze the sentiment of the user, whether it is postive or negative. It uses streamlit library for creating this sentiment

Paras Patidar 18 Dec 17, 2022
2021搜狐校园文本匹配算法大赛baseline

sohu2021-baseline 2021搜狐校园文本匹配算法大赛baseline 简介 分享了一个搜狐文本匹配的baseline,主要是通过条件LayerNorm来增加模型的多样性,以实现同一模型处理不同类型的数据、形成不同输出的目的。 线下验证集F1约0.74,线上测试集F1约0.73。

苏剑林(Jianlin Su) 45 Sep 06, 2022
SentimentArcs: a large ensemble of dozens of sentiment analysis models to analyze emotion in text over time

SentimentArcs - Emotion in Text An end-to-end pipeline based on Jupyter notebooks to detect, extract, process and anlayze emotion over time in text. E

jon_chun 14 Dec 19, 2022
Athena is an open-source implementation of end-to-end speech processing engine.

Athena is an open-source implementation of end-to-end speech processing engine. Our vision is to empower both industrial application and academic research on end-to-end models for speech processing.

Ke Technologies 34 Sep 08, 2022
Implementation of Fast Transformer in Pytorch

Fast Transformer - Pytorch Implementation of Fast Transformer in Pytorch. This only work as an encoder. Yannic video AI Epiphany Install $ pip install

Phil Wang 167 Dec 27, 2022
Estimation of the CEFR complexity score of a given word, sentence or text.

NLP-Swedish … allows to estimate CEFR (Common European Framework of References) complexity score of a given word, sentence or text. CEFR scores come f

3 Apr 30, 2022
profile tools for pytorch nn models

nnprof Introduction nnprof is a profile tool for pytorch neural networks. Features multi profile mode: nnprof support 4 profile mode: Layer level, Ope

Feng Wang 42 Jul 09, 2022
A Python script which randomly chooses and prints a file from a directory.

___ ____ ____ _ __ ___ / _ \ | _ \ | _ \ ___ _ __ | '__| / _ \ | |_| || | | || | | | / _ \| '__| | | | __/ | _ || |_| || |_| || __

yesmaybenookay 0 Aug 06, 2021
HuggingTweets - Train a model to generate tweets

HuggingTweets - Train a model to generate tweets Create in 5 minutes a tweet generator based on your favorite Tweeter Make my own model with the demo

Boris Dayma 318 Jan 04, 2023
A simple recipe for training and inferencing Transformer architecture for Multi-Task Learning on custom datasets. You can find two approaches for achieving this in this repo.

multitask-learning-transformers A simple recipe for training and inferencing Transformer architecture for Multi-Task Learning on custom datasets. You

Shahrukh Khan 48 Jan 02, 2023
State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

John Snow Labs 3k Jan 05, 2023
Nested Named Entity Recognition for Chinese Biomedical Text

CBio-NAMER CBioNAMER (Nested nAMed Entity Recognition for Chinese Biomedical Text) is our method used in CBLUE (Chinese Biomedical Language Understand

8 Dec 25, 2022
This Project is based on NLTK It generates a RANDOM WORD from a predefined list of words, From that random word it read out the word, its meaning with parts of speech , its antonyms, its synonyms

This Project is based on NLTK(Natural Language Toolkit) It generates a RANDOM WORD from a predefined list of words, From that random word it read out the word, its meaning with parts of speech , its

SaiVenkatDhulipudi 2 Nov 17, 2021
Klexikon: A German Dataset for Joint Summarization and Simplification

Klexikon: A German Dataset for Joint Summarization and Simplification Dennis Aumiller and Michael Gertz Heidelberg University Under submission at LREC

Dennis Aumiller 8 Jan 03, 2023
Retraining OpenAI's GPT-2 on Discord Chats

Train OpenAI's GPT-2 on Discord Chats Retraining a Text Generation Model on Discord Chats using gpt-2-simple that wraps existing model fine-tuning and

Ayush Mishra 4 Oct 27, 2022
State-of-the-art NLP through transformer models in a modular design and consistent APIs.

Trapper (Transformers wRAPPER) Trapper is an NLP library that aims to make it easier to train transformer based models on downstream tasks. It wraps h

Open Business Software Solutions 42 Sep 21, 2022