fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.

Overview

fastNLP


fastNLP is a lightweight natural language processing (NLP) toolkit whose goal is to make it fast to implement NLP tasks and to build complex models.

fastNLP has the following features:

  • A unified tabular data container that simplifies data preprocessing (see the DataSet sketch right after this list);
  • Built-in Loaders and Pipes for many datasets, removing the need for preprocessing code;
  • Handy NLP utilities, e.g. embedding loading (including ELMo and BERT) and caching of intermediate data;
  • Automatic downloading of some datasets and pre-trained models;
  • A rich set of neural network components and reproduced models (covering Chinese word segmentation, named entity recognition, syntactic parsing, text classification, text matching, coreference resolution, summarization, and more);
  • A Trainer with many built-in Callbacks for convenient experiment logging, exception catching, and so on.
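The tabular container is the DataSet class; a minimal sketch of its use, based on the documented fastNLP API (the field names here are just illustrative):

from fastNLP import DataSet

# each column is a field and each row an instance; apply() derives new columns
ds = DataSet({'raw_words': ["This is fast .", "fastNLP is simple ."]})
ds.apply(lambda ins: ins['raw_words'].split(), new_field_name='words')
ds.apply(lambda ins: len(ins['words']), new_field_name='seq_len')
print(ds)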

Installation Guide

fastNLP depends on the following packages:

  • numpy>=1.14.2
  • torch>=1.0.0
  • tqdm>=4.28.1
  • nltk>=3.4.1
  • requests
  • spacy
  • prettytable>=0.7.2

The installation of torch may depend on your operating system and CUDA version; see the PyTorch website. Once the dependencies are installed, run the following commands to complete the installation:

pip install fastNLP
python -m spacy download en

fastNLP Tutorials

Documentation and tutorials (in Chinese)

Quick Start

Detailed Tutorials

Extension Tutorials

Built-in Components

Most neural networks used for NLP tasks can be viewed as consisting of word embeddings plus two kinds of modules: encoders and decoders.

Taking text classification as an example, the figure below shows the model pipeline of a text classifier implemented with BiLSTM+Attention:

In its embeddings module, fastNLP ships several kinds of embeddings: static embeddings (GloVe, word2vec), contextual embeddings (ELMo, BERT), and character embeddings (CNN- or LSTM-based CharEmbedding).
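A minimal sketch of loading these embeddings, following the documented fastNLP API (the model-name aliases used here are the commonly documented ones and may differ across versions; supported names are downloaded automatically):

from fastNLP import Vocabulary
from fastNLP.embeddings import StaticEmbedding, BertEmbedding

vocab = Vocabulary()
vocab.add_word_lst("this is an example sentence".split())

# static GloVe vectors
glove_embed = StaticEmbedding(vocab, model_dir_or_name='en-glove-6b-50d')
# contextual BERT vectors
bert_embed = BertEmbedding(vocab, model_dir_or_name='en-base-uncased')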

At the same time, fastNLP's modules module provides many components of these two kinds, which help users quickly assemble the networks they need. The functions and common components of the two kinds of modules are as follows:

Type     Function                                                      Examples
encoder  encodes the input into a vector with representational power   Embedding, RNN, CNN, Transformer, ...
decoder  decodes a representation vector into the desired output form  MLP, CRF, ...
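As an illustration of wiring an encoder to a decoder, here is a toy classifier. This is only a sketch: the BiLSTMClassifier class is hypothetical, and it assumes fastNLP's LSTM takes the input size as its first argument and MLP takes a list of layer sizes, as in the documented components:

import torch
from fastNLP.modules import LSTM, MLP

class BiLSTMClassifier(torch.nn.Module):
    def __init__(self, embed, embed_dim, num_classes):
        super().__init__()
        self.embed = embed                              # any fastNLP embedding
        self.encoder = LSTM(embed_dim, hidden_size=64, bidirectional=True)
        self.decoder = MLP([2 * 64, num_classes])

    def forward(self, words):
        x = self.embed(words)        # (batch, seq_len, embed_dim)
        x, _ = self.encoder(x)       # (batch, seq_len, 2 * hidden)
        x = x.max(dim=1)[0]          # max-pool over the sequence
        return {'pred': self.decoder(x)}   # fastNLP models return dicts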

Project Structure

fastNLP's overall workflow is shown in the figure above; the project is structured as follows:

fastNLP            the open-source NLP library
fastNLP.core       core functionality: data-handling components, Trainer, Tester, etc.
fastNLP.models     complete neural network models
fastNLP.modules    components for building neural network models
fastNLP.embeddings turns index sequences into vector sequences, including loading pre-trained embeddings
fastNLP.io         I/O: data loading and preprocessing, model serialization, automatic download of data and models

In memory of @FengZiYjun. May his soul rest in peace. We will miss you very very much!

Issues
  • When will the complete star-transformer code be released? The experiments cannot be reproduced at all; results on SST-5 are 6 points off

    To Reproduce: using your star-transformer code, trained with allennlp (GloVe 42B word vectors); the final result (see the attached figure) is 6 points below the number reported in the paper.

    Please explain, and release the complete version of the code, i.e. one that can fully reproduce the results.

    opened by michael-wzhu 10
  • RuntimeError: CUDA error: device-side assert triggered

    Describe the bug: When a trained model is loaded with Predictor, prediction fails with the error in the first screenshot. I have fixed this bug; see the project link below for details. Cause: debugging shows the bug is triggered when the data to be predicted contains characters never seen during training. bert_embedding.py reads the training-time Vocab size and initializes a vector of ones with that size as the mask for masked prediction, so the vector's size is smaller than the actual one, which equals the training-time Vocab size plus the number of new characters. The error is shown in screenshot 1; the bug's location and the fix are in screenshot 2.
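    For readers unfamiliar with this failure mode, a generic PyTorch sketch (not fastNLP code): an index beyond the size an embedding was built with raises IndexError on CPU, but surfaces as "CUDA error: device-side assert triggered" on GPU.

    import torch
    import torch.nn as nn

    emb = nn.Embedding(100, 16)        # sized to the training-time vocab
    ids = torch.tensor([[1, 5, 120]])  # 120 = a character unseen in training
    emb(ids)  # IndexError on CPU; device-side assert on CUDA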

    To Reproduce:
    1. Move test.txt, dev.txt and train.txt into a data directory (a directory you create yourself).
    2. Run the fastNLP_trainer.py script.
    3. Run the fastNLP_predictor.py script.
    4. See error.

    项目链接:https://github.com/Chris-cbc/fastNLP_Bug_Report_And_Fix.git

    Expected behavior: the screenshot above shows the result after the fix.

    Desktop

    • OS: Windows 10
    • Python Version: 3.6

    Additional context: once the maintainers have confirmed this, please send an email and @ my GitHub account, so that I know how the bug was ultimately fixed.

    opened by Chris-cbc 9
  • a new function for argparse

    we should provide a function for argument parsing so that we can support "python fastnlp.py --arg1 value1 --arg2 value2" and so on.

    in that case, what arguments should we have?

    enhancement 
    opened by xuyige 8
  • Import fails after installing fastNLP

    Installing fastNLP under Python 3.5 appears to succeed, but import fastNLP fails with: File "D:\anaconda\lib\site-packages\fastNLP\core\instance.py", line 40 f" type={(str(type(self.fields[field_name]))).split(s)[1]}" for field_name in self.fields) + "}" ^ SyntaxError: invalid syntax. The same happens with Python 3.6 and Python 3.7: installation completes, but the import errors out.

    opened by lovelyvivi 8
  • Default value for train args.

    https://github.com/fastnlp/fastNLP/blob/8a87807274735046a48be8eb4b1ca10801875039/fastNLP/core/trainer.py#L42-L45

    Should we set default values for train_args? Otherwise we have to pass all these args every time, which is very redundant.

    opened by keezen 7
  • Why does BertEmbedding require a vocab argument?

    Doesn't BERT come with its own vocabulary? Can that vocabulary be loaded and used directly? And if the vocabulary is modified, don't BERT's pre-trained weights largely lose their meaning?

    opened by onebula 7
  • Error in the basic Trainer usage example

    While studying the Trainer section, I ran the code at the very beginning of that chapter, but the original example code raises an error:

    TypeError: can't convert np.ndarray of type numpy.int32. The only supported types are: float64, float32, float16, int64, int32, int16, int8, uint8, and bool.
    

    I tried using torch to generate the tensors directly in the data-generation step:

    import torch
    from fastNLP import DataSet

    def generate_psedo_dataset(num_samples):
        data = torch.randint(2, size=(num_samples, 10))
        # the parity of each row's sum serves as the label
        labels = torch.stack([torch.sum(data[n]) % 2 for n in range(num_samples)])
        dataset = DataSet({'x': data, 'label': labels})
        dataset.set_input('x')
        dataset.set_target('label')
        return dataset

    tr_dataset = generate_psedo_dataset(1000)
    dev_dataset = generate_psedo_dataset(100)
    

    But training then fails with the following error:

    TypeError: issubclass() arg 1 must be a class
    

    Did I get the data generation wrong...? How should the example code in the gitbook section be adjusted? torch: 1.2.0+cu92, fastNLP: 0.5.0

    opened by jwc19890114 6
  • Is loading an ELMo model into a sequence labeling task supported now?

    Is loading an ELMo model into a sequence labeling task supported now? If so, is there an example to refer to; if not, is it planned? Thanks!

    opened by Wanjun0511 6
  • [bugfix] Fix the out-of-memory bug caused by the check_code function in Trainer ignoring the pin_memory argument

    Description: fix the out-of-memory bug caused by the check_code function in Trainer ignoring the pin_memory argument.

    Main reason: an out-of-memory error occurred while using fastNLP, in a scenario where the model was being trained on CPU. Debugging showed that in core/trainer.py, the _check_code function does not pass the pin_memory argument when it instantiates the Tester class, and Tester initializes pin_memory to True by default.

    The full traceback:

    THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=2 : out of memory
    Traceback (most recent call last):
      File "/data/ouyhlan/TextClassification/main.py", line 52, in <module>
        trainer = Trainer(train_data=data_bundle.get_dataset('train'), model=model, loss=loss,
      File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/trainer.py", line 558, in __init__
        _check_code(dataset=train_data, model=self.model, losser=losser, forward_func=self._forward_func, metrics=metrics,
      File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/trainer.py", line 1013, in _check_code
        evaluate_results = tester.test()
      File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/tester.py", line 184, in test
        for batch_x, batch_y in data_iterator:
      File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/fastNLP/core/batch.py", line 266, in __iter__
        for indices, batch_x, batch_y in self.dataiter:
      File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
        data = self._next_data()
      File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 477, in _next_data
        data = _utils.pin_memory.pin_memory(data)
      File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in pin_memory
        return [pin_memory(sample) for sample in data]
      File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 55, in <listcomp>
        return [pin_memory(sample) for sample in data]
      File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in pin_memory
        return {k: pin_memory(sample) for k, sample in data.items()}
      File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 51, in <dictcomp>
        return {k: pin_memory(sample) for k, sample in data.items()}
      File "/home/ouyhlan/miniconda3/envs/env1/lib/python3.9/site-packages/torch/utils/data/_utils/pin_memory.py", line 47, in pin_memory
        return data.pin_memory()
    RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278
    

    Setting pin_memory to False makes the problem disappear. Also, following https://github.com/pytorch/pytorch/issues/57273 , I suggest that the Trainer and Tester classes leave pin_memory disabled by default on all torch versions.
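    For context, pin_memory is a standard PyTorch DataLoader option rather than anything fastNLP-specific; a minimal sketch of the setting at issue:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dataset = TensorDataset(torch.randn(8, 4))
    # pin_memory=True copies each batch into page-locked host memory; it only
    # helps when batches are later moved to a GPU, and it consumes pinned
    # memory even in CPU-only training.
    loader = DataLoader(dataset, batch_size=4, pin_memory=False)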

    Checklist

    Please feel free to remove inapplicable items for your PR.

    • [x] The PR title starts with [$CATEGORY] (e.g. [bugfix] for bug fixes, [new] for new features, [test] for test changes, [rm] for removing old code)
    • [x] Changes are complete (i.e. I finished coding on this PR)
    • [x] All changes have test coverage and pass the tests (changes under fastnlp/fastnlp/ must come with test code under fastnlp/test/)
    • [x] Code is well-documented (API docs are extracted from the comments)
    • [x] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change (if an example or tutorial is affected, contact a core developer)

    Changes:

    • Tester and Trainer no longer enable pin_memory by default

    Mention (requesting review): @yhcc

    opened by ouyhlan 6
  • Mask bug in elmo_embedding

    Describe the bug: when device is not cpu, using elmo_embedding fails with: File "fastNLP/fastNLP/embeddings/elmo_embedding.py", line 329, in forward token_embedding = token_embedding.masked_fill(mask, 0) RuntimeError: expected self and mask to be on the same device, but got mask on cpu and self on cuda:0

    To Reproduce: train with Trainer, with device set to something other than cpu and a model that contains elmo_embedding.

    Expected behavior: runs normally.

    Environment: fastNLP 0.6.0 or fastNLP 0.7.0

    opened by Pryest 1
  • Why does the Metric only take effect when dev_data is passed?

    Is the Metric only evaluated on the validation set? I don't quite understand. In the generated fitlog, the SpanFpreRec and SpanFpreRec-data-test curves are also completely identical, which I really cannot make sense of.

    opened by Lishumuzixin 1
  • ClassifyFPreRecMetric with f_type='micro': pre != acc during training?

    Describe the bug: In a binary-classification experiment I use fastNLP.core.metrics.ClassifyFPreRecMetric as the model metric, with f_type='micro'. During training, when the metric is reported on the validation set (see the figure), pre, rec and acc should all be equal, but only rec == acc holds. After training finishes, calling ClassifyFPreRecMetric again does satisfy pre == rec == acc.

    opened by MrYxJ 1
  • Are there plans to support Flair embeddings?

    opened by CookiesEason 2
  • 'NG20Pipe' is written as 'NG20Loader' in fastNLP/io/pipe/__init__.py

    Describe the bug: as the title says.

    opened by 1dAnderson 1
  • On documentation improvements and callback compatibility

    1. DistTrainer and Trainer differ not only in being distributed vs. standalone training; their support for the various built-in callbacks also differs significantly. I suggest documenting which built-in callbacks each trainer supports, especially since some callbacks only trigger their bugs after several epochs have run;

    2. When a save path is given, (Dist)Trainer saves the model but cannot be told to save parameters only, and saving the full model easily triggers pickling errors;

    To Reproduce:

    For example, DistTrainer with SaveModelCallback hits a bug:

      ....
      File "/path_to_env/.conda/envs/pt19/lib/python3.8/site-packages/fastNLP/core/callback.py", line 1089, in on_valid_end
        self._save_this_model(metric_value)
      File "/path_to_env/.conda/envs/pt19/lib/python3.8/site-packages/fastNLP/core/callback.py", line 1112, in _save_this_model
        save_pair, delete_pair = self._insert_into_ordered_save_models((metric_value, name))
      File "//path_to_env/.conda/envs/pt19/lib/python3.8/site-packages/fastNLP/core/callback.py", line 1098, in _insert_into_ordered_save_models
        if not self.trainer.increase_better and _pair[0]<=pair[0]:
    AttributeError: 'DistTrainer' object has no attribute 'increase_better'
    

    Also, are there still plans to update fastNLP? It looks like it has not been updated for quite a while.

    opened by ccyousa 1
  • Comparison with AllenNLP

    Dear fastNLP authors: I have recently been surveying NLP frameworks for our company's upcoming NLP work. fastNLP's code structure is clear and its documentation is fairly complete; thank you for open-sourcing it. After a rough look at the architectures of fastNLP and AllenNLP, I found many similarities in dataset loading and representation and in the overall design; for example, both represent data in a tabular way, and the module split is similar. Could the fastNLP authors help compare the two: what are fastNLP's main advantages over AllenNLP? Thanks!

    opened by rabintang 3
  • About a passage in the distributed-training tutorial

    When nn.DistributedDataParallel is used, the model is replicated onto every GPU in use; typically each GPU holds one copy of the model, controlled by its own process, so N GPUs yield N processes. When a batch is trained, the batch is split into N parts, each process trains on its own part, and the processes synchronize when necessary, sending the data that needs synchronizing over the network.

    I don't think the tutorial's claim that "when a batch is trained, the batch is split into N parts, each process trains on its own part, and the processes synchronize when necessary" is accurate. My own understanding: when PyTorch DDP distributes data across cards, it splits the Dataset into N shards, and each process then draws batches of batch_size from its own shard for training.

    Is this understanding correct?
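    (For reference, this matches how the standard torch.utils.data.DistributedSampler partitions data; a minimal sketch:)

    import torch
    from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

    dataset = TensorDataset(torch.arange(100).float())
    # each DDP process builds a sampler with its own rank; the *dataset*
    # indices are partitioned across processes, and each process then draws
    # full batches of batch_size from its own shard
    sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=True)
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)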

    opened by WuDiDaBinGe 6
Releases (v0.6.0)
Owner
fastNLP
An open-source Chinese NLP project initiated by the Natural Language Processing (NLP) group at Fudan University
All the code I wrote for Overwatch-related projects that I still own the rights to.

overwatch_shit.zip This is (eventually) going to contain all the software I wrote during my five-year imprisonment stay playing Overwatch. I'll be add

zkxjzmswkwl 2 Dec 31, 2021
Grading tools for Advanced NLP (11-711)

Grading tools for Advanced NLP (11-711) Installation You'll need docker and unzip to use this repo. For docker, visit the official guide to get starte

Hao Zhu 1 Oct 3, 2021
Coreference resolution for English, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, msg systems ag 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 German 1.2.3 Polish 1

msg systems ag 132 Feb 8, 2022
Modular and extensible speech recognition library leveraging pytorch-lightning and hydra.

Lightning ASR Modular and extensible speech recognition library leveraging pytorch-lightning and hydra What is Lightning ASR • Installation • Get Star

Soohwan Kim 35 Sep 27, 2021
A high-level yet extensible library for fast language model tuning via automatic prompt search

ruPrompts ruPrompts is a high-level yet extensible library for fast language model tuning via automatic prompt search, featuring integration with Hugg

Sber AI 27 Feb 7, 2022
Kashgari is a production-level NLP Transfer learning framework built on top of tf.keras for text-labeling and text-classification, includes Word2Vec, BERT, and GPT2 Language Embedding.

Kashgari Overview | Performance | Installation | Documentation | Contributing We released the 2.0.0 version with TF2 Support. If you

Eliyar Eziz 2.3k Feb 11, 2022
:mag: End-to-End Framework for building natural language search interfaces to data by utilizing Transformers and the State-of-the-Art of NLP. Supporting DPR, Elasticsearch, HuggingFace’s Modelhub and much more!

Haystack is an end-to-end framework that enables you to build powerful and production-ready pipelines for different search use cases. Whether you want

deepset 1.4k Feb 18, 2021
✨Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

✨A Python framework to explore, label, and monitor data for NLP projects

Recognai 848 Feb 12, 2022
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP

TextAttack Generating adversarial examples for NLP models [TextAttack Documentation on ReadTheDocs] About • Setup • Usage • Design About TextAttack

QData 1.8k Feb 14, 2022
A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. IMPORTANT: (30.08.2020) We moved our models

flair 11.2k Feb 11, 2022
Data loaders and abstractions for text and NLP

torchtext This repository consists of: torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vecto

null 2.9k Feb 12, 2022
nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.


Bernhard Liebl 1 Feb 3, 2022