Longformer: The Long-Document Transformer

Overview

Longformer

Longformer and LongformerEncoderDecoder (LED) are pretrained transformer models for long documents.

***** New December 1st, 2020: LongformerEncoderDecoder *****

A LongformerEncoderDecoder (LED) model is now available. It supports seq2seq tasks with long input. With gradient checkpointing, fp16, and a 48GB GPU, the input length can be up to 16K tokens. Check the updated paper for model details and evaluation.

  • Pretrained models: 1) led-base-16384, 2) led-large-16384

  • Requirements: Make sure to use the huggingface/transformers fork specified in requirements.txt. It adds support for gradient checkpointing and allows different maximum sequence lengths for the input and output. You can also run pip install git+https://github.com/allenai/longformer.git

  • Check the script scripts/summarization.py for an example of how to use the model; a minimal loading sketch is also shown below.
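
For a quick look outside that script, here is a minimal, hedged sketch of loading LED through the LED classes later added to huggingface/transformers; the checkpoint name matches the pretrained models listed above, while the generation settings (beam size, summary length) are illustrative assumptions rather than recommended values.

import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

tokenizer = LEDTokenizer.from_pretrained('allenai/led-base-16384')
model = LEDForConditionalGeneration.from_pretrained('allenai/led-base-16384')

document = ' '.join(['A very long input document. '] * 500)
inputs = tokenizer(document, return_tensors='pt', truncation=True, max_length=16384)

# LED expects global attention on at least the first token
global_attention_mask = torch.zeros_like(inputs['input_ids'])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(inputs['input_ids'],
                             attention_mask=inputs['attention_mask'],
                             global_attention_mask=global_attention_mask,
                             num_beams=4, max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

As noted above, gradient checkpointing and fp16 are what make inputs approaching the full 16K length fit in memory.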

***** New July 23rd, 2020: Speed degradation *****

A significant speed degradation in huggingface/transformers was recently discovered and fixed (check this PR for details). To avoid the problem, either use the older release v2.11.0 (which doesn't support gradient checkpointing) or use the master branch. The fix should be included in the next huggingface/transformers release.

***** New June 29th, 2020: Easier to use Gradient checkpointing *****

Gradient checkpointing has been released with huggingface/transformers release v3.0.0. Gradient checkpointing reduces memory usage by 5x, which makes it possible to process longer sequences on smaller GPUs. To use it, try something like the following:

from transformers import LongformerModel
model = LongformerModel.from_pretrained('allenai/longformer-base-4096', gradient_checkpointing=True)

***** New June 2nd, 2020: Integrating with Huggingface + Train your own long model + Gradient checkpointing *****

  1. Longformer is now integrated in the huggingface/transformers release v2.11.0. Now you can do
from transformers import LongformerModel
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

The release also includes LongformerForQA and other LongformerForTaskName classes with automatic setting of global attention (a brief usage sketch follows this list).

  2. We added a notebook to show how to convert an existing pretrained model into its "long" version.

  3. Gradient checkpointing has been merged into HF master (check PR). Gradient checkpointing can reduce memory usage significantly (5x for longformer-base-4096), allowing longer sequences on smaller GPUs.
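
As a hedged illustration of the task-specific classes mentioned in item 1 above, the sketch below loads the TriviaQA-finetuned checkpoint from the model hub and relies on LongformerForQuestionAnswering to place global attention on the question tokens automatically. The example question/context is made up, and on older transformers releases the model returns tuples rather than output objects.

import torch
from transformers import LongformerTokenizer, LongformerForQuestionAnswering

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-large-4096-finetuned-triviaqa')
model = LongformerForQuestionAnswering.from_pretrained('allenai/longformer-large-4096-finetuned-triviaqa')

question = 'Who wrote Hamlet?'
context = 'Hamlet is a tragedy written by William Shakespeare sometime between 1599 and 1601.'
inputs = tokenizer(question, context, return_tensors='pt')

# No global_attention_mask is passed: the QA class sets global attention
# on the question tokens by itself.
outputs = model(**inputs)
start = torch.argmax(outputs.start_logits)
end = torch.argmax(outputs.end_logits) + 1
print(tokenizer.decode(inputs['input_ids'][0][start:end]))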

***** New April 27th, 2020: A PyTorch implementation of the sliding window attention *****

We added a PyTorch implementation of the sliding window attention that doesn't require the custom CUDA kernel. It is limited in functionality but more convenient to use for finetuning on downstream tasks.

Advantage: supports CPU, TPU and fp16, which aren't supported by the custom CUDA kernel

Limitations: uses 2x more memory (but fp16 offsets that), and doesn’t support dilation and autoregressive attention (not needed for finetuning)

Therefore, it is suitable for finetuning on downstream tasks but not a good choice for language modeling. The code snippet below and the TriviaQA scripts were updated to use this new implementation.

***** End new information *****

How to use

  1. Download pretrained model
  2. Install environment and code

    conda create --name longformer python=3.7
    conda activate longformer
    conda install cudatoolkit=10.0
    pip install git+https://github.com/allenai/longformer.git
  3. Run the model

    import torch
    from longformer.longformer import Longformer, LongformerConfig
    from longformer.sliding_chunks import pad_to_window_size
    from transformers import RobertaTokenizer
    
    config = LongformerConfig.from_pretrained('longformer-base-4096/') 
    # choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
    # 'n2': for regular n2 attention
    # 'tvm': a custom CUDA kernel implementation of our sliding window attention
    # 'sliding_chunks': a PyTorch implementation of our sliding window attention
    config.attention_mode = 'sliding_chunks'
    
    model = Longformer.from_pretrained('longformer-base-4096/', config=config)
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    tokenizer.model_max_length = model.config.max_position_embeddings
    
    SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document
    
    input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
    
    # TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
    # model = model.cuda(); input_ids = input_ids.cuda()
    
    # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
    attention_mask[:, [1, 4, 21,]] =  2  # Set global attention based on the task. For example,
                                         # classification: the <s> token
                                         # QA: question tokens
    
    # padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
    input_ids, attention_mask = pad_to_window_size(
            input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)
    
    output = model(input_ids, attention_mask=attention_mask)[0]

Model pretraining

This notebook demonstrates our procedure for training Longformer starting from the RoBERTa checkpoint. The same procedure can be followed to get a long version of other existing pretrained models.
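
The core of that procedure is extending RoBERTa's learned position embeddings from 512 to 4096 positions and then continuing masked-language-model pretraining on long documents. The sketch below condenses just the position-embedding step, assuming the standard transformers RoBERTa classes; the notebook additionally replaces the self-attention layers with Longformer attention, which is omitted here.

import torch
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

max_pos = 4096 + 2  # RoBERTa reserves two extra positions for the padding offset
model = RobertaForMaskedLM.from_pretrained('roberta-base')
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base', model_max_length=4096)

embeddings = model.roberta.embeddings
old_pos = embeddings.position_embeddings.weight.detach()  # shape (514, 768)
new_pos = old_pos.new_empty(max_pos, old_pos.size(1))

# Initialize the longer table by copying the pretrained 512-position block
# repeatedly, so the new positions start from trained values rather than noise.
new_pos[:2] = old_pos[:2]
k, step = 2, old_pos.size(0) - 2
while k < max_pos:
    n = min(step, max_pos - k)
    new_pos[k:k + n] = old_pos[2:2 + n]
    k += n

embeddings.position_embeddings.weight.data = new_pos
embeddings.position_embeddings.num_embeddings = max_pos
model.config.max_position_embeddings = max_pos
# Depending on the transformers version, a registered `position_ids` buffer may
# also need to be extended to cover the new maximum length.

After this, the model and tokenizer can be saved and pretrained further with the MLM objective, as the notebook does.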

TriviaQA

  • Training scripts: scripts/triviaqa.py
  • Pretrained large model: here (replicates leaderboard results)
  • Instructions: scripts/cheatsheet.txt

CUDA kernel

Our custom CUDA kernel is implemented in TVM. For now, the kernel only works on GPUs and Linux. We tested it on Ubuntu, Python 3.7, CUDA10, PyTorch >= 1.2.0. If it doesn't work for your environment, please create a new issue.

Compiling the kernel: We already include the compiled binaries of the CUDA kernel, so most users won't need to compile it, but if you are interested, check scripts/cheatsheet.txt for instructions.

Known issues

Please check the repo issues for a list of known issues that we are planning to address soon. If your issue is not discussed, please create a new one.

Citing

If you use Longformer in your research, please cite Longformer: The Long-Document Transformer.

@article{Beltagy2020Longformer,
  title={Longformer: The Long-Document Transformer},
  author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
  journal={arXiv:2004.05150},
  year={2020},
}

Longformer is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

Comments
  • ImportError: cannot import name 'nvcc'

    ImportError: cannot import name 'nvcc'

    from tvm.contrib import nvcc ImportError: cannot import name 'nvcc'

    I get this when trying to compile the kernel from scratch. Did I miss something in the cmake config? I can import a lot of TVM modules but not nvcc.

    My cuda version is: Cuda compilation tools, release 10.0, V10.0.130

    opened by safooray 33
  • Text Classifier using longformer

    Text Classifier using longformer

    Can we request a short example of using Longformer for long text/review classification? The current TriviaQA example is good, but more examples will encourage further use of Longformer.

    Thanks. Patrick

    opened by pchankh 14
  • RuntimeError: CUDA error: device-side assert triggered - is_global_attn = is_index_global_attn.flatten().any().item()

    RuntimeError: CUDA error: device-side assert triggered - is_global_attn = is_index_global_attn.flatten().any().item()

    I'm trying to train a new model from scratch where its length is 1024 (using the huggingface implementation of Longformer), but I get the following exception at a recently added line:

    --> 150         is_global_attn = is_index_global_attn.flatten().any().item()
        151 
        152         hidden_states = hidden_states.transpose(0, 1)
    
    RuntimeError: CUDA error: device-side assert triggered
    

    I tried Reformer and it worked as expected. The Longformer config is as follows:

    LongformerConfig {
      "attention_probs_dropout_prob": 0.1,
      "attention_window": 64,
      "bos_token_id": 0,
      "eos_token_id": 2,
      "gradient_checkpointing": false,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 1026,
      "model_type": "longformer",
      "num_attention_heads": 12,
      "num_hidden_layers": 6,
      "pad_token_id": 257,
      "sep_token_id": 258,
      "type_vocab_size": 2,
      "vocab_size": 261
    }
    

    Any idea what the issue is?

    opened by zarandioon 13
  • segmentation fault illegal instruction

    segmentation fault illegal instruction

    setup

    ubuntu 16.04, tvm 0.7.dev1, pytorch 1.4.0, transformers 2.11.0; everything else is the same as requirements.txt

    issue

    I uncommented the line DiagonaledMM._get_function('float32', 'cuda') in diagonaled_mm_tvm.py.

    After that, when I run the code, it shows Loading tvm binary from :./longformer/lib/lib_diagonaled_mm_float32_cuda.so ... followed by either segmentation fault (core dump) or illegal instruction (core dump).

    other

    I tested tvm, tensorflow and pytorch, and they work fine. I also followed scripts/cheatsheet.txt to regenerate lib_diagonaled_mm_float32_cuda.so, and it was generated successfully.

    Any idea or suggestion?

    the code is below

    import torch
    from longformer.longformer import Longformer, LongformerConfig
    from longformer.sliding_chunks import pad_to_window_size
    from transformers import RobertaTokenizer
    
    config = LongformerConfig.from_pretrained('longformer-base-4096/') 
    # choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
    # 'n2': for regular n2 attention
    # 'tvm': a custom CUDA kernel implementation of our sliding window attention
    # 'sliding_chunks': a PyTorch implementation of our sliding window attention
    config.attention_mode = 'tvm'
    
    model = Longformer.from_pretrained('longformer-base-4096/', config=config)
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    tokenizer.model_max_length = model.config.max_position_embeddings
    
    SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document
    
    input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
    
    # TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
    model = model.cuda(); input_ids = input_ids.cuda()
    
    # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
    attention_mask[:, [1, 4, 21,]] =  2  # Set global attention based on the task. For example,
                                         # classification: the <s> token
                                         # QA: question tokens
    
    # padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
    input_ids, attention_mask = pad_to_window_size(
            input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)
    
    output = model(input_ids, attention_mask=attention_mask)[0]
    
    opened by ProfXGiter 13
  • Using RoBERTa or LongFormer for texts with 16K tokens

    Using RoBERTa or LongFormer for texts with 16K tokens

    LongFormer does it by pooling all the local attentions (512) together in global attention (512 x 8 = 4096).

    This is not entirely true. There's no "pooling" of the 4096 tokens into 512. We keep all 4096 tokens. The only change is how attention is computed; instead of every token attending to every other token, we change it such that every token attends to a smaller number of surrounding tokens. This speeds up the self-attention computation (which is the bottleneck) by assuming that the attention scores between certain pairs of words are zero. This doesn't change the architecture or introduce any pooling.

    We are working on some code that will make it easy to train your own long model, so you can try longer sequences. We know it is easy to get to 16K or even 32K with the RoBERTa-base architecture (you need the base model, fp16, and gradient checkpointing). For sequences longer than that, you will need to find ways to save memory depending on your application, for example: reducing the window size, reducing the size of the feed-forward layers, implementing reversible transformers, or using sinusoidal position embeddings instead of learned position embeddings.

    Originally posted by @ibeltagy in https://github.com/allenai/longformer/issues/48#issuecomment-634270401
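
    To make the sliding-window picture above concrete, here is a toy sketch of banded attention in plain PyTorch. It is purely illustrative and is not the repo's TVM kernel or its sliding_chunks code: it materializes the full n x n score matrix and then masks it, whereas the real implementations never build that matrix, which is exactly where the speed and memory savings come from.

    import torch

    seq_len, dim, w = 16, 8, 2                       # window of 2 tokens on each side
    q = torch.randn(seq_len, dim)
    k = torch.randn(seq_len, dim)

    scores = q @ k.t() / dim ** 0.5                  # full n x n scores, for illustration only
    idx = torch.arange(seq_len)
    band = (idx[None, :] - idx[:, None]).abs() <= w  # True inside the local window
    scores = scores.masked_fill(~band, float('-inf'))
    probs = scores.softmax(dim=-1)                   # each row has at most 2w + 1 nonzero entries

    print(probs[0].nonzero().squeeze())              # token 0 attends only to tokens 0, 1, 2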

    opened by vr25 10
  • Not able to use the embedding for calculating similarity.

    Not able to use the embedding for calculating similarity.

    First of all, let me thank you for contributing this knowledge to us. It makes a lot of difference for beginners like me. :) Now the issue: I was trying to use Longformer for calculating the similarity between a query and a list of paragraphs retrieved from my index search. The idea is to re-rank these paragraphs based on the cosine similarity between the embedding of the question and that of each individual paragraph.

    However, once I have calculated the embeddings of both the query and a paragraph using this code: SAMPLE_TEXT = f'{tokenizer.cls_token}{SAMPLE_TEXT}{tokenizer.eos_token}' ... output = model(input_ids, attention_mask=attention_mask)[0]

    I get an embedding of dimension torch.Size([1, 512, 768]), and when I try to calculate the cosine similarity on these embeddings I get this error: RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

    I do see that the error recommends using var.detach().numpy() instead of numpy(): https://stackoverflow.com/questions/55466298/pytorch-cant-call-numpy-on-variable-that-requires-grad-use-var-detach-num

    However, I am unsure where I should add this line of code. I am a beginner, so please pardon me if I have raised an issue unrelated to Longformer.

    Thanks for help :)
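
    A minimal sketch of the step being asked about, reusing the model and tokenizer from the "How to use" snippet above; mean pooling and torch.no_grad() are just one reasonable choice (no_grad() also avoids the requires_grad/.numpy() error), not the repo's prescribed method.

    import torch
    import torch.nn.functional as F
    from longformer.sliding_chunks import pad_to_window_size

    def embed(model, tokenizer, text):
        input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
        attention_mask = torch.ones_like(input_ids)   # local attention everywhere
        attention_mask[:, 0] = 2                      # global attention on <s>
        input_ids, attention_mask = pad_to_window_size(
            input_ids, attention_mask, model.config.attention_window[0], tokenizer.pad_token_id)
        with torch.no_grad():                         # no gradients needed for similarity
            hidden = model(input_ids, attention_mask=attention_mask)[0]  # (1, seq_len, 768)
        return hidden.mean(dim=1)                     # mean-pool over tokens (pad positions included for simplicity)

    # score = F.cosine_similarity(embed(model, tokenizer, query),
    #                             embed(model, tokenizer, paragraph)).item()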

    opened by titu1992 10
  • help in understanding task global attention

    help in understanding task global attention

    Hi,

    I need help in understanding the concept below.

    [image omitted]

    So does this mean that the complexity is quadratic (if all tokens attend to all other tokens) for task tuning but linear otherwise?

    Thanks!

    opened by vr25 9
  • Has anyone reproduced TriviaQA result with pytorch-lightning checkpoint?

    Has anyone reproduced TriviaQA result with pytorch-lightning checkpoint?

    Hi, I'm trying to reproduce the TriviaQA result following the instructions in the cheatsheet. I used the following instructions from cheatsheet.txt:

    // To run our pretrained TriviaQA large model (replicates the leaderboard results),
    // first download the pytorch-lightning checkpoint:
    // https://ai2-s2-research.s3-us-west-2.amazonaws.com/longformer/triviaqa-longformer-large.tar.gz
    // then run:
    python -m scripts.triviaqa \
        --train_dataset squad-wikipedia-train-4096.json \  # loaded but not used
        --dev_dataset squad-wikipedia-dev-4096.json \
        --gpus 0 --num_workers 4 \
        --max_seq_len 4096 --doc_stride -1 \
        --save_prefix triviaqa-longformer-large \  # pretrained pytorch-lightning checkpoint
        --model_path path/to/pretrained/longformer-large-4096 \  # loaded but not used
        --test  # predictions will be saved into predictions.json

    // then run the official evaluation scripts
    python -m scripts.triviaqa_utils.evaluation_utils \
        --dataset_file path/to/qa/wikipedia-dev.json \
        --prediction_file predictions.json

    // Output should be:
    // {'exact_match': 73.07644188665083, 'f1': 77.78523804802242, 'common': 7993, 'denominator': 7993, 'pred_len': 7993, 'gold_len': 7993}

    But I keep getting the result {'exact_match': 0.025021894157387713, 'f1': 4.579085300341775, 'common': 7993, 'denominator': 7993, 'pred_len': 7993, 'gold_len': 7993}, which is very weird.

    I downloaded the dataset and converted both the train and dev sets into squad format with the provided script, and I only replaced the data and model paths with my server's settings.

    Has anyone reproduced the result f1:77.78 with given pytorch-lightning checkpoint?

    opened by YJYJLee 9
  • How can I train the pre-train model on chinese corpus?

    How can I train the pre-train model on chinese corpus?

    Now I want to train a pretrained model on a Chinese corpus, but the details are not clear: for example, how to make the minimal changes necessary to support Longformer's attention mechanism, and how to plug the attention pattern into a pretrained transformer model.

    opened by liangxg787 9
  • Fine-tuning Longformer for squad (out of memory)

    Fine-tuning Longformer for squad (out of memory)

    I have pretrained an MLM Longformer using roberta-base based on this recipe.

    Then I tried to fine-tune it for squad question-answering. Here is the trainer, and the following is the run-time setting (based on here):

    python run_squad.py \
        --model_type roberta \
        --model_name_or_path pathe_to_roberta_base_mlm_trained_4096 \
        --do_train \
        --do_eval \
        --do_lower_case \
        --train_file $SQUAD_DIR/train-v1.1.json \
        --predict_file $SQUAD_DIR/dev-v1.1.json \
        --per_gpu_train_batch_size 1 \
        --learning_rate 3e-5 \
        --num_train_epochs 2.0 \
        --max_seq_length 4096 \
        --doc_stride 128 \
        --output_dir /tmp/debug_squad/

    While I am using a V100 node (16 GPUs, 32 GB), it always hits the GPU memory limit, as follows:

    File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
    

    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 642, in forward output_hidden_states=output_hidden_states, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 762, in forward output_hidden_states=output_hidden_states, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 439, in forward output_attentions, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 371, in forward hidden_states, attention_mask, head_mask, output_attentions=output_attentions, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 315, in forward hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 240, in forward attention_scores = attention_scores / math.sqrt(self.attention_head_size) RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 31.72 GiB total capacity; 30.25 GiB already allocated; 300.38 MiB free; 30.29 GiB reserved in total by PyTorch)

    However, using allenai/longformer-base-4096, it works. Could you please comment on what I may be missing in the above steps?

    opened by arashashari 8
  • CUDA error: device-side assert triggered, while converting BERT to Long

    CUDA error: device-side assert triggered, while converting BERT to Long

    Hi!

    I got an apparently working code for converting a BERT model into a Longformer, but now I am trying to convert BERTeus to Longformer, which I expected to work in the same way (just changing the dataset + model name/path).

    With a small training corpus (50K lines; the same issue occurs with a big one), the training starts well, but it breaks around step 20, after 3-4 epochs.

    
    2020-09-22 15:01:55.336576: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
    2020-09-22 15:01:55.338202: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
    INFO:__main__:Loading the model from tmp/bert-base-4096
    INFO:transformers.configuration_utils:loading configuration file tmp/bert-base-4096/config.json
    INFO:transformers.configuration_utils:Model config BertConfig {
      "architectures": [
        "BertForMaskedLM"
      ],
      "attention_probs_dropout_prob": 0.1,
      "attention_window": [
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512
      ],
      "gradient_checkpointing": true,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 4096,
      "model_type": "bert",
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "output_past": true,
      "pad_token_id": 3,
      "type_vocab_size": 2,
      "vocab_size": 50099
    }
    
    INFO:transformers.tokenization_utils_base:Model name 'tmp/bert-base-4096' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'tmp/bert-base-4096' is a path, a model identifier, or url to a directory containing tokenizer files.
    INFO:transformers.tokenization_utils_base:Didn't find file tmp/bert-base-4096/added_tokens.json. We won't load it.
    INFO:transformers.tokenization_utils_base:Didn't find file tmp/bert-base-4096/tokenizer.json. We won't load it.
    INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/vocab.txt
    INFO:transformers.tokenization_utils_base:loading file None
    INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/special_tokens_map.json
    INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/tokenizer_config.json
    INFO:transformers.tokenization_utils_base:loading file None
    /mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_auto.py:798: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
      FutureWarning,
    INFO:transformers.configuration_utils:loading configuration file tmp/bert-base-4096/config.json
    INFO:transformers.configuration_utils:Model config BertConfig {
      "architectures": [
        "BertForMaskedLM"
      ],
      "attention_probs_dropout_prob": 0.1,
      "attention_window": [
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512
      ],
      "gradient_checkpointing": true,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 4096,
      "model_type": "bert",
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "output_past": true,
      "pad_token_id": 3,
      "type_vocab_size": 2,
      "vocab_size": 50099
    }
    
    INFO:transformers.modeling_utils:loading weights file tmp/bert-base-4096/pytorch_model.bin
    WARNING:transformers.modeling_utils:Some weights of the model checkpoint at tmp/bert-base-4096 were not used when initializing BertForMaskedLM: ['bert.encoder.layer.0.attention.self.query_global.weight', 'bert.encoder.layer.0.attention.self.query_global.bias', 'bert.encoder.layer.0.attention.self.key_global.weight', 'bert.encoder.layer.0.attention.self.key_global.bias', 'bert.encoder.layer.0.attention.self.value_global.weight', 'bert.encoder.layer.0.attention.self.value_global.bias', 'bert.encoder.layer.1.attention.self.query_global.weight', 'bert.encoder.layer.1.attention.self.query_global.bias', 'bert.encoder.layer.1.attention.self.key_global.weight', 'bert.encoder.layer.1.attention.self.key_global.bias', 'bert.encoder.layer.1.attention.self.value_global.weight', 'bert.encoder.layer.1.attention.self.value_global.bias', 'bert.encoder.layer.2.attention.self.query_global.weight', 'bert.encoder.layer.2.attention.self.query_global.bias', 'bert.encoder.layer.2.attention.self.key_global.weight', 'bert.encoder.layer.2.attention.self.key_global.bias', 'bert.encoder.layer.2.attention.self.value_global.weight', 'bert.encoder.layer.2.attention.self.value_global.bias', 'bert.encoder.layer.3.attention.self.query_global.weight', 'bert.encoder.layer.3.attention.self.query_global.bias', 'bert.encoder.layer.3.attention.self.key_global.weight', 'bert.encoder.layer.3.attention.self.key_global.bias', 'bert.encoder.layer.3.attention.self.value_global.weight', 'bert.encoder.layer.3.attention.self.value_global.bias', 'bert.encoder.layer.4.attention.self.query_global.weight', 'bert.encoder.layer.4.attention.self.query_global.bias', 'bert.encoder.layer.4.attention.self.key_global.weight', 'bert.encoder.layer.4.attention.self.key_global.bias', 'bert.encoder.layer.4.attention.self.value_global.weight', 'bert.encoder.layer.4.attention.self.value_global.bias', 'bert.encoder.layer.5.attention.self.query_global.weight', 'bert.encoder.layer.5.attention.self.query_global.bias', 'bert.encoder.layer.5.attention.self.key_global.weight', 'bert.encoder.layer.5.attention.self.key_global.bias', 'bert.encoder.layer.5.attention.self.value_global.weight', 'bert.encoder.layer.5.attention.self.value_global.bias', 'bert.encoder.layer.6.attention.self.query_global.weight', 'bert.encoder.layer.6.attention.self.query_global.bias', 'bert.encoder.layer.6.attention.self.key_global.weight', 'bert.encoder.layer.6.attention.self.key_global.bias', 'bert.encoder.layer.6.attention.self.value_global.weight', 'bert.encoder.layer.6.attention.self.value_global.bias', 'bert.encoder.layer.7.attention.self.query_global.weight', 'bert.encoder.layer.7.attention.self.query_global.bias', 'bert.encoder.layer.7.attention.self.key_global.weight', 'bert.encoder.layer.7.attention.self.key_global.bias', 'bert.encoder.layer.7.attention.self.value_global.weight', 'bert.encoder.layer.7.attention.self.value_global.bias', 'bert.encoder.layer.8.attention.self.query_global.weight', 'bert.encoder.layer.8.attention.self.query_global.bias', 'bert.encoder.layer.8.attention.self.key_global.weight', 'bert.encoder.layer.8.attention.self.key_global.bias', 'bert.encoder.layer.8.attention.self.value_global.weight', 'bert.encoder.layer.8.attention.self.value_global.bias', 'bert.encoder.layer.9.attention.self.query_global.weight', 'bert.encoder.layer.9.attention.self.query_global.bias', 'bert.encoder.layer.9.attention.self.key_global.weight', 'bert.encoder.layer.9.attention.self.key_global.bias', 'bert.encoder.layer.9.attention.self.value_global.weight', 
'bert.encoder.layer.9.attention.self.value_global.bias', 'bert.encoder.layer.10.attention.self.query_global.weight', 'bert.encoder.layer.10.attention.self.query_global.bias', 'bert.encoder.layer.10.attention.self.key_global.weight', 'bert.encoder.layer.10.attention.self.key_global.bias', 'bert.encoder.layer.10.attention.self.value_global.weight', 'bert.encoder.layer.10.attention.self.value_global.bias', 'bert.encoder.layer.11.attention.self.query_global.weight', 'bert.encoder.layer.11.attention.self.query_global.bias', 'bert.encoder.layer.11.attention.self.key_global.weight', 'bert.encoder.layer.11.attention.self.key_global.bias', 'bert.encoder.layer.11.attention.self.value_global.weight', 'bert.encoder.layer.11.attention.self.value_global.bias']
    - This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
    - This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    INFO:transformers.modeling_utils:All the weights of BertForMaskedLM were initialized from the model checkpoint at tmp/bert-base-4096.
    If your task is similar to the task the model of the ckeckpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
    INFO:__main__:Pretraining bert-base-4096 ... 
    INFO:filelock:Lock 140392820589624 acquired on cached_lm_BertTokenizerFast_4094_valEusLong.txt.lock
    INFO:transformers.data.datasets.language_modeling:Loading features from cached file cached_lm_BertTokenizerFast_4094_valEusLong.txt [took 0.008 s]
    INFO:filelock:Lock 140392820589624 released on cached_lm_BertTokenizerFast_4094_valEusLong.txt.lock
    INFO:__main__:Loading and tokenizing training data is usually slow: trainEusLong1.txt
    INFO:filelock:Lock 140392820589456 acquired on cached_lm_BertTokenizerFast_4094_trainEusLong1.txt.lock
    INFO:transformers.data.datasets.language_modeling:Loading features from cached file cached_lm_BertTokenizerFast_4094_trainEusLong1.txt [took 0.053 s]
    INFO:filelock:Lock 140392820589456 released on cached_lm_BertTokenizerFast_4094_trainEusLong1.txt.lock
    INFO:transformers.training_args:PyTorch: setting up devices
    INFO:transformers.trainer:You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
    INFO:transformers.trainer:***** Running Evaluation *****
    INFO:transformers.trainer:  Num examples = 70
    INFO:transformers.trainer:  Batch size = 1
    Evaluation:   0%|                                                                                                                                                 | 0/70 [00:00<?, ?it/s]/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
      warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
    Evaluation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 70/70 [00:21<00:00,  3.22it/s]
    INFO:transformers.trainer:{'eval_loss': 12.326190962110246, 'step': 0}
    INFO:__main__:Initial eval bpc: 17.782934574086813
    INFO:transformers.trainer:***** Running training *****
    INFO:transformers.trainer:  Num examples = 388
    INFO:transformers.trainer:  Num Epochs = 501
    INFO:transformers.trainer:  Instantaneous batch size per device = 1
    INFO:transformers.trainer:  Total train batch size (w. parallel, distributed & accumulation) = 64
    INFO:transformers.trainer:  Gradient Accumulation steps = 64
    INFO:transformers.trainer:  Total optimization steps = 3000
    INFO:transformers.trainer:  Starting fine-tuning.
    Epoch:   0%|                                                                                                                                                     | 0/501 [00:00<?, ?it/sINFO:transformers.trainer:{'loss': 12.102866038680077, 'learning_rate': 6.000000000000001e-08, 'epoch': 0.16494845360824742, 'step': 1}                  | 63/388 [01:18<06:51,  1.27s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-1
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-1/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-1/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.099215269088745, 'learning_rate': 1.2000000000000002e-07, 'epoch': 0.32989690721649484, 'step': 2}                                 | 127/388 [02:50<05:35,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-2
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-2/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-2/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.078452616930008, 'learning_rate': 1.8e-07, 'epoch': 0.4948453608247423, 'step': 3}                                                 | 191/388 [04:24<04:14,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-3
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-3/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-3/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.023080185055733, 'learning_rate': 2.4000000000000003e-07, 'epoch': 0.6597938144329897, 'step': 4}                                  | 255/388 [05:56<02:50,  1.28s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-4
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-4/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-4/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.003526121377945, 'learning_rate': 3.0000000000000004e-07, 'epoch': 0.8247422680412371, 'step': 5}█████████▉                        | 319/388 [07:29<01:28,  1.29s/it]INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-5
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-5/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-5/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 11.993770495057106, 'learning_rate': 3.6e-07, 'epoch': 0.9896907216494846, 'step': 6}███████████████████████████████████████████████▎ | 383/388 [09:01<00:06,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-6
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-6/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-6/pytorch_model.bin
    Iteration: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:18<00:00,  1.44s/it]
    Epoch:   0%|▎                                                                                                                                        | 1/501 [09:18<77:36:08, 558.74s/it]                 INFO:transformers.trainer:{'loss': 12.672470852732658, 'learning_rate': 4.2e-07, 'epoch': 1.1649484536082475, 'step': 7}                                                   | 63/388 [01:20<06:58,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-7
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-7/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-7/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-8
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-8/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-8/pytorch_model.bin
    
    Iteration:  36%|███████████████████████████████████████████████████████▏                                                                                                 | 140/388 [03:21<05:27,  1.32s/iItINFO:transformers.trainer:{'loss': 11.813278079032898, 'learning_rate': 5.4e-07, 'epoch': 1.4948453608247423, 'step': 9}                                                  | 191/388 [04:27<04:15,  1.30s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-9
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-9/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-9/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-10
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-10/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-10/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-11
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-11/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-11/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-12
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-12/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-12/pytorch_model.bin
    Iteration: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:24<00:00,  1.45s/it]
    Epoch:   0%|▌                                                                                                                                        | 2/501 [18:43<77:40:49, 560.42s/it]<00:00,  2.07s/it]INFO:transformers.trainer:{'loss': 12.117324143648148, 'learning_rate': 7.799999999999999e-07, 'epoch': 2.1649484536082473, 'step': 13}                                     | 63/388 [01:20<06:59,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-13
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-13/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-13/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-14
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-14/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-14/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-15
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-15/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-15/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-16
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-16/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-16/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-17
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-17/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-17/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-18
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-18/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-18/pytorch_model.bin
    Iteration: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:24<00:00,  1.45s/it]
    Epoch:   1%|▊                                                                                                                                        | 3/501 [28:07<77:40:37, 561.52s/it]4<00:00,  2.07s/itINFO:transformers.trainer:{'loss': 11.206573352217674, 'learning_rate': 1.14e-06, 'epoch': 3.1649484536082473, 'step': 19}                                                  | 63/388 [01:20<06:58,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-19
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-19/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-19/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-20
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-20/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-20/pytorch_model.bin
    
    /pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    /pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    /pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    Iteration:  39%|████████████████████████████████████████████████████████████▋                                                                                             | 153/388 [03:38<05:35,  1.43s/it]
    Epoch:   1%|▊                                                                                                                                        | 3/501 [31:45<87:51:44, 635.15s/it]
    Traceback (most recent call last):
      File "BERTeus2LongB.py", line 305, in <module>
        pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
      File "BERTeus2LongB.py", line 183, in pretrain_and_evaluate
        trainer.train(model_path=model_path)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
        tr_loss += self._training_step(model, inputs, optimizer)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
        outputs = model(**inputs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 1083, in forward
        output_hidden_states=output_hidden_states,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 753, in forward
        input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 182, in forward
        embeddings = inputs_embeds + position_embeddings + token_type_embeddings
    RuntimeError: CUDA error: device-side assert triggered
    

    the same run with

    ###########################################

    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    ###########################################

    ...
    Epoch:   1%|▉                                                                                                                                                          | 3/501 [30:52<85:25:53, 617.58s/it]
    Traceback (most recent call last):
      File "BERTeus2LongB.py", line 305, in <module>
        pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
      File "BERTeus2LongB.py", line 183, in pretrain_and_evaluate
        trainer.train(model_path=model_path)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
        tr_loss += self._training_step(model, inputs, optimizer)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
        outputs = model(**inputs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 1083, in forward
        output_hidden_states=output_hidden_states,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 762, in forward
        output_hidden_states=output_hidden_states,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 430, in forward
        encoder_attention_mask,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 155, in checkpoint
        return CheckpointFunction.apply(function, preserve, *args)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 74, in forward
        outputs = run_function(*args)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 420, in custom_forward
        return module(*inputs, output_attentions)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 371, in forward
        hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 315, in forward
        hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 243, in forward
        attention_scores = attention_scores + attention_mask
    RuntimeError: CUDA error: device-side assert triggered
    (transformers) [email protected]:/mnt/datuak/gorka-tmp$ python BERTeus2LongB.py
    

    Any hint what causes this error?

    By the way, I also sometimes got this error, which I am not able to reproduce right now:

     File "BERTeus2LongB.py", line 305, in <module>
        pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
      ...
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/functional.py", line 1372, in linear
        output = input.matmul(weight.t())
    RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
    

    Regards, Gorka

    opened by GorkaUrbizu 7
  • Number of tokens per batch mismatch - longformer vs roberta

    Number of tokens per batch mismatch - longformer vs roberta

    I see in your conversion notebook that you suggest that the number of tokens per batch should be the same as roberta: 2^18 = 260k

    When I look at the roberta paper, it says it uses a sequence length of 512 and a batch size of 8k. This means that each batch has 512*8k = 4M tokens

    Am I missing something?

    opened by nbroad1881 1
  • Answering performance of Longformer-base on the HotpotQA dev set

    Answering performance of Longformer-base on the HotpotQA dev set

    Hi,

    I only found Longformer-base's joint F1 on the HotpotQA dev set in the paper, and I would like to know if my reproduction results (Ans EM = 61.38, Ans F1 = 75.18) are expected. Could you provide some more specific metrics?

    Thank you!

    opened by zycdev 0
  • CVE-2007-4559 Patch

    CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing the input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • Updated BART to Longformer-encoder-decoder (LED) converter

    Updated BART to Longformer-encoder-decoder (LED) converter

    Hi @ibeltagy et al., I'm pre-training BART to Portuguese and converting the pre-trained model to LED following the instructions you gave in the paper and the code at https://github.com/allenai/longformer/blob/caefee668e39cacdece7dd603a0bebf24df6d8ca/scripts/convert_bart_to_longformerencoderdecoder.py.

    The huggingface library is evolving fast; unfortunately, the code you provided is outdated and I had to implement a new version based on yours.

    I have 2 questions:

    1. Could you tell me if everything is ok or if I missed something? https://gist.github.com/erichans/af745a381b28b1c019f96997ddac4cd7
    2. Is the LEDForConditionalGeneration model uploaded to huggingface just a BART model converted to LED or is there something else?

    Thanks in advance!

    opened by erichans 0
  • Why the TVM implementation is memory efficient

    Why the TVM implementation is memory efficient

    Thanks for your excellent work!

    Just want to discuss the memory reduction question. It seems that the TVM implementation does not store fewer matrices (like the Queries, Keys, and Values matrices). The number of Q-K pairs is smaller than in full attention, so the computation is faster, but why does the memory reduction follow a similar trend to the time reduction? The TVM kernel doesn't seem to use any special technique to save memory, and the padded 0 values are also int32, yet the TVM implementation is memory efficient...

    Looking forward to your reply.

    opened by jlidw 0
  • Pretraining longformer for NER on big pdf text

    Pretraining longformer for NER on big pdf text

    Hi, I'm trying to extract entities from documents containing 50-60 pages per document. Can anybody suggest a better approach for this, please? I couldn't find any NER implementation of Longformer.

    opened by ajaysurya1221 0