Super easy library for BERT based NLP models

Overview

Fast-Bert

License: Apache 2.0 | PyPI | Python 3.6, 3.7

New - Learning Rate Finder for Text Classification Training (borrowed with thanks from https://github.com/davidtvs/pytorch-lr-finder)

Supports LAMB optimizer for faster training. Please refer to https://arxiv.org/abs/1904.00962 for the paper on LAMB optimizer.

Supports BERT and XLNet for both Multi-Class and Multi-Label text classification.

Fast-Bert is a deep learning library that lets developers and data scientists train and deploy BERT- and XLNet-based models for natural language processing tasks, beginning with text classification.

FastBert is built on the solid foundations of the excellent Hugging Face BERT PyTorch library, is inspired by fast.ai, and strives to make cutting-edge deep learning accessible to the wider community of machine learning practitioners.

With FastBert, you will be able to:

  1. Train (more precisely fine-tune) BERT, RoBERTa and XLNet text classification models on your custom dataset.

  2. Tune model hyper-parameters such as epochs, learning rate, batch size, optimiser schedule and more.

  3. Save and deploy trained model for inference (including on AWS Sagemaker).

Fast-Bert supports both multi-class and multi-label text classification for the models listed below; in due course it will also support other NLU tasks such as named entity recognition, question answering and custom-corpus fine-tuning.

  1. BERT (from Google) released with the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

  2. XLNet (from Google/CMU) released with the paper XLNet: Generalized Autoregressive Pretraining for Language Understanding by Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov and Quoc V. Le.

  3. RoBERTa (from Facebook), a Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al.

  4. DistilBERT (from HuggingFace), released together with the blog post Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT by Victor Sanh, Lysandre Debut and Thomas Wolf.

Installation

This repo is tested on Python 3.6+.

With pip

Fast-Bert can be installed with pip as follows:

pip install fast-bert

From source

Clone the repository and run:

pip install [--editable] .

or

pip install git+https://github.com/kaushaltrivedi/fast-bert.git

You will also need to install NVIDIA Apex.

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Usage

Text Classification

1. Create a DataBunch object

The databunch object takes training, validation and test CSV files and converts the data into the internal representation for BERT, RoBERTa, DistilBERT or XLNet. It also instantiates the correct data loaders based on the device profile, batch_size and max_seq_length.

from fast_bert.data_cls import BertDataBunch

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col='label',
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=False,
                          model_type='bert')

File format for train.csv and val.csv

index | text | label
0 | Looking through the other comments, I'm amazed that there aren't any warnings to potential viewers of what they have to look forward to when renting this garbage. First off, I rented this thing with the understanding that it was a competently rendered Indiana Jones knock-off. | neg
1 | I've watched the first 17 episodes and this series is simply amazing! I haven't been this interested in an anime series since Neon Genesis Evangelion. This series is actually based off an h-game, which I'm not sure if it's been done before or not, I haven't played the game, but from what I've heard it follows it very well | pos
2 | his movie is nothing short of a dark, gritty masterpiece. I may be bias, as the Apartheid era is an area I've always felt for. | pos

If the column names differ from the usual text and label, provide them through the databunch's text_col and label_col parameters.
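
For example, if your CSV files used hypothetical column names such as review and sentiment, the databunch call might look like this (a sketch based on the example above):

from fast_bert.data_cls import BertDataBunch

# 'review' and 'sentiment' are hypothetical column names used for illustration
databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='review',      # column containing the text
                          label_col='sentiment',  # column containing the label
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=False,
                          model_type='bert')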

labels.csv will contain a list of all unique labels. In this case the file will contain:

pos
neg

For multi-label classification, labels.csv will contain all possible labels:

toxic
severe_toxic
obscene
threat
insult
identity_hate

The file train.csv will then contain one column for each label, with each column value being either 0 or 1. Don't forget to change multi_label=True for multi-label classification in BertDataBunch.

id | text | toxic | severe_toxic | obscene | threat | insult | identity_hate
0 | Why the edits made under my username Hardcore Metallica Fan were reverted? | 0 | 0 | 0 | 0 | 0 | 0
0 | I will mess you up | 1 | 0 | 0 | 1 | 0 | 0

label_col will be a list of label column names. In this case it will be:

['toxic','severe_toxic','obscene','threat','insult','identity_hate']
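
Putting this together, a minimal sketch of a multi-label databunch for the toxic comments example above (same paths and file names as in the earlier example):

from fast_bert.data_cls import BertDataBunch

label_cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer='bert-base-uncased',
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col=label_cols,   # list of label column names
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=True,       # one 0/1 column per label
                          model_type='bert')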

Tokenizer

You can either create a tokenizer object and pass it to DataBunch or you can pass the model name as tokenizer and DataBunch will automatically download and instantiate an appropriate tokenizer object.

For example, to use the XLNet base cased model, set the tokenizer parameter to 'xlnet-base-cased'. DataBunch will automatically download and instantiate an XLNetTokenizer with the vocabulary for the xlnet-base-cased model.
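
If you prefer to build the tokenizer yourself and pass the object, here is a minimal sketch assuming the Hugging Face transformers AutoTokenizer API:

from transformers import AutoTokenizer
from fast_bert.data_cls import BertDataBunch

# Instantiate the tokenizer explicitly and hand the object to the databunch
tokenizer = AutoTokenizer.from_pretrained('xlnet-base-cased')

databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                          tokenizer=tokenizer,    # tokenizer object instead of a model name
                          train_file='train.csv',
                          val_file='val.csv',
                          label_file='labels.csv',
                          text_col='text',
                          label_col='label',
                          batch_size_per_gpu=16,
                          max_seq_length=512,
                          multi_gpu=True,
                          multi_label=False,
                          model_type='xlnet')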

Model Type

Fast-Bert supports XLNet-, RoBERTa- and BERT-based classification models. Set the model_type parameter to 'bert', 'roberta' or 'xlnet' to initialise the appropriate databunch object.

2. Create a Learner Object

BertLearner is the ‘learner’ object that holds everything together. It encapsulates the key logic for the lifecycle of the model such as training, validation and inference.

The learner object takes the databunch created earlier as input, along with other parameters such as the location of one of the pretrained models, FP16 training, and the multi_gpu and multi_label options.

The learner class contains the logic for the training loop, validation loop, optimiser strategies and key metric calculations. This helps developers focus on their custom use cases without worrying about these repetitive activities.

At the same time, the learner object is flexible enough to be customised, either through its parameters or by subclassing BertLearner and redefining the relevant methods.

from fast_bert.learner_cls import BertLearner
from fast_bert.metrics import accuracy
import logging
import torch

logger = logging.getLogger()
device_cuda = torch.device("cuda")
metrics = [{'name': 'accuracy', 'function': accuracy}]

learner = BertLearner.from_pretrained_model(
						databunch,
						pretrained_path='bert-base-uncased',
						metrics=metrics,
						device=device_cuda,
						logger=logger,
						output_dir=OUTPUT_DIR,
						finetuned_wgts_path=None,
						warmup_steps=500,
						multi_gpu=True,
						is_fp16=True,
						multi_label=False,
						logging_steps=50)
parameter | description
databunch | Databunch object created earlier
pretrained_path | Directory containing the pretrained model files, or the name of one of the pretrained models, e.g. bert-base-uncased, xlnet-large-cased, etc.
metrics | List of metric functions that you want the model to calculate on the validation set, e.g. accuracy, fbeta, etc. (see the sketch after this table)
device | torch.device of type cuda or cpu
logger | logger object
output_dir | Directory where the model saves trained artefacts, tokenizer vocabulary and TensorBoard files
finetuned_wgts_path | Location of a fine-tuned language model (experimental feature)
warmup_steps | Number of warmup training steps for the scheduler
multi_gpu | Whether multiple GPUs are available, e.g. when running on an AWS p3.8xlarge instance
is_fp16 | FP16 training
multi_label | Multi-label classification
logging_steps | Number of steps between TensorBoard metric calculations. Set it to 0 to disable TensorBoard logging. Keeping this value too low slows down training, as the model is evaluated each time the metrics are logged
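
The fast_bert.metrics module also provides other metric functions such as fbeta (it appears in the tracebacks quoted further down this page); a sketch of passing more than one metric, assuming these functions exist in your installed version:

from fast_bert.metrics import accuracy, fbeta

# Each metric is a dict with a display name and a callable taking (y_pred, y_true)
metrics = [
    {'name': 'accuracy', 'function': accuracy},
    {'name': 'fbeta', 'function': fbeta}
]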

3. Find the optimal learning rate

The learning rate is one of the most important hyperparameters for model training. We have incorporated the learning rate finder that was proposed by Leslie Smith and then built into the fastai library.

learner.lr_find(start_lr=1e-5,optimizer_type='lamb')

The code is heavily borrowed from David Silva's pytorch-lr-finder library.

Learning rate range test

4. Train the model

learner.fit(epochs=6,
			lr=6e-5,
			validate=True, 	# Evaluate the model after each epoch
			schedule_type="warmup_cosine",
			optimizer_type="lamb")

Fast-Bert now supports the LAMB optimizer. Because of its training speed, LAMB is the default optimizer. You can switch back to AdamW by setting optimizer_type to 'adamw'.
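
For example, to switch back to AdamW while keeping the rest of the configuration unchanged:

learner.fit(epochs=6,
            lr=6e-5,
            validate=True,
            schedule_type="warmup_cosine",
            optimizer_type="adamw")   # AdamW instead of the default LAMB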

5. Save trained model artifacts

learner.save_model()

Model artefacts will be persisted in the output_dir/'model_out' path provided to the learner object. The following files will be persisted:

File name | description
pytorch_model.bin | trained model weights
spiece.model | sentencepiece tokenizer vocabulary (for xlnet models)
vocab.txt | wordpiece tokenizer vocabulary (for bert models)
special_tokens_map.json | special tokens mappings
config.json | model config
added_tokens.json | list of new tokens

As the model artefacts are all stored in the same folder, you will be able to instantiate the learner object to run inference by pointing pretrained_path to this location.
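
A minimal sketch of reloading the trained model this way, reusing the objects defined earlier (the saved folder also contains the tokenizer vocabulary, so the databunch can be rebuilt against the same path):

from fast_bert.learner_cls import BertLearner

MODEL_PATH = OUTPUT_DIR/'model_out'

# Point pretrained_path at the saved artefacts to reload the fine-tuned model
learner = BertLearner.from_pretrained_model(
                        databunch,
                        pretrained_path=MODEL_PATH,
                        metrics=metrics,
                        device=device_cuda,
                        logger=logger,
                        output_dir=OUTPUT_DIR,
                        multi_label=False)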

6. Model Inference

If you already have a Learner object with a trained model, just call the predict_batch method on it with a list of texts:

texts = ['I really love the Netflix original movies',
		 'this movie is not worth watching']
predictions = learner.predict_batch(texts)

If you have a persisted trained model and just want to run inference on it, use the second approach, i.e. the predictor object.

from fast_bert.prediction import BertClassificationPredictor

MODEL_PATH = OUTPUT_DIR/'model_out'

predictor = BertClassificationPredictor(
				model_path=MODEL_PATH,
				label_path=LABEL_PATH, # location for labels.csv file
				multi_label=False,
				model_type='xlnet',
				do_lower_case=False,
				device=None) # set custom torch.device, defaults to cuda if available

# Single prediction
single_prediction = predictor.predict("just get me result for this text")

# Batch predictions
texts = [
	"this is the first text",
	"this is the second text"
	]

multiple_predictions = predictor.predict_batch(texts)
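
The output format below is an assumption based on typical usage of the predictor (a list of (label, confidence) pairs per input text, ordered by confidence); check the output of your installed version before relying on it. A minimal sketch of picking the top label for each text:

# Assumed shape: one list of (label, confidence) pairs per input text
for text, prediction in zip(texts, multiple_predictions):
    top_label, top_confidence = prediction[0]
    print(text, '->', top_label, top_confidence)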

Language Model Fine-tuning

A useful approach when applying BERT-based models to custom datasets is to first fine-tune the language model on the custom dataset, an approach also followed by fast.ai's ULMFiT. The idea is to start with a pre-trained model and further train it on the raw text of the custom dataset. We use the masked LM task to fine-tune the language model.

This section describes how to use FastBert to fine-tune the language model.

1. Import the necessary libraries

The necessary objects are stored in modules with the '_lm' suffix.

# Language model Databunch
from fast_bert.data_lm import BertLMDataBunch
# Language model learner
from fast_bert.learner_lm import BertLMLearner

import torch
from pathlib import Path
from box import Box

2. Define parameters and setup datapaths

# Box is a nice wrapper to create an object from a json dict
args = Box({
    "seed": 42,
    "task_name": 'imdb_reviews_lm',
    "model_name": 'roberta-base',
    "model_type": 'roberta',
    "train_batch_size": 16,
    "learning_rate": 4e-5,
    "num_train_epochs": 20,
    "fp16": True,
    "fp16_opt_level": "O2",
    "warmup_steps": 1000,
    "logging_steps": 0,
    "max_seq_length": 512,
    "multi_gpu": True if torch.cuda.device_count() > 1 else False
})

DATA_PATH = Path('../lm_data/')
LOG_PATH = Path('../logs')
MODEL_PATH = Path('../lm_model_{}/'.format(args.model_type))

DATA_PATH.mkdir(exist_ok=True)
MODEL_PATH.mkdir(exist_ok=True)
LOG_PATH.mkdir(exist_ok=True)
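
The databunch and learner calls below also expect a logger and a torch.device, just as in the classification section; a minimal setup consistent with what was used earlier:

import logging
import torch

logger = logging.getLogger()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")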

3. Create DataBunch object

The BertLMDataBunch class contains a static method from_raw_corpus that takes a list of raw texts and creates a DataBunch for the language model learner.

The method first preprocesses the text list by removing HTML tags, extra spaces and so on, and then creates the files lm_train.txt and lm_val.txt. These files are used for training and evaluating the language model fine-tuning task.

The next step is to featurize the texts: the text is tokenized, numericalized and split into blocks of 512 tokens (including special tokens).

databunch_lm = BertLMDataBunch.from_raw_corpus(
					data_dir=DATA_PATH,
					text_list=texts,
					tokenizer=args.model_name,
					batch_size_per_gpu=args.train_batch_size,
					max_seq_length=args.max_seq_length,
                    multi_gpu=args.multi_gpu,
                    model_type=args.model_type,
                    logger=logger)

As this step can take some time, depending on the size of your custom dataset's text, the featurized data is cached in pickled files in the data_dir/lm_cache folder.
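
If you change the corpus and need the features to be rebuilt, from_raw_corpus also accepts clear_cache and no_cache flags (they appear in its signature in a traceback quoted later on this page); a sketch assuming your installed version exposes them:

# clear_cache discards and rebuilds the pickled features in data_dir/lm_cache
databunch_lm = BertLMDataBunch.from_raw_corpus(
                    data_dir=DATA_PATH,
                    text_list=texts,
                    tokenizer=args.model_name,
                    batch_size_per_gpu=args.train_batch_size,
                    max_seq_length=args.max_seq_length,
                    multi_gpu=args.multi_gpu,
                    model_type=args.model_type,
                    logger=logger,
                    clear_cache=True)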

The next time, instead of using the from_raw_corpus method, you may want to instantiate the DataBunch object directly, as shown below:

databunch_lm = BertLMDataBunch(
						data_dir=DATA_PATH,
						tokenizer=args.model_name,
                        batch_size_per_gpu=args.train_batch_size,
                        max_seq_length=args.max_seq_length,
                        multi_gpu=args.multi_gpu,
                        model_type=args.model_type,
                        logger=logger)

4. Create the LM Learner object

BertLMLearner is the ‘learner’ object that holds everything together. It encapsulates the key logic for the lifecycle of the model such as training, validation and inference.

The learner object takes the databunch created earlier as input, along with other parameters such as the location of one of the pretrained models, FP16 training and the multi_gpu option.

The learner class contains the logic for the training loop, validation loop, and optimizer strategies. This helps developers focus on their custom use cases without worrying about these repetitive activities.

At the same time, the learner object is flexible enough to be customized, either through its parameters or by subclassing BertLearner and redefining the relevant methods.

learner = BertLMLearner.from_pretrained_model(
							dataBunch=databunch_lm,
							pretrained_path=args.model_name,
							output_dir=MODEL_PATH,
							metrics=[],
							device=device,
							logger=logger,
							multi_gpu=args.multi_gpu,
							logging_steps=args.logging_steps,
							fp16_opt_level=args.fp16_opt_level)

5. Train the model

learner.fit(epochs=6,
			lr=6e-5,
			validate=True, 	# Evaluate the model after each epoch
			schedule_type="warmup_cosine",
			optimizer_type="lamb")

Fast-Bert now supports the LAMB optimizer. Because of its training speed, LAMB is the default optimizer. You can switch back to AdamW by setting optimizer_type to 'adamw'.

6. Save trained model artifacts

learner.save_model()

Model artefacts will be persisted in the output_dir/'model_out' path provided to the learner object. The following files will be persisted:

File name | description
pytorch_model.bin | trained model weights
spiece.model | sentencepiece tokenizer vocabulary (for xlnet models)
vocab.txt | wordpiece tokenizer vocabulary (for bert models)
special_tokens_map.json | special tokens mappings
config.json | model config
added_tokens.json | list of new tokens

The pytorch_model.bin file contains the fine-tuned weights; you can point the classification task learner object to this file through the finetuned_wgts_path parameter.
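
As a sketch, the classification learner from the earlier section could then be created like this (names reused from above; the databunch here is a classification databunch built with the same model type):

from fast_bert.learner_cls import BertLearner

# Path to the fine-tuned language model weights saved in the previous step
WGTS_PATH = MODEL_PATH/'model_out'/'pytorch_model.bin'

learner = BertLearner.from_pretrained_model(
                        databunch,                       # classification databunch, not the LM one
                        pretrained_path=args.model_name,
                        metrics=metrics,
                        device=device,
                        logger=logger,
                        output_dir=OUTPUT_DIR,
                        finetuned_wgts_path=WGTS_PATH,   # load the fine-tuned LM weights
                        warmup_steps=500,
                        multi_gpu=args.multi_gpu,
                        is_fp16=args.fp16,
                        multi_label=False,
                        logging_steps=50)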

Amazon Sagemaker Support

The purpose of this library is to let you train and deploy production-grade models. As transformer models require expensive GPUs to train, I have added support for training and deploying models on AWS SageMaker.

The repository contains the docker image and code for building BERT based classification models in Amazon SageMaker.

Please refer to my blog Train and Deploy the Mighty BERT based NLP models using FastBert and Amazon SageMaker, which provides a detailed explanation of using SageMaker with FastBert.

Citation

Please include a mention of this library and the HuggingFace pytorch-transformers library, and a link to this repository, if you use this work in a published or open-source project.

Also include my blogs on this topic:

Comments
  • learner.save_model gives KeyError while saving tokenizer/vocab file


    I'm trying to run the multilabel classification model and while saving the model it give me an error on vocab file learner.save_model() gives below error: image

    Is this because I have not specified some path or because I'm not using a pretrained model path from local as in sample notebook.

    My learner config is as below: image

    DataBunchConfig as below: image

    Any help appreciated. Thanks!

    opened by mohammedayub44 17
  • notebook not working out of the box


    I'm trying to just get the included toxicity notebook to work from a fresh clone and am having some issues:

    1. Out of the box, the data & labels directory are pointing to the wrong place and the DataBunch is using filenames that are not part of the repo. These are fixed easily enough.

    2. It would help if there was a pointer to where to get the PyTorch pretrained model uncased_L-12_H-768_A-12. There is a Google download which will not work with the from_pretrained_model cell:

    FileNotFoundError: [Errno 2] No such file or directory: '../../bert/bert-models/uncased_L-12_H-768_A-12/pytorch_model.bin'
    

    I have been able to get past this step by using 'bert-base-uncased' instead of BERT_PRETRAINED_PATH as the model spec in the tokenizer and from_pretrained_model steps.

    3. Once I get everything loaded, RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 7.43 GiB total capacity; 6.91 GiB already allocated; 10.94 MiB free; 24.36 MiB cached)

    This is a standard 8G GPU compute engine instance on GCP. Advice on how to not run out of memory would help the tutorial a lot.

    opened by mschmill 17
  • Argmax unexpected key and Cant convert Cuda tensor to Numpy error


    Hi I am facing the issue below. I have installed fast-bert using pip and just copied the code from the readme. Any suggestions on how to fix?

    model/tensorboard
    Traceback (most recent call last):
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 61, in _wrapfunc
        return bound(*args, **kwds)
    TypeError: argmax() got an unexpected keyword argument 'axis'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "train_bert.py", line 67, in <module>
        main()
      File "train_bert.py", line 62, in main
        optimizer_type="lamb")
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/fast_bert/learner_cls.py", line 406, in fit
        results = self.validate()
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/fast_bert/learner_cls.py", line 524, in validate
        all_logits, all_labels
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/fast_bert/metrics.py", line 15, in accuracy
        outputs = np.argmax(y_pred, axis=1)
      File "<__array_function__ internals>", line 6, in argmax
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 1153, in argmax
        return _wrapfunc(a, 'argmax', axis=axis, out=out)
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 70, in _wrapfunc
        return _wrapit(obj, method, *args, **kwds)
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 47, in _wrapit
        result = getattr(asarray(obj), method)(*args, **kwds)
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/numpy/core/_asarray.py", line 85, in asarray
        return array(a, dtype, copy=False, order=order)
      File "/usr1/home/rjoshi2/envs/myenv/lib/python3.7/site-packages/torch/tensor.py", line 433, in __array__
        return self.numpy()
    TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
    
    
    opened by rishabhjoshi 8
  • BertDataBunch' object has no attribute 'model_type'


    I have been following the tutorials concerning Fast-Bert: https://pypi.org/project/fast-bert/ https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/discussion/92668

    My goal is to do binary text classifictation. Therefore, my label.csv has only two labels and I set multi_label to False.

    When executing BertLearner.from_pretrained_model, I am receiving the following error:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-240-ef4cead1d6f0> in <module>
         16                                             loss_scale = args['loss_scale'],
         17                                             multi_gpu = True,
    ---> 18                                             multi_label = False)
    
    ~/.local/lib/python3.6/site-packages/fast_bert/learner_cls.py in from_pretrained_model(dataBunch, pretrained_path, output_dir, metrics, device, logger, finetuned_wgts_path, multi_gpu, is_fp16, loss_scale, warmup_steps, fp16_opt_level, grad_accumulation_steps, multi_label, max_grad_norm, adam_epsilon, logging_steps, freeze_transformer_layers)
        131         model_state_dict = None
        132 
    --> 133         model_type = dataBunch.model_type
        134 
        135         if torch.cuda.is_available():
    
    AttributeError: 'BertDataBunch' object has no attribute 'model_type'
    

    What I have tried so far is including model_type = 'bert' to the BertDataBunch command. This has not helped so far. I am quite sure that my .csv's are in the right format, but of course, this could also be one source of the problem. PATH and imported modules should be fine.

    Attached you find my code:

    from pytorch_pretrained_bert.tokenization import BertTokenizer
    from fast_bert.data import BertDataBunch
    
    # Default args. If GPU runs out of memory while training, decrease training
    # batch size
    args = Box({
        "run_text": "tweet sentiment",
        "task_name": "Tweet Sentiment",
        "max_seq_length": 512,
        "do_lower_case": True,
        "train_batch_size": 8,
        "learning_rate": 6e-5,
        "num_train_epochs": 12.0,
        "warmup_proportion": 0.002,
        "local_rank": -1,
        "gradient_accumulation_steps": 1,
        "fp16": True,
        "loss_scale": 128
    })
    
    device = torch.device('cuda')
    
    # check if multiple GPUs are available
    if torch.cuda.device_count() > 1:
        multi_gpu = True
    else:
        multi_gpu = False
    
    # The tokenizer object is used to split the text into tokens used in training
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case = args['do_lower_case'])
        
    # Databunch    
    databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                              tokenizer = tokenizer,
                              train_file = 'X_train.csv', 
                              val_file = 'X_test.csv', 
                              label_file = 'label.csv',
                              text_col = 'text',
                              label_col = 'label',
                              bs = args['train_batch_size'], 
                              maxlen = args['max_seq_length'], 
                              multi_gpu = True, 
                              multi_label = False,
                              model_type = 'bert')
    
    databunch.save()
    num_labels = len(databunch.labels)
    num_labels
    
    # Set logger
    import logging
    import sys
    
    logfile = str(LOG_PATH/'log-{}-{}.txt'.format(run_start_time, args["run_text"]))
    
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
        datefmt='%m/%d/%Y %H:%M:%S',
        handlers=[
            logging.FileHandler(logfile),
            logging.StreamHandler(sys.stdout)
        ])
    
    logger = logging.getLogger()
    logger.info(args)
    

    When executing this field, the error happens:

    from fast_bert.learner_cls import BertLearner
    from fast_bert.metrics import accuracy
    
    # Choose the metrics used for the error function in training
    metrics = []
    metrics.append({'name': 'accuracy', 'function': accuracy})
    
    learner = BertLearner.from_pretrained_model(databunch, 
                                                pretrained_path = "bert-base-uncased", 
                                                metrics = metrics, 
                                                device = device,
                                                logger = logger, 
                                                output_dir = OUTPUT_DIR,
                                                finetuned_wgts_path = None, 
                                                is_fp16 = args['fp16'], 
                                                loss_scale = args['loss_scale'],
                                                multi_gpu = True,
                                                multi_label = False)
    

    Thank you for your help!

    opened by JRatschat 7
  • Binary text classification: The size of tensor a (2) must match the size of tensor b (39) at non-singleton dimension 1


    Hello,

    I'm working on binary text classification with CamemBert using fast-bert.

    When I run the code below

    from fast_bert.data_cls import BertDataBunch
    from fast_bert.learner_cls import BertLearner

    databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                              tokenizer='camembert-base',
                              train_file='train.csv',
                              val_file='val.csv',
                              label_file='labels.csv',
                              text_col='text',
                              label_col='label',
                              batch_size_per_gpu=8,
                              max_seq_length=512,
                              multi_gpu=multi_gpu,
                              multi_label=False,
                              model_type='camembert-base')

    learner = BertLearner.from_pretrained_model(
                              databunch,
                              pretrained_path='camembert-base',  # '/content/drive/My Drive/model/model_out'
                              metrics=metrics,
                              device=device_cuda,
                              logger=logger,
                              output_dir=OUTPUT_DIR,
                              finetuned_wgts_path=None,  # WGTS_PATH
                              warmup_steps=300,
                              multi_gpu=multi_gpu,
                              is_fp16=True,
                              multi_label=False,
                              logging_steps=50)

    learner.fit(epochs=10,
                lr=9e-5,
                validate=True,
                schedule_type="warmup_cosine",
                optimizer_type="adamw")

    Everything works fine until training. I get this error message when I try to train my model:

    RuntimeError                              Traceback (most recent call last)
    <ipython-input> in <module>()
          3             validate=True,
          4             schedule_type="warmup_cosine",
    ----> 5             optimizer_type="adamw")

    2 frames
    /usr/local/lib/python3.6/dist-packages/fast_bert/learner_cls.py in fit(self, epochs, lr, validate, return_results, schedule_type, optimizer_type)
        421         # Evaluate the model against validation set after every epoch
        422         if validate:
    --> 423             results = self.validate()
        424             for key, value in results.items():
        425                 self.logger.info(

    /usr/local/lib/python3.6/dist-packages/fast_bert/learner_cls.py in validate(self, quiet, loss_only)
        515         for metric in self.metrics:
        516             validation_scores[metric["name"]] = metric["function"](
    --> 517                 all_logits, all_labels
        518             )
        519         results.update(validation_scores)

    /usr/local/lib/python3.6/dist-packages/fast_bert/metrics.py in fbeta(y_pred, y_true, thresh, beta, eps, sigmoid)
         56     y_pred = (y_pred > thresh).float()
         57     y_true = y_true.float()
    ---> 58     TP = (y_pred * y_true).sum(dim=1)
         59     prec = TP / (y_pred.sum(dim=1) + eps)
         60     rec = TP / (y_true.sum(dim=1) + eps)

    RuntimeError: The size of tensor a (2) must match the size of tensor b (39) at non-singleton dimension 1

    How can I fix this ?

    opened by NawelAr 6
  • RobertaTokenizer object has no attribute 'add_special_tokens_single_sentence'


    In trying to test out the roberta model I received this error. My setup is the same as in the Fine Tune Model section of the readme.

    transformers==2.0.0 fast-bert==1.4.2

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-17-c876b1d42fd6> in <module>
          7     multi_gpu=args.multi_gpu,
          8     model_type=args.model_type,
    ----> 9     logger=logger)
    
    ~/.conda/envs/transclass/lib/python3.7/site-packages/fast_bert/data_lm.py in from_raw_corpus(data_dir, text_list, tokenizer, batch_size_per_gpu, max_seq_length, multi_gpu, test_size, model_type, logger, clear_cache, no_cache)
        152                                model_type=model_type,
        153                                logger=logger,
    --> 154                                clear_cache=clear_cache, no_cache=no_cache)
        155 
        156     def __init__(self, data_dir, tokenizer, train_file='lm_train.txt', val_file='lm_val.txt',
    
    ~/.conda/envs/transclass/lib/python3.7/site-packages/fast_bert/data_lm.py in __init__(self, data_dir, tokenizer, train_file, val_file, batch_size_per_gpu, max_seq_length, multi_gpu, model_type, logger, clear_cache, no_cache)
        209             train_filepath = str(self.data_dir/train_file)
        210             train_dataset = TextDataset(self.tokenizer, train_filepath, cached_features_file,
    --> 211                                         self.logger, block_size=self.tokenizer.max_len_single_sentence)
        212 
        213             self.train_batch_size = self.batch_size_per_gpu * \
    
    ~/.conda/envs/transclass/lib/python3.7/site-packages/fast_bert/data_lm.py in __init__(self, tokenizer, file_path, cache_path, logger, block_size)
        104 
        105             while len(tokenized_text) >= block_size:  # Truncate in block of block_size
    --> 106                 self.examples.append(tokenizer.add_special_tokens_single_sentence(
        107                     tokenized_text[:block_size]))
        108                 tokenized_text = tokenized_text[block_size:]
    
    AttributeError: 'RobertaTokenizer' object has no attribute 'add_special_tokens_single_sentence'
    

    It appears that the RobertaTokenizer has attributes:

    add_special_tokens add_special_tokens_sequence_pair add_special_tokens_single_sequence add_tokens

    But not add_special_tokens_single_sentence.

    It seems this method is quite similar to add_special_tokens_single_sequence, and perhaps that is the intended method.

    opened by gphillips-ema 6
  • KeyError: 'distilroberta-base' | UnboundLocalError: local variable 'file_path' referenced before assignment


    Step 23/23 : RUN python download_pretrained_models.py --location_dir ./pretrained_models/ --models bert-base-uncased roberta-base distilbert-base-uncased distilroberta-base
     ---> Running in ea0f4907e7f3
    Namespace(location_dir='./pretrained_models/', models=['bert-base-uncased', 'roberta-base', 'distilbert-base-uncased', 'distilroberta-base'])
    model name is bert-base-uncased
    location is pretrained_models/bert-base-uncased
    file path is https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
    100%|██████████| 231508/231508 [00:00<00:00, 23495328.36B/s]
    https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin
    100%|██████████| 440473133/440473133 [00:08<00:00, 50842146.41B/s]
    https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json
    100%|██████████| 313/313 [00:00<00:00, 235567.41B/s]
    model name is roberta-base
    location is pretrained_models/roberta-base
    file path is https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-vocab.json
    100%|██████████| 898823/898823 [00:00<00:00, 35036913.95B/s]
    https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-merges.txt
    100%|██████████| 456318/456318 [00:00<00:00, 34051566.76B/s]
    https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-pytorch_model.bin
    100%|██████████| 501200538/501200538 [00:12<00:00, 41648047.00B/s]
    https://s3.amazonaws.com/models.huggingface.co/bert/roberta-base-config.json
    100%|██████████| 473/473 [00:00<00:00, 325337.13B/s]
    model name is distilbert-base-uncased
    location is pretrained_models/distilbert-base-uncased
    file path is https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
    100%|██████████| 231508/231508 [00:00<00:00, 31093372.52B/s]
    https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-pytorch_model.bin
    100%|██████████| 267967963/267967963 [00:07<00:00, 35726735.77B/s]
    https://s3.amazonaws.com/models.huggingface.co/bert/distilbert-base-uncased-config.json
    100%|██████████| 492/492 [00:00<00:00, 315341.93B/s]
    model name is distilroberta-base
    location is pretrained_models/distilroberta-base
    Traceback (most recent call last):
      File "download_pretrained_models.py", line 113, in download_pretrained_files
        file_path = PRETRAINED_VOCAB_FILES_MAP[model_name]
    KeyError: 'distilroberta-base'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "download_pretrained_models.py", line 203, in <module>
        main()
      File "download_pretrained_models.py", line 198, in main
        for item in args.models
      File "download_pretrained_models.py", line 198, in <listcomp>
        for item in args.models
      File "download_pretrained_models.py", line 130, in download_pretrained_files
        file_path, model_name
    UnboundLocalError: local variable 'file_path' referenced before assignment

    The command '/bin/sh -c python download_pretrained_models.py --location_dir ./pretrained_models/ --models bert-base-uncased roberta-base distilbert-base-uncased distilroberta-base' returned a non-zero code: 1
    Error response from daemon: No such image: fluent-sagemaker-fast-bert:1.0-gpu-py36
    The push refers to repository [182918221797.dkr.ecr.us-east-1.amazonaws.com/fluent-sagemaker-fast-bert]
    An image does not exist locally with the tag: 182918221797.dkr.ecr.us-east-1.amazonaws.com/fluent-sagemaker-fast-bert

    opened by emtropyml 5
  • KeyError: None of the keys are in index


    I am getting this error ,

    KeyError: ("None of [Index(['CLASS_1', 'CLASS_2', 'CLASS_3', 'CLASS_4', 'CLASS_5', 'CLASS_6',\n       'CLASS_7', 'CLASS_8', 'CLASS_9', 'CLASS_10', 'CLASS_11', 'CLASS_12',\n       'CLASS_13', 'CLASS_14', 'CLASS_15', 'CLASS_16', 'CLASS_17', 'CLASS_E',\n       'CLASS_V'],\n      dtype='object')] are in the [index]", 'occurred at index 0')
    
    

    When i run the new_toxic_multilabel.ipynb from sample notebook, I am getting this error for command:

    databunch = BertDataBunch(args['data_dir'], LABEL_PATH, args.model_name, train_file='train.csv', val_file='val.csv',
                              test_data='test.csv',
                              text_col="NOTES", label_col=label_cols,
                              batch_size_per_gpu=args['train_batch_size'], max_seq_length=args['max_seq_length'], 
                              multi_gpu=args.multi_gpu, multi_label=True, model_type=args.model_type)
    

    here is my label_col:

    label_cols = ['CLASS_1','CLASS_2','CLASS_3','CLASS_4','CLASS_5','CLASS_6','CLASS_7','CLASS_8','CLASS_9','CLASS_10','CLASS_11','CLASS_12','CLASS_13','CLASS_14','CLASS_15','CLASS_16','CLASS_17','CLASS_E','CLASS_V']
    

    my labels.csv contains the same classes but listed one after the another: labels.csv:

    
    'CLASS_1'
    'CLASS_2'
    'CLASS_3'
    'CLASS_4'
    'CLASS_5'
    'CLASS_6'
    'CLASS_7'
    'CLASS_8'
    'CLASS_9'
    'CLASS_10'
    'CLASS_11'
    'CLASS_12'
    'CLASS_13'
    'CLASS_14'
    'CLASS_15'
    'CLASS_16'
    'CLASS_17'
    'CLASS_E'
    'CLASS_V'
    

    Here is the traceback:

    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    <ipython-input-13-c5a2ac3a5e99> in <module>
          3                           text_col="NOTES", label_col=label_cols,
          4                           batch_size_per_gpu=args['train_batch_size'], max_seq_length=args['max_seq_length'],
    ----> 5                           multi_gpu=args.multi_gpu, multi_label=True, model_type=args.model_type)
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/fast_bert/data_cls.py in __init__(self, data_dir, label_dir, tokenizer, train_file, val_file, test_data, label_file, text_col, label_col, batch_size_per_gpu, max_seq_length, multi_gpu, multi_label, backend, model_type, logger, clear_cache, no_cache)
        352             if os.path.exists(cached_features_file) == False or self.no_cache == True:
        353                 train_examples = processor.get_train_examples(
    --> 354                     train_file, text_col=text_col, label_col=label_col)
        355 
        356             train_dataset = self.get_dataset_from_examples(
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/fast_bert/data_cls.py in get_train_examples(self, filename, text_col, label_col, size)
        230             data_df = pd.read_csv(os.path.join(self.data_dir, filename))
        231 
    --> 232             return self._create_examples(data_df, "train", text_col=text_col, label_col=label_col)
        233         else:
        234             data_df = pd.read_csv(os.path.join(self.data_dir, filename))
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/fast_bert/data_cls.py in _create_examples(self, df, set_type, text_col, label_col)
        286         else:
        287             return list(df.apply(lambda row: InputExample(guid=row.index, text_a=row[text_col],
    --> 288                                                           label=_get_labels(row, label_col)), axis=1))
        289 
        290 
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
       6926             kwds=kwds,
       6927         )
    -> 6928         return op.get_result()
       6929 
       6930     def applymap(self, func):
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/apply.py in get_result(self)
        184             return self.apply_raw()
        185 
    --> 186         return self.apply_standard()
        187 
        188     def apply_empty_result(self):
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/apply.py in apply_standard(self)
        290 
        291         # compute the result using the series generator
    --> 292         self.apply_series_generator()
        293 
        294         # wrap results
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/apply.py in apply_series_generator(self)
        319             try:
        320                 for i, v in enumerate(series_gen):
    --> 321                     results[i] = self.f(v)
        322                     keys.append(v.name)
        323             except Exception as e:
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/fast_bert/data_cls.py in <lambda>(row)
        286         else:
        287             return list(df.apply(lambda row: InputExample(guid=row.index, text_a=row[text_col],
    --> 288                                                           label=_get_labels(row, label_col)), axis=1))
        289 
        290 
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/fast_bert/data_cls.py in _get_labels(row, label_col)
        273         def _get_labels(row, label_col):
        274             if isinstance(label_col, list):
    --> 275                 return list(row[label_col])
        276             else:
        277                 # create one hot vector of labels
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/series.py in __getitem__(self, key)
       1111             key = check_bool_indexer(self.index, key)
       1112 
    -> 1113         return self._get_with(key)
       1114 
       1115     def _get_with(self, key):
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/series.py in _get_with(self, key)
       1153             # handle the dup indexing case (GH 4246)
       1154             if isinstance(key, (list, tuple)):
    -> 1155                 return self.loc[key]
       1156 
       1157             return self.reindex(key)
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
       1422 
       1423             maybe_callable = com.apply_if_callable(key, self.obj)
    -> 1424             return self._getitem_axis(maybe_callable, axis=axis)
       1425 
       1426     def _is_scalar_access(self, key: Tuple):
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
       1837                     raise ValueError("Cannot index with multidimensional key")
       1838 
    -> 1839                 return self._getitem_iterable(key, axis=axis)
       1840 
       1841             # nested tuple slicing
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
       1131         else:
       1132             # A collection of keys
    -> 1133             keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
       1134             return self.obj._reindex_with_indexers(
       1135                 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
       1090 
       1091         self._validate_read_indexer(
    -> 1092             keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
       1093         )
       1094         return keyarr, indexer
    
    ~/virtualenvs/anaconda3/envs/pytorch/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
       1175                 raise KeyError(
       1176                     "None of [{key}] are in the [{axis}]".format(
    -> 1177                         key=key, axis=self.obj._get_axis_name(axis)
       1178                     )
       1179                 )
    
    KeyError: ("None of [Index(['CLASS_1', 'CLASS_2', 'CLASS_3', 'CLASS_4', 'CLASS_5', 'CLASS_6',\n       'CLASS_7', 'CLASS_8', 'CLASS_9', 'CLASS_10', 'CLASS_11', 'CLASS_12',\n       'CLASS_13', 'CLASS_14', 'CLASS_15', 'CLASS_16', 'CLASS_17', 'CLASS_E',\n       'CLASS_V'],\n      dtype='object')] are in the [index]", 'occurred at index 0')
    
    

    What is the issue?

    opened by adiv5 5
  • use_fast=True not working after upgrade to transformers v2.10.0


    On upgrading to transformers==2.10.0, when instantiating a tokenizer, the vocabulary file is not saved after training. A TypeError is returned when trying to save the tokenizer after training (i.e. on calling data.tokenizer.save_pretrained(path) in learner_util.py).

    I've traced this to line 367 in data_cls.py: https://github.com/kaushaltrivedi/fast-bert/blob/77f09adc7bc2706e0c7e3b8cdd09cb6ddd66ae28/fast_bert/data_cls.py#L367

    if I comment out the use_fast argument, the tokenizer file can be saved correctly, i.e: tokenizer = AutoTokenizer.from_pretrained(tokenizer)#, use_fast=True)

    opened by lingdoc 4
  • The current BertClassificationPredictor has a bug in model_path parameter


    The current BertClassificationPredictor has a bug in model_path parameter when it tries to create tokenizer from AutoTokenizer. It would be good to fix it but also let an option to provide a custom tokenizer.

    opened by markovivl 4
  • ImportError: cannot import name 'ConstantLRSchedule'


    Doing the following steps to install the fast-bert:

    1. pip install fast-bert
    2. git clone https://github.com/NVIDIA/apex
    3. cd apex
    4. pip install -v --no-cache-dir ./
    5. create train.py
    from fast_bert.data_cls import BertDataBunch
    
    DATA_PATH = 'data'
    LABEL_PATH = 'data'
    
    databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                              tokenizer='bert-base-multilingual-uncased',
                              train_file='train.csv',
                              val_file='val.csv',
                              label_file='labels.csv',
                              text_col='text',
                              label_col='label',
                              batch_size_per_gpu=16,
                              max_seq_length=512,
                              multi_gpu=True,
                              multi_label=False,
                              model_type='bert')
    
    6. run it

    Getting the following error:

    Traceback (most recent call last):
      File "/home/kleysonr/.vscode/extensions/ms-python.python-2019.11.50794/pythonFiles/ptvsd_launcher.py", line 43, in <module>
        main(ptvsdArgs)
      File "/home/kleysonr/.vscode/extensions/ms-python.python-2019.11.50794/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 432, in main
        run()
      File "/home/kleysonr/.vscode/extensions/ms-python.python-2019.11.50794/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 316, in run_file
        runpy.run_path(target, run_name='__main__')
      File "/usr/lib/python3.6/runpy.py", line 263, in run_path
        pkg_name=pkg_name, script_name=fname)
      File "/usr/lib/python3.6/runpy.py", line 96, in _run_module_code
        mod_name, mod_spec, pkg_name, script_name)
      File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/data/dev/python/mestrado/aula9/train.py", line 1, in <module>
        from fast_bert.data_cls import BertDataBunch
      File "/home/kleysonr/.virtualenvs/fastai/lib/python3.6/site-packages/fast_bert/__init__.py", line 5, in <module>
        from .learner_cls import BertLearner
      File "/home/kleysonr/.virtualenvs/fastai/lib/python3.6/site-packages/fast_bert/learner_cls.py", line 3, in <module>
        from .learner_util import Learner
      File "/home/kleysonr/.virtualenvs/fastai/lib/python3.6/site-packages/fast_bert/learner_util.py", line 4, in <module>
        from transformers import (ConstantLRSchedule,
    ImportError: cannot import name 'ConstantLRSchedule'
    

    Installed python modules:

    $ pip freeze
    apex==0.1
    beautifulsoup4==4.8.1
    blis==0.4.1
    boto3==1.10.28
    botocore==1.13.28
    Bottleneck==1.3.1
    catalogue==0.0.8
    certifi==2019.11.28
    chardet==3.0.4
    Click==7.0
    cycler==0.10.0
    cymem==2.0.3
    dataclasses==0.7
    docutils==0.15.2
    fast-bert==1.4.4
    fastai==1.0.59
    fastprogress==0.1.22
    idna==2.8
    importlib-metadata==0.23
    jmespath==0.9.4
    joblib==0.14.0
    kiwisolver==1.1.0
    matplotlib==3.1.2
    more-itertools==7.2.0
    murmurhash==1.0.2
    numexpr==2.7.0
    numpy==1.17.4
    nvidia-ml-py3==7.352.0
    packaging==19.2
    pandas==0.25.3
    Pillow==6.2.1
    plac==1.1.3
    preshed==3.0.2
    protobuf==3.11.0
    pyparsing==2.4.5
    python-dateutil==2.8.1
    pytorch-lamb==1.0.0
    pytz==2019.3
    PyYAML==5.1.2
    regex==2019.11.1
    requests==2.22.0
    s3transfer==0.2.1
    sacremoses==0.0.35
    scikit-learn==0.21.3
    scipy==1.3.3
    sentencepiece==0.1.83
    six==1.13.0
    sklearn==0.0
    soupsieve==1.9.5
    spacy==2.2.3
    srsly==0.2.0
    tensorboardX==1.9
    thinc==7.3.1
    torch==1.3.1
    torchvision==0.4.2
    tqdm==4.39.0
    transformers==2.2.0
    urllib3==1.25.7
    wasabi==0.4.0
    zipp==0.6.0
    
    opened by kleysonr 4
  • Updated data.py and data_cls.py to work with xlsx data files


    This hotfix allows xlsx files as data files for training and evaluation. It simply checks whether xlsx is in the filename and uses the read_excel() import function from the pandas library. It may require openpyxl to be installed via pip or another package manager.

    Addresses #311 (possibly others), whereby imports via read_csv() can result in errors due to formatting problems.

    opened by lingdoc 0
  • Updated learner_util.py save_model() to work with an alternate path in string format


    Currently when a path string is provided to learner.save_model(), a directory is not created. This hotfix converts the string to a Path object so that a new directory can be created.

    opened by lingdoc 0
  • DtypeWarning: Columns (0,1) have mixed types. Specify dtype option on import or set low_memory=False


    Hello, I am quite new on the topic, sorry if it's a false issue.

    When loading with BertDataBunch, I got this warning:

    lib/python3.9/site-packages/fast_bert/data_cls.py:231: DtypeWarning: Columns (0,1) have mixed types. Specify dtype option on import or set low_memory=False.
      data_df = pd.read_csv(os.path.join(self.data_dir, filename))
    

    I already have this sort of issue with panda in my code, but with BertDataBunch I can't find a way to set dtype option ? Installed fast-bert yesterday, so latest version I guess

    databunch = BertDataBunch(DATA_PATH, LABEL_PATH,
                                  tokenizer='camembert-base',
                                  train_file='train_set.csv',
                                  val_file='val_set.csv',
                                  label_file='labels.txt',
                                  text_col='source_clean',
                                  label_col=['aaa', 'bbb', 'ccc','ddd', 'eee'],
                                  batch_size_per_gpu=16,
                                  max_seq_length=512,
                                  multi_gpu=False,
                                  multi_label=True,
                                  model_type='camembert-base')
    
    opened by mathieuchateau 2
  • TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'


    learner.fit(epochs=1,
                r=6e-5,
                validate=True,  # Evaluate the model after each epoch
                schedule_type="warmup_cosine",
                optimizer_type="lamb")

    Hi, following the official tutorial ("Language Model Fine-tuning"), I get the error shown in the attached screenshots while running the .fit function.

    opened by FirstGalacticEmpire 0
  • [BUG] AttributeError: 'RobertaTokenizer' object has no attribute 'max_len'


    args = Box({ "seed": 42, "task_name": 'Medical_language_modelling', "model_name": 'roberta-base', "model_type": 'roberta', "train_batch_size": 16, "learning_rate": 4e-5, "num_train_epochs": 20, "fp16": True, "fp16_opt_level": "O2", "warmup_steps": 1000, "logging_steps": 0, "max_seq_length": 512, "multi_gpu": True if torch.cuda.device_count() > 1 else False })

    databunch_lm = BertLMDataBunch.from_raw_corpus( data_dir=Path("./raw_text/"), text_list=list_of_files, tokenizer=args.model_name, batch_size_per_gpu=args.train_batch_size, max_seq_length=args.max_seq_length, multi_gpu=args.multi_gpu, model_type=args.model_type, logger=logger)

    When running the following line I get the following error: "AttributeError: 'RobertaTokenizer' object has no attribute 'max_len'" Which I suspect is due to update, that caused the RobertaTokenizer to lost its attribute max_len.

    opened by FirstGalacticEmpire 0
  • [Suggestion] Pin requirement versions (specifically python-box)


    Hello, I am the developer of python-box and see that it is a requirement in this repo and has not been version pinned. I suggest that you pin it to the max known compatible version in your requirements.txt and/or setup.py file(s):

    python-box[all]~=5.4  
    

    Or without extra dependencies

    python-box~=5.4
    

    Using ~=5.0 (or any minor version) will lock it to the major version of 5 and minimum of minor version specified. If you add a bugfix space for 5.4.0 it would lock it to the minor version 5.4.*.

    The next major release of Box is right around the corner, and while it has many improvements, I want to ensure you have a smooth transition by being able to test at your own leisure to ensure your standard user cases do not run into any issues. I am keeping track of major changes, so please check there as a quick overview of any differences.

    To test new changes, try out the release candidate:

    pip install python-box[all]~=6.0.0rc4
    opened by cdgriffith 0
Releases(v1.8.0)
  • v1.8.0(Jul 9, 2020)

  • v1.7.0(Apr 14, 2020)

    We have switched to Auto-model for Multi-class classification. This would let you train any pretrained model architecture for text classification.

    Source code(tar.gz)
    Source code(zip)
  • v1.6.0(Dec 22, 2019)

    Now supports the initial version of Abstractive Summarisation inference, fast-bert style

    In a not so future release, you will be able to use your custom language model fine-tuned on custom corpus for the encoder model.

    Source code(tar.gz)
    Source code(zip)
  • v1.5.1(Dec 14, 2019)

  • v1.5.0(Nov 28, 2019)

    Three new models have been added in v1.5.0

    • ALBERT (Pytorch) (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
    • CamemBERT (Pytorch) (from Facebook AI Research, INRIA, and La Sorbonne Université), the first large-scale Transformer language model trained on French. Released alongside the paper CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suarez, Yoann Dupont, Laurent Romary, Eric Villemonte de la Clergerie, Djame Seddah, and Benoît Sagot. It was added by @louismartin with the help of @julien-c.
    • DistilRoberta (Pytorch) from @VictorSanh as the third distilled model after DistilBERT and DistilGPT-2.

    Source code(tar.gz)
    Source code(zip)
Owner
Utterworks