Named-entity recognition using neural networks. Easy-to-use and state-of-the-art results.

Overview

NeuroNER

NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com.

This page gives step-by-step instructions to install and use NeuroNER.

Requirements

NeuroNER relies on Python 3, TensorFlow 1.0+, and optionally on BRAT:

  • Python 3: NeuroNER does not work with Python 2.x. On Windows, it has to be Python 3.6 64-bit or later.
  • TensorFlow is a library for machine learning. NeuroNER uses it for its NER engine, which is based on neural networks. Official website: https://www.tensorflow.org
  • BRAT (optional) is a web-based annotation tool. It only needs to be installed if you wish to conveniently create annotations or view the predictions made by NeuroNER. Official website: http://brat.nlplab.org

Installation

For GPU support, the GPU requirements for TensorFlow must be satisfied. If your system does not meet these requirements, you should use the CPU version. To install NeuroNER:

# For CPU support (no GPU support):
pip3 install pyneuroner[cpu]

# For GPU support:
pip3 install pyneuroner[gpu]

You will also need to download some support packages.

  1. The English language module for spaCy:
# Download the SpaCy English module
python -m spacy download en
  2. Download word embeddings from http://neuroner.com/data/word_vectors/glove.6B.100d.zip and unzip them to the folder ./data/word_vectors:
# Get word embeddings
wget -P data/word_vectors http://neuroner.com/data/word_vectors/glove.6B.100d.zip
unzip data/word_vectors/glove.6B.100d.zip -d data/word_vectors/
  3. Load sample datasets. These can be loaded by calling the neuromodel.fetch_data() function from a Python interpreter or with the --fetch_data argument at the command line.
# Load a dataset from the command line
neuroner --fetch_data=conll2003
neuroner --fetch_data=example_unannotated_texts
neuroner --fetch_data=i2b2_2014_deid
# Load a dataset from a Python interpreter
from neuroner import neuromodel
neuromodel.fetch_data('conll2003')
neuromodel.fetch_data('example_unannotated_texts')
neuromodel.fetch_data('i2b2_2014_deid')
  4. Load a pretrained model. Models can be loaded by calling the neuromodel.fetch_model() function from a Python interpreter or with the --fetch_trained_model argument at the command line.
# Load a pre-trained model from the command line
neuroner --fetch_trained_model=conll_2003_en
neuroner --fetch_trained_model=i2b2_2014_glove_spacy_bioes
neuroner --fetch_trained_model=i2b2_2014_glove_stanford_bioes
neuroner --fetch_trained_model=mimic_glove_spacy_bioes
neuroner --fetch_trained_model=mimic_glove_stanford_bioes
# Load a pre-trained model from a Python interpreter
from neuroner import neuromodel
neuromodel.fetch_model('conll_2003_en')
neuromodel.fetch_model('i2b2_2014_glove_spacy_bioes')
neuromodel.fetch_model('i2b2_2014_glove_stanford_bioes')
neuromodel.fetch_model('mimic_glove_spacy_bioes')
neuromodel.fetch_model('mimic_glove_stanford_bioes')

Installing BRAT (optional)

BRAT is a tool that can be used to create, modify, or view BRAT-style annotations. For installation and usage instructions, see the BRAT website.

Installing Perl (platform dependent)

Perl is required because the official CoNLL-2003 evaluation script is written in this language. On Unix and Mac OS X systems, Perl should already be installed. On Windows systems, you may need to install it, for example from http://strawberryperl.com.

Using NeuroNER

NeuroNER can either be run from the command line or from a Python interpreter.

Using NeuroNER from a Python interpreter

To use NeuroNER from a Python interpreter, create an instance of neuromodel.NeuroNER with your desired arguments, and then call the relevant methods. Additional parameters can be set from a parameters.ini file in the working directory. For example:

from neuroner import neuromodel
nn = neuromodel.NeuroNER(train_model=False, use_pretrained_model=True)

More detail to follow.
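
In the meantime, here is a minimal sketch of a prediction workflow from a Python interpreter. It assumes that a pretrained model such as conll_2003_en has already been fetched (see above) and that the NeuroNER object exposes predict() and close() methods; check the package source if your version differs:

# Minimal sketch (assumed API): load a pretrained model and tag a sentence
from neuroner import neuromodel

nn = neuromodel.NeuroNER(train_model=False, use_pretrained_model=True,
                         pretrained_model_folder='./trained_models/conll_2003_en')
# predict() is assumed to return the entities found in the given text
entities = nn.predict('Barack Obama was born in Hawaii.')
print(entities)
# Release the TensorFlow session when done
nn.close()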

Using NeuroNER from the command line

By default, NeuroNER is configured to train and test on the CoNLL-2003 dataset. Running neuroner with the default settings starts training on CoNLL-2003 (the F1-score on the test set should be around 0.90, i.e. on par with state-of-the-art systems). To start the training:

# To use the CPU if you have installed tensorflow, or use the GPU if you have installed tensorflow-gpu:
neuroner

# To use the CPU only if you have installed tensorflow-gpu:
CUDA_VISIBLE_DEVICES="" neuroner

# To use only GPU 1 if you have installed tensorflow-gpu:
CUDA_VISIBLE_DEVICES=1 neuroner

If you wish to change any of NeuroNER's parameters, you can modify the parameters.ini configuration file in your working directory or pass the parameter as a command-line argument.

For example, to reduce the number of training epochs and not use any pre-trained token embeddings:

neuroner --maximum_number_of_epochs=2 --token_pretrained_embedding_filepath=""
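
The same change can be made in the parameters.ini configuration file instead of the command line. A sketch, using the section names from the default configuration file:

# Equivalent settings in parameters.ini
[training]
maximum_number_of_epochs = 2

[ann]
# Leave empty to use random initialization instead of pre-trained embeddings
token_pretrained_embedding_filepath =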

To perform NER on some plain texts using a pre-trained model:

neuroner --train_model=False --use_pretrained_model=True --dataset_text_folder=./data/example_unannotated_texts --pretrained_model_folder=./trained_models/conll_2003_en

If a parameter is specified in both the parameters.ini configuration file and as an argument, then the argument takes precedence (i.e., the parameter in parameters.ini is ignored). You may specify a different configuration file with the --parameters_filepath command line argument. The command line arguments have no default value except for --parameters_filepath, which points to parameters.ini.

NeuroNER has three modes of operation (example invocations follow the list):

  • training mode (from scratch): the dataset folder must have train and valid sets. Test and deployment sets are optional.
  • training mode (from pretrained model): the dataset folder must have train and valid sets. Test and deployment sets are optional.
  • prediction mode (using pretrained model): the dataset folder must have either a test set or a deployment set.
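
For illustration, the three modes roughly correspond to the following invocations, where ./data/my_dataset is a placeholder for your own dataset folder:

# Training mode (from scratch)
neuroner --train_model=True --use_pretrained_model=False --dataset_text_folder=./data/my_dataset

# Training mode (from a pretrained model)
neuroner --train_model=True --use_pretrained_model=True --dataset_text_folder=./data/my_dataset --pretrained_model_folder=./trained_models/conll_2003_en

# Prediction mode (using a pretrained model)
neuroner --train_model=False --use_pretrained_model=True --dataset_text_folder=./data/my_dataset --pretrained_model_folder=./trained_models/conll_2003_en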

Adding a new dataset

A dataset may be provided in either CoNLL-2003 or BRAT format (small format examples follow the lists below). The dataset files and folders should be organized and named as follows:

  • Training set: train.txt file (CoNLL-2003 format) or train folder (BRAT format). It must contain labels.
  • Validation set: valid.txt file (CoNLL-2003 format) or valid folder (BRAT format). It must contain labels.
  • Test set: test.txt file (CoNLL-2003 format) or test folder (BRAT format). It must contain labels.
  • Deployment set: deploy.txt file (CoNLL-2003 format) or deploy folder (BRAT format). It shouldn't contain any label (if it does, labels are ignored).

We provide several examples of datasets:

  • data/conll2003/en: annotated dataset with the CoNLL-2003 format, containing 3 files (train.txt, valid.txt and test.txt).
  • data/example_unannotated_texts: unannotated dataset with the BRAT format, containing 1 folder (deploy/). Note that the BRAT format with no annotation is the same as plain texts.
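
For reference, a CoNLL-2003 file has one token per line with its NER label in the last column (the standard CoNLL-2003 files also include POS and chunk tags) and a blank line between sentences, while a BRAT folder contains .txt files with matching .ann standoff files. A small illustrative fragment of each; the file name train/doc_001.ann is only an example, and its offsets assume a text file starting with "EU rejects German call to boycott British lamb .":

# train.txt (CoNLL-2003 format: token, POS tag, chunk tag, NER label)
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

# train/doc_001.ann (BRAT format; tab-separated: id, type with character offsets, surface text)
T1	ORG 0 2	EU
T2	MISC 11 17	German
T3	MISC 34 41	British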

Using a pretrained model

In order to use a pretrained model, the pretrained_model_folder parameter in the parameters.ini configuration file must be set to the folder containing the pretrained model. The following parameters in parameters.ini must also be set to the same values as in the configuration file located in the specified pretrained_model_folder (example values follow the list):

use_character_lstm
character_embedding_dimension
character_lstm_hidden_state_dimension
token_pretrained_embedding_filepath
token_embedding_dimension
token_lstm_hidden_state_dimension
use_crf
tagging_format
tokenizer
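
For the conll_2003_en model, for example, these values typically look as follows (taken from the default configuration; always verify them against the parameters.ini shipped inside the pretrained model folder, and adjust the embeddings path to wherever you unzipped GloVe):

use_character_lstm = True
character_embedding_dimension = 25
character_lstm_hidden_state_dimension = 25
token_pretrained_embedding_filepath = ./data/word_vectors/glove.6B.100d.txt
token_embedding_dimension = 100
token_lstm_hidden_state_dimension = 100
use_crf = True
tagging_format = bioes
tokenizer = spacy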

Sharing a pretrained model

You are highly encouraged to share models trained on your own datasets, so that other users can use them on their data. We provide the neuroner/prepare_pretrained_model.py script to make it easy to prepare a pretrained model for sharing. To use the script, you only need to specify the output_folder_name, epoch_number, and model_name parameters in the script.
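
A sketch of that edit, with placeholder values for the three variables named above:

# In neuroner/prepare_pretrained_model.py (placeholder values)
output_folder_name = 'my_pretrained_model'
epoch_number = 22
model_name = 'my_dataset_glove_spacy_bioes'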

By default, the only information about the dataset contained in the pretrained model is the list of tokens that appear in the training dataset and the corresponding embeddings learned from that dataset.

If you wish to share a pretrained model without providing any information about the dataset (including the list of tokens appearing in the dataset), you can do so by setting

delete_token_mappings = True

when running the script. In this case, it is highly recommended to use some external pre-trained token embeddings and freeze them while training the model to obtain high performance. This can be done by specifying the token_pretrained_embedding_filepath and setting

freeze_token_embeddings = True

in the parameters.ini configuration file during training.
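
Using the section names from the default configuration file, this corresponds to something like:

[ann]
token_pretrained_embedding_filepath = ./data/word_vectors/glove.6B.100d.txt

[advanced]
freeze_token_embeddings = True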

In order to share a pretrained model, please submit a new issue on the GitHub repository.

Using TensorBoard

You may launch TensorBoard during or after the training phase. To do so, run the following command in the terminal from the NeuroNER folder:

tensorboard --logdir=output

This starts a web server that is accessible at http://127.0.0.1:6006 from your web browser.

Citation

If you use NeuroNER in your publications, please cite this paper:

@article{2017neuroner,
  title={{NeuroNER}: an easy-to-use program for named-entity recognition based on neural networks},
  author={Dernoncourt, Franck and Lee, Ji Young and Szolovits, Peter},
  journal={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2017}
}

The neural network architecture used in NeuroNER is described in this article:

@article{2016deidentification,
  title={De-identification of Patient Notes with Recurrent Neural Networks},
  author={Dernoncourt, Franck and Lee, Ji Young and Uzuner, Ozlem and Szolovits, Peter},
  journal={Journal of the American Medical Informatics Association (JAMIA)},
  year={2016}
}
Comments
  • Trouble running main.py missing module attribute-distutils-util


    Hey, thanks for the code. Unfortunately I am having difficulty running the pretrained conll_2003_en model. My parameters.ini file looks like this:

    
    #----- Possible modes of operation -----------------------------------------------------------------------------------------------------------------#
    # training mode (from scratch): set train_model to True, and use_pretrained_model to False (if training from scratch).                        #
    #				 				Must have train and valid sets in the dataset_text_folder, and test and deployment sets are optional.               #
    # training mode (from pretrained model): set train_model to True, and use_pretrained_model to True (if training from a pretrained model).     #
    #				 						 Must have train and valid sets in the dataset_text_folder, and test and deployment sets are optional.      #
    # prediction mode (using pretrained model): set train_model to False, and use_pretrained_model to True.                                       #
    #											Must have either a test set or a deployment set.                                                        #
    # NOTE: Whenever use_pretrained_model is set to True, pretrained_model_folder must be set to the folder containing the pretrained model to use, and #
    # 		model.ckpt, dataset.pickle and parameters.ini must exist in the same folder as the checkpoint file.                                         #
    #---------------------------------------------------------------------------------------------------------------------------------------------------#
    
    [mode]
    # At least one of use_pretrained_model and train_model must be set to True.
    train_model = False
    use_pretrained_model = True
    pretrained_model_folder = ../trained_models/conll_2003_en
    
    [dataset]
    dataset_text_folder = ../data/conll2003/en
    
    # main_evaluation_mode should be either 'conll', 'bio', 'token', or 'binary'. ('conll' is entity-based)
    # It determines which metric to use for early stopping, displaying during training, and plotting F1-score vs. epoch.
    main_evaluation_mode = conll
    
    output_folder = ../output
    
    #---------------------------------------------------------------------------------------------------------------------#
    # The parameters below are for advanced users. Their default values should yield good performance in most cases.      #
    #---------------------------------------------------------------------------------------------------------------------#
    
    [ann]
    use_character_lstm = True
    character_embedding_dimension = 25
    character_lstm_hidden_state_dimension = 25
    
    # In order to use random initialization instead, set token_pretrained_embedding_filepath to empty string, as below:
    # token_pretrained_embedding_filepath =
    token_pretrained_embedding_filepath = ../data/word_vectors/glove.6B.100d.txt
    token_embedding_dimension = 100
    token_lstm_hidden_state_dimension = 100
    
    use_crf = True
    
    [training]
    patience = 10
    maximum_number_of_epochs = 100
    
    # optimizer should be either 'sgd', 'adam', or 'adadelta'
    optimizer = sgd
    learning_rate = 0.005
    # gradients will be clipped above |gradient_clipping_value| and below -|gradient_clipping_value|, if gradient_clipping_value is non-zero
    # (set to 0 to disable gradient clipping)
    gradient_clipping_value = 5.0
    
    # dropout_rate should be between 0 and 1
    dropout_rate = 0.5
    
    # Upper bound on the number of CPU threads NeuroNER will use
    number_of_cpu_threads = 8
    
    # Upper bound on the number of GPU NeuroNER will use
    # If number_of_gpus > 0, you need to have installed tensorflow-gpu
    number_of_gpus = 0
    
    [advanced]
    experiment_name = test
    
    # tagging_format should be either 'bioes' or 'bio'
    tagging_format = bioes
    
    # tokenizer should be either 'spacy' or 'stanford'. The tokenizer is only used when the original data is provided only in BRAT format.
    # - 'spacy' refers to spaCy (https://spacy.io). To install spacy: pip install -U spacy
    # - 'stanford' refers to Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/). Stanford CoreNLP is written in Java: to use it one has to start a
    #              Stanford CoreNLP server, which can tokenize sentences given on the fly. Stanford CoreNLP is portable, which means that it can be run
    #              without any installation.
    #              To download Stanford CoreNLP: https://stanfordnlp.github.io/CoreNLP/download.html
    #              To run Stanford CoreNLP, execute in the terminal: `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 50000`
    #              By default Stanford CoreNLP is in English. To use it in other languages, see: https://stanfordnlp.github.io/CoreNLP/human-languages.html
    #              Stanford CoreNLP 3.6.0 and higher requires Java 8. We have tested NeuroNER with Stanford CoreNLP 3.6.0.
    tokenizer = spacy
    # spacylanguage should be either 'de' (German), 'en' (English) or 'fr' (French). (https://spacy.io/docs/api/language-models)
    # To install the spaCy language: `python -m spacy.de.download`; or `python -m spacy.en.download`; or `python -m spacy.fr.download`
    spacylanguage = en
    
    # If remap_unknown_tokens is set to True, map to UNK any token that hasn't been seen in neither the training set nor the pre-trained token embeddings.
    remap_unknown_tokens_to_unk = True
    
    # If load_only_pretrained_token_embeddings is set to True, then token embeddings will only be loaded if it exists in token_pretrained_embedding_filepath
    # or in pretrained_model_checkpoint_filepath, even for the training set.
    load_only_pretrained_token_embeddings = False
    
    # If load_all_pretrained_token_embeddings is set to True, then all pretrained token embeddings will be loaded even for the tokens that do not appear in the dataset.
    load_all_pretrained_token_embeddings = False
    
    # If check_for_lowercase is set to True, the lowercased version of each token will also be checked when loading the pretrained embeddings.
    # For example, if the token 'Boston' does not exist in the pretrained embeddings, then it is mapped to the embedding of its lowercased version 'boston',
    # if it exists among the pretrained embeddings.
    check_for_lowercase = True
    
    # If check_for_digits_replaced_with_zeros is set to True, each token with digits replaced with zeros will also be checked when loading pretrained embeddings.
    # For example, if the token '123-456-7890' does not exist in the pretrained embeddings, then it is mapped to the embedding of '000-000-0000',
    # if it exists among the pretrained embeddings.
    # If both check_for_lowercase and check_for_digits_replaced_with_zeros are set to True, then the lowercased version is checked before the digit-zeroed version.
    check_for_digits_replaced_with_zeros = True
    
    # If freeze_token_embeddings is set to True, token embedding will remain frozen (not be trained).
    freeze_token_embeddings = False
    
    # If debug is set to True, only 200 lines will be loaded for each split of the dataset.
    debug = False
    verbose = False
    
    # plot_format specifies the format of the plots generated by NeuroNER. It should be either 'png' or 'pdf'.
    plot_format = pdf
    
    # specify which layers to reload from the pretrained model
    reload_character_embeddings = True
    reload_character_lstm = True
    reload_token_embeddings = True
    reload_token_lstm = True
    reload_feedforward = True
    reload_crf = True
    
    parameters_filepath = ./parameters.ini
    

    However, when I run python3.5 main.py or python3.5 main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en, I get this:

    TensorFlow version: 1.3.0
    Traceback (most recent call last):
      File "main.py", line 250, in <module>
        main()
      File "main.py", line 245, in main
        nn = NeuroNER(**arguments)
      File "/home/beast/Documents/NeuroNER-master/src/neuroner.py", line 257, in __init__
        parameters, conf_parameters = self._load_parameters(arguments['parameters_filepath'], arguments=arguments)
      File "/home/beast/Documents/NeuroNER-master/src/neuroner.py", line 118, in _load_parameters
        parameters[k] = distutils.util.strtobool(v)
    AttributeError: module 'distutils' has no attribute 'util'
    Exception ignored in: <bound method NeuroNER.__del__ of <neuroner.NeuroNER object at 0x7fb575c84588>>
    Traceback (most recent call last):
      File "/home/beast/Documents/NeuroNER-master/src/neuroner.py", line 489, in __del__
        self.sess.close()
    AttributeError: 'NeuroNER' object has no attribute 'sess'
    
    Any ideas? 
    Thanks
    
    opened by Blair-Young 15
  • License?


    As you provide no explicit license, that makes this project unusable to anybody. No license is equivalent to "only look, don't touch". Even in a purely academic or research context, it is technically illegal to use code from a repo with no license.

    opened by fnl 13
  • Using NeuroNER with Brat with custom annotations


    Hi Franck and all the rest of you! First, thanks for all the great work with this.

    I am trying to get started with using NeuroNER on a dataset I have annotated with BRAT, within a specific domain (music), so the entities are all custom. I want to add this dataset to NeuroNER by adding the BRAT files (the .txt file and the .ann file) and then train on it, but I am not getting anywhere.

    As I understand it from your docs, I should put both train.txt and train.ann in a folder and then point to that in parameters.ini, but I guess I misunderstood because I'm not making progress...

    Could you offer some guidance on how to get started with this?

    Best regards, Robert

    enhancement 
    opened by robertkviby 12
  • AttributeError: module 'distutils' has no attribute 'util'


    File "main.py", line 250, in main() File "main.py", line 245, in main nn = NeuroNER(**arguments) File "/home/server1/share/NeuroNER-master/src/neuroner.py", line 257, in init parameters, conf_parameters = self._load_parameters(arguments['parameters_filepath'], arguments=arguments) File "/home/server1/share/NeuroNER-master/src/neuroner.py", line 118, in _load_parameters parameters[k] = distutils.util.strtobool(v) AttributeError: module 'distutils' has no attribute 'util' Exception ignored in: <bound method NeuroNER.del of <neuroner.NeuroNER object at 0x7fb8d35fcda0>> Traceback (most recent call last): File "/home/server1/share/NeuroNER-master/src/neuroner.py", line 489, in del self.sess.close() AttributeError: 'NeuroNER' object has no attribute 'sess'

    opened by w2781993753 10
  • FileNotFoundError with conll_output_filepath


    I'm trying to follow the steps from the README to run NeuroNER using the default parameters.ini settings. I'm running into a FileNotFoundError at train.py, line 95.

    I'm new to python, but will try to track down the source of the issue. It looks like maybe the file should've been created by this line, but it's not clear to me why.

    This is on ubuntu 16.04, python 3.5.2. Any advice on how to debug this issue?

    Here's partial output from running python main.py:

    Starting epoch 0
    Training completed in 0.00 seconds
    Evaluate model on the train set
    Traceback (most recent call last):
      File "main.py", line 445, in <module>
        main()
      File "main.py", line 392, in main
        y_pred, y_true, output_filepaths = train.predict_labels(sess, model, transition_params_trained, parameters, dataset, epoch_number, stats_graph_folder, dataset_filepaths)
      File "/home/user/neuroner/neuroner/src/train.py", line 113, in predict_labels
        prediction_output = prediction_step(sess, dataset, dataset_type, model, transition_params_trained, stats_graph_folder, epoch_number, parameters, dataset_filepaths)
      File "/home/user/neuroner/neuroner/src/train.py", line 95, in prediction_step
        with open(conll_output_filepath, 'r') as f:
    FileNotFoundError: [Errno 2] No such file or directory: '../output/en_2017-07-05_22-35-05-549137/000_train.txt_conll_evaluation.txt'
    
    
    question 
    opened by davidbenton 10
  • NeuroNER installation on Windows


    I am new to Python and have been trying to install and run NeuroNER on Windows for 2 days, but it's not running and I think I am not able to install it properly on Windows 10 64-bit. The installation tutorial for Ubuntu is available, but for Windows I am unable to find any video tutorial. Can anyone please create a step-by-step video tutorial or installation manual with step-by-step screenshots? I really need it for my MS research ASAP.

    question 
    opened by Rabia-Noureen 10
  • The output file is not created.


    Hi. First of all, I will explain the steps I have taken for unannotated texts.

    The dataset folder now contains a deploy folder with phrase.txt. No other files are included. In that case, when I run main.py, I get an error that train.txt is not found (FileNotFoundError).

    If I include train, test and valid files in the same folder, it runs, but I am not getting the expected output.

    As per the instructions given, I have changed these parameters in the parameters.ini file in src:

    train_model=False
    use_pretrained_model=True
    pretrained_model_folder=../trained_models/conll_2003_en
    dataset_text_folder=../data/dataset

    use_character_lstm=True
    character_embedding_dimension=25
    character_lstm_hidden_state_dimension=25
    token_pretrained_embedding_filepath=../data/word_vectors/glove.6B.100d.txt
    token_embedding_dimension=100
    token_lstm_hidden_state_dimension=100

    use_crf=True
    tagging_format=bioes
    tokenizer=spacy

    Please help me to find out what error I am making in this.

    question 
    opened by elsaresearch 10
  • AttributeError: 'PolyCollection' object has no attribute 'get_axes'


    I see the following error. Is there a way to fix it? Thanks.

    ~/linux/test/python/nlpdl/NeuroNER/src$ python3 main.py
    NeuroNER version: 1.0-dev
    TensorFlow version: 1.4.0
    {'character_embedding_dimension': 25,
     'character_lstm_hidden_state_dimension': 25,
     'check_for_digits_replaced_with_zeros': 1,
     'check_for_lowercase': 1,
     'dataset_text_folder': '../data/conll2003/en',
     'debug': 0,
     'dropout_rate': 0.5,
     'experiment_name': 'test',
     'freeze_token_embeddings': 0,
     'gradient_clipping_value': 5.0,
     'learning_rate': 0.005,
     'load_all_pretrained_token_embeddings': 0,
     'load_only_pretrained_token_embeddings': 0,
     'main_evaluation_mode': 'conll',
     'maximum_number_of_epochs': 100,
     'number_of_cpu_threads': 8,
     'number_of_gpus': 0,
     'optimizer': 'sgd',
     'output_folder': '../output',
     'parameters_filepath': './parameters.ini',
     'patience': 10,
     'plot_format': 'pdf',
     'pretrained_model_folder': '../trained_models/conll_2003_en',
     'reload_character_embeddings': 1,
     'reload_character_lstm': 1,
     'reload_crf': 1,
     'reload_feedforward': 1,
     'reload_token_embeddings': 1,
     'reload_token_lstm': 1,
     'remap_unknown_tokens_to_unk': 1,
     'spacylanguage': 'en',
     'tagging_format': 'bioes',
     'token_embedding_dimension': 100,
     'token_lstm_hidden_state_dimension': 100,
     'token_pretrained_embedding_filepath': '../data/word_vectors/glove.6B.100d.txt',
     'tokenizer': 'spacy',
     'train_model': 1,
     'use_character_lstm': 1,
     'use_crf': 1,
     'use_pretrained_model': 0,
     'verbose': 0}
    Formatting train set from CONLL to BRAT... Done.
    Converting CONLL from BIO to BIOES format... Done.
    Formatting valid set from CONLL to BRAT... Done.
    Converting CONLL from BIO to BIOES format... Done.
    Formatting test set from CONLL to BRAT... Done.
    Converting CONLL from BIO to BIOES format... Done.
    Load dataset... done (93.21 seconds)
    Load token embeddings... done (0.43 seconds)
    number_of_token_original_case_found: 14618
    number_of_token_lowercase_found: 11723
    number_of_token_digits_replaced_with_zeros_found: 119
    number_of_token_lowercase_and_digits_replaced_with_zeros_found: 16
    number_of_loaded_word_vectors: 26476
    dataset.vocabulary_size: 28984
    
    Starting epoch 0
    Training completed in 0.00 seconds
    Evaluate model on the train set
    processed 203621 tokens with 23499 phrases; found: 198223 phrases; correct: 3218.
    accuracy:   3.71%; precision:   1.62%; recall:  13.69%; FB1:   2.90
                  LOC: precision:   1.32%; recall:   2.77%; FB1:   1.79  14951
                 MISC: precision:   1.23%; recall:  60.62%; FB1:   2.42  168908
                  ORG: precision:  16.17%; recall:   5.32%; FB1:   8.00  2078
                  PER: precision:   4.88%; recall:   9.09%; FB1:   6.35  12286
    
    Evaluate model on the valid set
    processed 51362 tokens with 5942 phrases; found: 50303 phrases; correct: 869.
    accuracy:   3.43%; precision:   1.73%; recall:  14.62%; FB1:   3.09
                  LOC: precision:   1.58%; recall:   2.94%; FB1:   2.06  3415
                 MISC: precision:   1.27%; recall:  59.87%; FB1:   2.48  43568
                  ORG: precision:  21.15%; recall:   6.04%; FB1:   9.40  383
                  PER: precision:   6.20%; recall:   9.88%; FB1:   7.62  2937
    
    Evaluate model on the test set
    processed 46435 tokens with 5648 phrases; found: 45225 phrases; correct: 689.
    accuracy:   3.43%; precision:   1.52%; recall:  12.20%; FB1:   2.71
                  LOC: precision:   1.59%; recall:   3.24%; FB1:   2.13  3404
                 MISC: precision:   1.09%; recall:  59.26%; FB1:   2.13  38275
                  ORG: precision:  16.52%; recall:   4.46%; FB1:   7.02  448
                  PER: precision:   4.68%; recall:   8.97%; FB1:   6.15  3098
    
    Generating plots for the train set
    Traceback (most recent call last):
      File "main.py", line 250, in <module>
        main()
      File "main.py", line 246, in main
        nn.fit()
      File "/Users/xxx/linux/test/python/nlpdl/NeuroNER/src/neuroner.py", line 394, in fit
        evaluate.evaluate_model(results, dataset, y_pred, y_true, stats_graph_folder, epoch_number, epoch_start_time, output_filepaths, parameters)
      File "/Users/xxx/linux/test/python/nlpdl/NeuroNER/src/evaluate.py", line 239, in evaluate_model
        verbose=verbose)
      File "/Users/xxx/linux/test/python/nlpdl/NeuroNER/src/evaluate.py", line 22, in assess_model
        cmap='RdBu')
      File "/Users/xxx/linux/test/python/nlpdl/NeuroNER/src/utils_plots.py", line 162, in plot_classification_report
        heatmap(np.array(plotMat), title, xlabel, ylabel, xticklabels, yticklabels, figure_width, figure_height, correct_orientation, cmap=cmap)
      File "/Users/xxx/linux/test/python/nlpdl/NeuroNER/src/utils_plots.py", line 112, in heatmap
        show_values(c, fmt=fmt)
      File "/Users/xxx/linux/test/python/nlpdl/NeuroNER/src/utils_plots.py", line 36, in show_values
        ax = pc.get_axes()
    AttributeError: 'PolyCollection' object has no attribute 'get_axes'
    
    opened by pengyu 8
  • What should be the max epoch_number for training?


    What should the max epoch_number value be to train the model on the first run of main.py? I tried to change it with

    python main.py --maximum_number_of_epochs=2 --token_pretrained_embedding_filepath=""
    

    And after 3 or 4 runs the model starts training again. Should it stop at some point?

    question 
    opened by Rabia-Noureen 8
  • Only predicting "O", also on provided examples

    Running python3.5 main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en just yields "O"s.

    Output:

    NeuroNER version: 1.0-dev
    TensorFlow version: 1.1.0
    NeuroNER version: 1.0-dev
    TensorFlow version: 1.1.0
    {'character_embedding_dimension': 25,
     'character_lstm_hidden_state_dimension': 25,
     'check_for_digits_replaced_with_zeros': 1,
     'check_for_lowercase': 1,
     'dataset_text_folder': '../data/example_unannotated_texts',
     'debug': 0,
     'dropout_rate': 0.5,
     'experiment_name': 'test',
     'freeze_token_embeddings': 0,
     'gradient_clipping_value': 5.0,
     'learning_rate': 0.005,
     'load_only_pretrained_token_embeddings': 0,
     'main_evaluation_mode': 'conll',
     'maximum_number_of_epochs': 100,
     'number_of_cpu_threads': 8,
     'number_of_gpus': 0,
     'optimizer': 'sgd',
     'output_folder': '../output',
     'parameters_filepath': './parameters.ini',
     'patience': 10,
     'plot_format': 'pdf',
     'pretrained_model_folder': '../trained_models/conll_2003_en',
     'reload_character_embeddings': 1,
     'reload_character_lstm': 1,
     'reload_crf': 1,
     'reload_feedforward': 1,
     'reload_token_embeddings': 1,
     'reload_token_lstm': 1,
     'remap_unknown_tokens_to_unk': 1,
     'spacylanguage': 'en',
     'tagging_format': 'bioes',
     'token_embedding_dimension': 100,
     'token_lstm_hidden_state_dimension': 100,
     'token_pretrained_embedding_filepath': '../data/word_vectors/glove.6B.100d.txt',
     'tokenizer': 'spacy',
     'train_model': 0,
     'use_character_lstm': 1,
     'use_crf': 1,
     'use_pretrained_model': 1,
     'verbose': 0}
    Formatting deploy set from BRAT to CONLL... Done.
    Converting CONLL from BIO to BIOES format... Done.
    Load dataset... done (40.78 seconds)
    
    Starting epoch 0
    Load token embeddings... done (89.64 seconds)
    number_of_token_original_case_found: 94
    number_of_token_lowercase_found: 25
    number_of_token_digits_replaced_with_zeros_found: 0
    number_of_token_lowercase_and_digits_replaced_with_zeros_found: 0
    number_of_loaded_word_vectors: 119
    dataset.vocabulary_size: 119
    Load token embeddings from pretrained model... done (0.22 seconds)
    number_of_loaded_vectors: 104
    dataset.vocabulary_size: 119
    Load character embeddings from pretrained model... done (0.23 seconds)
    number_of_loaded_vectors: 58
    dataset.alphabet_size: 58
    Training completed in 92.45 seconds
    Predict labels for the deploy set
    Formatting 000_deploy set from CONLL to BRAT... Done.
    Finishing the experiment
    
    question 
    opened by kootenpv 8
  • Using pre-trained model example not working


    Hi,

    Thanks a lot for the model and the code! They are very useful.

    I'm trying to re-use the conll-2003 pre-trained model like in the example, using the example files in the same folder path (..\data\example_unannotated_texts\deploy).

    with: dataset_text_folder = ../data/example_unannotated_texts

    while the text files to be annotated are in ..\data\example_unannotated_texts\deploy

    The output text file is empty, and the loading shows that it found no tokens.

    I tried running it from dataset_text_folder = ../data/example_unannotated_texts/deploy instead, but then I get an error message (assertion error) saying that the tag is not 'O' (from the remove BIO function).

    I also get an error message from spacy (asking to download the 'en' data from it first, which I did a few times already) the first time I run the code on the data to annotate. If I run it a second time, it then runs but the spacy file created during the first run is empty (which I believe is the problem).

    Thanks for your help! Yoann

    question 
    opened by YoannMR 7
  • Bump tensorflow from 1.1.0 to 2.9.3


    Bumps tensorflow from 1.1.0 to 2.9.3.

    Release notes

    Sourced from tensorflow's releases.

    TensorFlow 2.9.3

    Release 2.9.3

    This release introduces several vulnerability fixes:

    TensorFlow 2.9.2

    Release 2.9.2

    This release introduces several vulnerability fixes:

    ... (truncated)

    Changelog

    Sourced from tensorflow's changelog.

    Release 2.9.3

    This release introduces several vulnerability fixes:

    Release 2.8.4

    This release introduces several vulnerability fixes:

    ... (truncated)

    Commits
    • a5ed5f3 Merge pull request #58584 from tensorflow/vinila21-patch-2
    • 258f9a1 Update py_func.cc
    • cd27cfb Merge pull request #58580 from tensorflow-jenkins/version-numbers-2.9.3-24474
    • 3e75385 Update version numbers to 2.9.3
    • bc72c39 Merge pull request #58482 from tensorflow-jenkins/relnotes-2.9.3-25695
    • 3506c90 Update RELEASE.md
    • 8dcb48e Update RELEASE.md
    • 4f34ec8 Merge pull request #58576 from pak-laura/c2.99f03a9d3bafe902c1e6beb105b2f2417...
    • 6fc67e4 Replace CHECK with returning an InternalError on failing to create python tuple
    • 5dbe90a Merge pull request #58570 from tensorflow/r2.9-7b174a0f2e4
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.



    dependencies 
    opened by dependabot[bot] 1
  • Project dependencies may have API risk issues


    Hi. In NeuroNER, inappropriate dependency version constraints can cause risks.

    Below are the dependencies and version constraints that the project is using

    matplotlib==3.0.2
    networkx==2.2
    pycorenlp==0.3.0
    scikit-learn==0.20.2
    scipy==1.2.0
    spacy==2.0.18
    tensorflow==1.1.0
    numpy==1.16.0
    

    The version constraint == introduces a risk of dependency conflicts because the dependency scope is too strict. Version constraints with no upper bound, or *, introduce a risk of missing-API errors because the latest versions of the dependencies may remove some APIs.

    After further analysis, in this project the version constraint of the matplotlib dependency can be changed to >=1.3.0,<=3.0.3; networkx can be changed to >=2.0,<=2.8.4; scikit-learn can be changed to >=0.15.0,<=0.20.4; and spacy can be changed to >=0.100.0,<=3.3.1.

    The above modification suggestions reduce dependency conflicts as much as possible while introducing the latest possible versions that do not cause API errors in the project.

    The invocation of the current project includes all the following methods.

    The calling methods from the matplotlib
    matplotlib.colors.ListedColormap
    matplotlib.cm.get_cmap
    matplotlib.use
    
    The calling methods from the networkx
    max
    
    The calling methods from the scikit-learn
    sklearn.preprocessing.LabelBinarizer.fit
    sklearn.preprocessing.LabelBinarizer.transform
    sklearn.metrics.precision_recall_fscore_support
    sklearn.preprocessing.normalize
    random.choice
    sklearn.metrics.confusion_matrix
    sklearn.metrics.classification_report
    sklearn.preprocessing.LabelBinarizer
    sklearn.metrics.f1_score
    sklearn.metrics.accuracy_score
    
    The calling methods from the spacy
    spacy.load
    
    The calling methods from the all methods
    distutils.util.strtobool
    neuroner.conll_to_brat.conll_to_brat
    plot_handles.extend
    neuroner.neuromodel.fetch_data
    numpy.arange
    bioes_filepath.utils.get_basename_without_extension.split
    self.unique_label_indices_of_interest.append
    matplotlib.pyplot.axvline
    tensorflow.python.tools.inspect_checkpoint.print_tensors_in_checkpoint_file
    tensorflow.contrib.tensorboard.plugins.projector.ProjectorConfig
    tensorflow.stack
    pprint.pprint
    tensorflow.constant
    neuroner.utils.get_parameter_to_section_of_configparser
    self._parse_dataset
    pkg_resources.resource_isdir
    collections.OrderedDict
    token_dict.split
    argparse.ArgumentParser.print_help
    get_current_time_in_seconds
    sorted.append
    input_conll_filepath.utils.get_basename_without_extension.split
    annotation_filepath.codecs.open.close
    conll_to_brat
    numpy.histogram
    os.path.abspath
    dataset_type.character_indices_padded.append
    json.dump
    random.choice.split
    f.write
    neuroner.utils_nlp.is_token_in_pretrained_embeddings
    neuroner.utils.get_basename_without_extension
    config.sections
    p.vertices.mean
    time.strftime
    self.index_to_label.keys
    neuroner.utils_nlp.load_pretrained_token_embeddings
    token.split
    self.tokens_mapped_to_unk.append
    tensorflow.name_scope
    type
    os.makedirs
    codecs.open
    os.path.getsize
    pretraining_string_to_index.keys
    numpy.linspace
    str
    neuroner.brat_to_conll.check_brat_annotation_and_text_compatibility
    matplotlib.pyplot.gca.set_xticklabels
    sess.run
    matplotlib.pyplot.gca.text
    AssertionError
    matplotlib.pyplot.clf
    json.load
    line.str.replace
    tensorflow.argmax
    neuroner.evaluate.save_results
    output_conll_lines_with_bioes
    tensorflow.train.Saver.save
    neuroner.utils.reverse_dictionary.keys
    dataset_filepaths.get
    _get_default_param.items
    sorted.add
    os.path.isfile
    pickle.load
    dataset_filepaths.keys
    support.append
    line.split.replace
    split_lines.append
    all_y_true.extend
    shutil.rmtree
    matplotlib.pyplot.gcf.set_size_inches
    get_sentences_and_tokens_from_stanford.append
    neuroner.train.predict_labels
    unary_scores.tolist
    self.index_to_character.keys
    tensorflow.nn.embedding_lookup
    tensorflow.square
    matplotlib.pyplot.xlabel
    entity_lstm.EntityLSTM
    neuroner.evaluate.evaluate_model
    line.strip
    self.dataset_filepaths.update
    tensorflow.nn.tanh
    tokens.append
    sorted.sort
    tensorflow.expand_dims
    writers.add_summary
    tensorflow.zeros
    neuroner.train.prediction_step
    neuroner.neuromodel.fetch_model
    input_filepath.xml.etree.ElementTree.parse.getroot
    line.split.strip
    matplotlib.pyplot.gca.set_yticks
    matplotlib.pyplot.title
    vars
    matplotlib.pyplot.gca.invert_yaxis
    epoch_number.results.append
    kwargs.items
    self.character_indices.update
    labels_bioes.append
    neuroner.dataset.Dataset
    codecs.open.write
    argparse.ArgumentParser
    self.token_embedding_weights.assign
    self.token_embedding_weights.read_value
    neuroner.utils.pad_list
    cm2inch
    generate_reference_text_file_for_conll
    line.strip.replace
    writers.flush
    labels_bio.append
    line.strip.split.strip
    os.path.splitext
    neuroner.utils_nlp.bioes_to_bio
    matplotlib.pyplot.figure
    matplotlib.colors.ListedColormap
    file_obj.RenameUnpickler.load
    self.saver.restore
    sklearn.metrics.classification_report
    filepath.os.path.basename.replace
    datetime.datetime.now
    end_current_entity
    tensorflow.reduce_mean
    dataset_type.token_lengths.append
    map
    argparse.ArgumentParser.add_argument
    self.model.load_pretrained_token_embeddings
    remove_bio_from_label_name
    self.load_pretrained_token_embeddings
    l.rstrip.replace.replace.replace.strip.split
    tensorflow.summary.histogram
    get_entities_from_brat
    self._create_stats_graph_folder
    tensorflow.Variable
    self.sess.close
    self.optimizer.compute_gradients
    config.items
    round
    tag.get
    matplotlib.pyplot.ylabel
    self.index_to_token.keys
    self.load_embeddings_from_pretrained_model
    check_validity_of_conll_bioes
    tensorflow.summary.FileWriter
    get_sentences_and_tokens_from_stanford
    os.path.exists
    dataset_type.character_indices.append
    dataset_types.append
    os.path.join
    sys.exit
    plotMat.append
    document_count.str.zfill
    shutil.copytree
    os.system
    pycorenlp.StanfordCoreNLP
    token_dict.strip
    tensorflow.Session
    tensorflow.tile
    _get_default_param
    ax.yaxis.get_major_ticks
    _clean_param_dtypes.items
    bio_to_bioes
    sorted
    matplotlib.pyplot.bar
    neuroner.utils_plots.heatmap
    tensorflow.variable_scope
    labels.y_pred.y_true.sklearn.metrics.precision_recall_fscore_support.tolist
    print
    line.strip.split.pop
    tensorflow.reduce_max
    numpy.copy.flatten
    pc.update_scalarmappable
    matplotlib.pyplot.xlim
    output_file.write
    bidirectional_LSTM
    numpy.fill_diagonal
    IOError
    tensorflow.contrib.rnn.CoupledInputForgetGateLSTMCell
    tensorflow.clip_by_value
    neuroner.conll_to_brat.output_brat
    all_predictions.extend
    line.split
    matplotlib.pyplot.plot
    os.path.relpath
    tensorflow.assign
    open
    l.rstrip.replace.replace
    self.model.restore_from_pretrained_model
    self.modeldata.update_dataset
    sklearn.metrics.confusion_matrix.tolist
    collections.defaultdict.items
    self.dataset_brat_folders.update
    sorted.remove
    assess_model
    matplotlib.cm.get_cmap
    dataset_type.writers.close
    tensorflow.nn.bidirectional_dynamic_rnn
    len
    _get_config_param
    get_stanford_annotations
    shutil.copyfile
    embedding_weights.read_value
    neuroner.neuromodel.load_parameters
    xml_to_brat
    neuroner.utils_nlp.load_pretrained_token_embeddings.keys
    tensorflow.contrib.rnn.LSTMStateTuple
    tensorflow.contrib.layers.xavier_initializer
    plot_handles.append
    token.strip
    matplotlib.pyplot.gcf
    param_config.items
    matplotlib.pyplot.gca.barh
    neuroner.utils.convert_configparser_to_dictionary
    tensorflow.summary.scalar
    dataset_type.token_indices.append
    matplotlib.pyplot.subplots
    neuroner.brat_to_conll.get_entities_from_brat
    numpy.random.rand
    self.characters.update
    self._get_valid_dataset_filepaths
    tensorflow.contrib.crf.viterbi_decode
    prepare_pretrained_model_for_restoring
    os.listdir
    new_token_sequence.append
    range
    predictions.tolist.tolist
    neuroner.neuromodel.NeuroNER.close
    dataset_type.characters.append
    matplotlib.pyplot.grid
    _clean_param_dtypes
    pc.get_array
    tensorflow.global_variables_initializer
    dataset_type.label_indices.append
    neuroner.utils.convert_configparser_to_dictionary.items
    pretraining_dataset.label_to_index.keys
    self.define_training_procedure
    labels.append
    random.shuffle
    tensorflow.reduce_min
    pretraining_dataset.index_to_token.values
    cmap
    tensorflow.nn.xw_plus_b
    tensorflow.get_collection
    _fetch
    neuroner.utils.renamed_load
    numpy.copy
    entity.replace
    neuroner.neuromodel.NeuroNER.fit
    prediction_step
    configparser.ConfigParser.set
    trim_dataset_pickle
    matplotlib.pyplot.gca
    sklearn.preprocessing.normalize.flatten
    sklearn.metrics.f1_score
    tensorflow.train.Saver.restore
    matplotlib.pyplot.colorbar
    sklearn.preprocessing.LabelBinarizer.fit
    tensorflow.summary.merge_all
    pretraining_dataset.index_to_character.values
    trim_model_checkpoint
    self.character_indices_padded.update
    classification_report.split
    l.rstrip.replace.replace.replace.strip
    configparser.ConfigParser.read
    neuroner.utils_nlp.remove_bio_from_label_name
    dataset.token_to_index.keys
    neuroner.utils.reverse_dictionary
    os.path.dirname
    os.remove
    self._check_param_compatibility
    _clean_param_dtypes.update
    labels.copy
    show_values
    pc.get_facecolors
    self.label_vector_indices.update
    tensorflow.cast
    get_sentences_and_tokens_from_spacy
    random.choice
    neuroner.train.train_step
    operator.itemgetter
    parse_arguments.items
    tensorflow.get_variable
    dictionary.items
    self.prediction_count.str.zfill
    list
    sentence_tokens.append
    neuroner.evaluate.remap_labels
    int
    os.path.basename
    numpy.random.uniform
    tensorflow.contrib.tensorboard.plugins.projector.visualize_embeddings
    token.replace
    tensorflow.concat
    neuroner.utils_nlp.get_parsed_conll_output
    colors.append
    copy.copy
    pkg_resources.resource_filename
    classes.append
    neuroner.entity_lstm.EntityLSTM
    save_results
    tensorflow.train.GradientDescentOptimizer
    neuroner.utils_tf.resize_tensor_variable
    tensorflow.sqrt
    l.rstrip.replace.replace.replace
    classification_report.keys
    neuroner.neuromodel.NeuroNER
    infrequent_token_indices.append
    neuroner.utils_plots.plot_classification_report
    neuroner.utils_nlp.replace_unicode_whitespaces_with_ascii_whitespace
    dataset_type.label_vector_indices.append
    dataset_type.f1_dict_all.append
    epoch_number.str.zfill
    check_param_compatibility
    enumerate
    os.path.isdir
    new_label_sequence.append
    neuroner.utils.order_dictionary.get
    neuroner.utils.copytree
    neuroner.utils.order_dictionary
    super
    neuroner.brat_to_conll.brat_to_conll
    token.lower
    min
    tensorflow.nn.dropout
    class_names.append
    numpy.all
    argparse.ArgumentParser.parse_args
    str.lower
    self.label_indices.update
    embeddings_projector_config.embeddings.add
    collections.defaultdict.keys
    ax.xaxis.get_major_ticks
    sklearn.metrics.confusion_matrix
    conf_parameters.write
    spacy.load
    join
    input_filepath.xml.etree.ElementTree.parse.getroot.findall
    token.token.text.replace
    max
    self.modeldata.load_dataset
    spacy_nlp
    time.time
    tensorflow.train.AdadeltaOptimizer
    list.remove
    ax.pcolor.append
    load_parameters
    heatmap
    sklearn.preprocessing.LabelBinarizer.transform
    embedding_weights.assign
    tensorflow.ConfigProto
    self.token_indices.update
    self.sess.run
    cur_line.split.split
    matplotlib.use
    line.strip.replace.split
    matplotlib.pyplot.ylim
    tensorflow.train.Saver
    codecs.open.readline
    plot_f1_vs_epoch
    shutil.copy
    all
    self.unique_labels_of_interest.remove
    time.localtime
    cur_line.split.strip
    warnings.warn
    numpy.array
    re.sub
    tensorflow.variables_initializer
    f.read
    RenameUnpickler
    ValueError
    tensorflow.train.AdamOptimizer
    format
    set
    utils.create_folder_if_not_exists
    matplotlib.pyplot.gca.set_yticklabels
    random.randint
    neuroner.utils.get_current_time_in_miliseconds
    original_conll_file.readline.strip
    neuroner.utils.create_folder_if_not_exists
    pickle.dump
    collections.defaultdict
    neuroner.utils_nlp.convert_conll_from_bio_to_bioes
    self.__del__
    sklearn.preprocessing.normalize
    string.split
    sklearn.preprocessing.LabelBinarizer
    glob.glob
    get_cmap
    pretraining_dataset.label_to_index.copy
    pycorenlp.StanfordCoreNLP.annotate
    main
    tensorflow.equal
    output_entities
    parse_arguments
    token_dict.replace
    tensorflow.placeholder
    input_filepath.xml.etree.ElementTree.parse.getroot.findtext
    float
    result.update
    AUC.flatten.min
    get_start_and_end_offset_of_token_from_spacy
    self.unique_labels.append
    model.saver.save
    matplotlib.pyplot.legend
    writers.get_logdir
    f1.np.asarray.argmax
    dataset.__dict__.keys
    pc.get_paths
    l.rstrip.replace
    set.add
    dataset_type.result_update.update
    remap_labels
    abs
    index_to_string.items
    entities.append
    check_bio_bioes_compatibility
    tensorflow.nn.softmax_cross_entropy_with_logits
    line.strip.split.append
    tuple
    matplotlib.pyplot.gca.set_xticks
    output_filepaths.keys
    AUC.flatten.max
    self.RenameUnpickler.super.find_class
    xml.etree.ElementTree.parse
    sklearn.metrics.accuracy_score
    self._convert_to_indices
    numpy.asarray
    line.split.split
    self.token_lengths.update
    f.read.splitlines
    matplotlib.pyplot.axhline
    tensorflow.shape
    codecs.open.close
    l.rstrip
    dict
    bioes_to_bio
    matplotlib.pyplot.gca.pcolor
    warnings.filterwarnings
    neuroner.conll_to_brat.check_compatibility_between_conll_and_brat_text
    json.loads
    zip
    matplotlib.pyplot.close
    shutil.copy2
    self.sess.as_default
    neuroner.utils_tf.variable_summaries
    line.strip.split
    self.optimizer.apply_gradients
    neuroner.conll_to_brat.output_entities
    ax.xaxis.tick_top
    tensorflow.contrib.crf.crf_log_likelihood
    sklearn.metrics.precision_recall_fscore_support
    matplotlib.pyplot.savefig
    configparser.ConfigParser
    setuptools.setup
    results.keys
    tensorflow.squeeze
    

    @developer Could you please help me check this issue? May I submit a pull request to fix it? Thank you very much.

    opened by PyDeps 0
  • link from README.md to conll2003 dataset broken


    The link from README.md to https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/data/conll2003/en is not working.

    It is from this paragraph: "We provide several examples of datasets: data/conll2003/en: annotated dataset with the CoNLL-2003 format, containing 3 files (train.txt, valid.txt and test.txt)."

    I assume that this is not a problem per se, since the dataset is available at https://huggingface.co/datasets/conll2003. You may want to update the broken link though.

    opened by poedator 0
  • Bump numpy from 1.16.0 to 1.22.0


    Bumps numpy from 1.16.0 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.



    dependencies 
    opened by dependabot[bot] 1
  • Switch input/output folder vars in prepare_pretrained_model.py


    It seems like in https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/neuroner/prepare_pretrained_model.py#L105, input_model_folder should use model_name and output_model_folder should use output_folder_name.

    Do you concur?

    opened by matt-thomas 1
Releases: 1.0-dev2
Owner: Franck Dernoncourt