A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy.

Overview

Crosslingual Coreference

Coreference is amazing but the data required for training a model is very scarce. In our case, the available training for non-English languages also proved to be poorly annotated. Crosslingual Coreference, therefore, uses the assumption a trained model with English data and cross-lingual embeddings should work for languages with similar sentence structures.

Current Release Version pypi Version PyPi downloads Code style: black

Install

pip install crosslingual-coreference

Quickstart

from crosslingual_coreference import Predictor

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)

print(predictor.predict(text)["resolved_text"])
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.

Models

As of now, there are two models available "spanbert", "info_xlm", "xlm_roberta", "minilm", which scored 83, 77, 74 and 74 on OntoNotes Release 5.0 English data, respectively.

  • The "minilm" model is the best quality speed trade-off for both mult-lingual and english texts.
  • The "info_xlm" model produces the best quality for multi-lingual texts.
  • The AllenNLP "spanbert" model produces the best quality for english texts.

Chunking/batching to resolve memory OOM errors

from crosslingual_coreference import Predictor

predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)

Use spaCy pipeline

import spacy

import crosslingual_coreference

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.

More Examples

Comments
  • Which language model is using for minilm

    Which language model is using for minilm

    I am using the following code snippet for coreference resolution

    predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")
    

    While checking the below source code,

    "minilm": {
            "url": (
                "https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz"
            ),
            "f1_score_ontonotes": 74,
            "file_extension": ".tar.gz",
        },
    

    it seems that the language model using here is https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    Is this the same one that I can see in https://huggingface.co/models like https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384/tree/main or any other huggingface model?

    opened by pradeepdev-1995 7
  • Error when using coref as a spaCy pipeline

    Error when using coref as a spaCy pipeline

    Hi all, while trying to run a spacy test

    import spacy
    import crosslingual_coreference
    
    text = """
        Do not forget about Momofuku Ando!
        He created instant noodles in Osaka.
        At that location, Nissin was founded.
        Many students survived by eating these noodles, but they don't even know him."""
    
    # use any model that has internal spacy embeddings
    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe(
        "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0})
    
    doc = nlp(text)
    
    print(doc._.coref_clusters)
    print(doc._.resolved_text)
    

    I encountered the following issue:

    [nltk_data] Downloading package omw-1.4 to
    [nltk_data]     /home/user/nltk_data...
    [nltk_data]   Package omw-1.4 is already up-to-date!
    Traceback (most recent call last):
      File "/home/user/test_coref/test.py", line 12, in <module>
        nlp.add_pipe(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 792, in add_pipe
        pipe_component = self.create_pipe(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 674, in create_pipe
        resolved = registry.resolve(cfg, validate=validate)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 746, in resolve
        resolved, _ = cls._make(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 795, in _make
        filled, _, resolved = cls._fill(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 867, in _fill
        getter_result = getter(*args, **kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/__init__.py", line 33, in make_crosslingual_coreference
        return SpacyPredictor(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictorSpacy.py", line 18, in __init__
        super().__init__(language, device, model_name, chunk_size, chunk_overlap)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 55, in __init__
        self.set_coref_model()
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 85, in set_coref_model
        self.predictor = Predictor.from_path(self.filename, language=self.language, cuda_device=self.device)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/predictors/predictor.py", line 366, in from_path
        load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 232, in load_archive
        dataset_reader, validation_dataset_reader = _load_dataset_readers(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 268, in _load_dataset_readers
        dataset_reader = DatasetReader.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
        return retyped_subclass.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 636, in from_params
        kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 206, in create_kwargs
        constructed_arg = pop_and_construct_arg(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 314, in pop_and_construct_arg
        return construct_arg(class_name, name, popped_params, annotation, default, **extras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 394, in construct_arg
        value_dict[key] = construct_arg(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 348, in construct_arg
        result = annotation.from_params(params=popped_params, **subextras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
        return retyped_subclass.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 638, in from_params
        return constructor_to_call(**kwargs)  # type: ignore
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_mismatched_indexer.py", line 58, in __init__
        self._matched_indexer = PretrainedTransformerIndexer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 56, in __init__
        self._allennlp_tokenizer = PretrainedTransformerTokenizer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 72, in __init__
        self.tokenizer = cached_transformers.get_tokenizer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/cached_transformers.py", line 204, in get_tokenizer
        tokenizer = transformers.AutoTokenizer.from_pretrained(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 546, in from_pretrained
        return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
        return cls._from_pretrained(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1923, in _from_pretrained
        tokenizer = cls(*init_inputs, **init_kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 140, in __init__
        super().__init__(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 110, in __init__
        fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    Exception: EOF while parsing a list at line 1 column 4920583
    

    Here's what I have installed (pulled by poetry add crosslingual-coreference or pip install crosslingual-coreference):

    (.venv) [email protected]$ pip freeze
    aiohttp==3.8.1
    aiosignal==1.2.0
    allennlp==2.9.3
    allennlp-models==2.9.3
    async-timeout==4.0.2
    attrs==21.4.0
    base58==2.1.1
    blis==0.7.7
    boto3==1.23.5
    botocore==1.26.5
    cached-path==1.1.2
    cachetools==5.1.0
    catalogue==2.0.7
    certifi==2022.5.18.1
    charset-normalizer==2.0.12
    click==8.0.4
    conllu==4.4.1
    crosslingual-coreference==0.2.4
    cymem==2.0.6
    datasets==2.2.1
    dill==0.3.5.1
    docker-pycreds==0.4.0
    en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl
    en-core-web-trf @ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl
    fairscale==0.4.6
    filelock==3.6.0
    frozenlist==1.3.0
    fsspec==2022.5.0
    ftfy==6.1.1
    gitdb==4.0.9
    GitPython==3.1.27
    google-api-core==2.8.0
    google-auth==2.6.6
    google-cloud-core==2.3.0
    google-cloud-storage==2.3.0
    google-crc32c==1.3.0
    google-resumable-media==2.3.3
    googleapis-common-protos==1.56.1
    h5py==3.6.0
    huggingface-hub==0.5.1
    idna==3.3
    iniconfig==1.1.1
    Jinja2==3.1.2
    jmespath==1.0.0
    joblib==1.1.0
    jsonnet==0.18.0
    langcodes==3.3.0
    lmdb==1.3.0
    MarkupSafe==2.1.1
    more-itertools==8.13.0
    multidict==6.0.2
    multiprocess==0.70.12.2
    murmurhash==1.0.7
    nltk==3.7
    numpy==1.22.4
    packaging==21.3
    pandas==1.4.2
    pathtools==0.1.2
    pathy==0.6.1
    Pillow==9.1.1
    pluggy==1.0.0
    preshed==3.0.6
    promise==2.3
    protobuf==3.20.1
    psutil==5.9.1
    py==1.11.0
    py-rouge==1.1
    pyarrow==8.0.0
    pyasn1==0.4.8
    pyasn1-modules==0.2.8
    pydantic==1.8.2
    pyparsing==3.0.9
    pytest==7.1.2
    python-dateutil==2.8.2
    pytz==2022.1
    PyYAML==6.0
    regex==2022.4.24
    requests==2.27.1
    responses==0.18.0
    rsa==4.8
    s3transfer==0.5.2
    sacremoses==0.0.53
    scikit-learn==1.1.1
    scipy==1.6.1
    sentence-transformers==2.2.0
    sentencepiece==0.1.96
    sentry-sdk==1.5.12
    setproctitle==1.2.3
    shortuuid==1.0.9
    six==1.16.0
    smart-open==5.2.1
    smmap==5.0.0
    spacy==3.2.4
    spacy-alignments==0.8.5
    spacy-legacy==3.0.9
    spacy-loggers==1.0.2
    spacy-sentence-bert==0.1.2
    spacy-transformers==1.1.5
    srsly==2.4.3
    tensorboardX==2.5
    termcolor==1.1.0
    thinc==8.0.16
    threadpoolctl==3.1.0
    tokenizers==0.12.1
    tomli==2.0.1
    torch==1.10.2
    torchaudio==0.10.2
    torchvision==0.11.3
    tqdm==4.64.0
    transformers==4.17.0
    typer==0.4.1
    typing-extensions==4.2.0
    urllib3==1.26.9
    wandb==0.12.16
    wasabi==0.9.1
    wcwidth==0.2.5
    word2number==1.1
    xxhash==3.0.0
    yarl==1.7.2
    

    Do you have any recommendations? Is there an installation step missing?

    Thanks in advance!

    opened by alexander-belikov 4
  • Comparatively high initial prediction time for first predict() hit

    Comparatively high initial prediction time for first predict() hit

    I am using minilm model with language 'en_core_web_sm'. While comparing the prediction time, i.e., predictor.predict(text), the prediction time for first hit is always a bit high than the following hits. Suppose after creating a predictor object, I call predict as follows:

    predictor.predict(text) ---> first call predictor.predict(text) ---> second call predictor.predict(text) ---> third call

    Time taken for the first call is comparatively a bit higher(.2 sec) than the next prediction calls(.05 sec). Could you please help me understand why this initial hit takes a bit high prediction time?

    opened by nemeer 2
  • Why does this package need to install google cloud auth, storage, api etc?

    Why does this package need to install google cloud auth, storage, api etc?

    Hi,

    after installing the library I saw google-api-core-2.10.1 google-auth-2.12.0 google-cloud-core-2.3.2 google-cloud-storage-1.44.0 have been installed as well. In fact these packages can be found in the poetry.lock file.

    Is there a reason (I don't get) why this library needs these packages?

    Thanks

    opened by GiacomoCherry 1
  • HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    Python 3.8.13 Spacy - 3.1.0 en_core_web_sm-3.1.0 crosslingual_coreference - 0.2.8

    requests.exceptions.SSLError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

    opened by jscoder1009 1
  • Retrieving cluster heads without replacing corefs

    Retrieving cluster heads without replacing corefs

    I am interested in being able to extract the cluster heads with something like doc._.coref_cluster_heads to get the cluster heads without getting the reconstituted text. It could be a separate function that also acts as input into replace_corefs potentially.

    opened by MikeMikeMikeMike 1
  • [Errno 101] Network is unreachable

    [Errno 101] Network is unreachable

    Hello, when I try to run the code below

    predictor = Predictor(
        language="en_core_web_sm", device=1, model_name="info_xlm"
    )
    

    I get the following error:

    ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/infoxlm-base/cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff90cba1a00>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

    Is this url still valid and what should I use instead?

    opened by ttranslit 1
  • spaCy issues and suggestions

    spaCy issues and suggestions

    @martin-kirilov It might be worth looking into including batching + training a model for Spanish/Italian. See this issue from spaCy.

    • batching
    • empty cluster issue (resolved)
    • additional model pro-drop languages
    bug enhancement 
    opened by davidberenstein1957 0
  • feat: look into ONNX enhanched transformer embeddings

    feat: look into ONNX enhanched transformer embeddings

    Creating embeddings roughly takes 50% of the inference time. allennlp/modules/token_embedders/pretrained_transformer_embedder.py hold the logic for creating these embeddings. Make sure we can call them in a faster way.

    enhancement 
    opened by davidberenstein1957 3
Releases(0.2.9)
Owner
Pandora Intelligence
Pandora Intelligence is an independent intelligence company, specialized in security risks.
Pandora Intelligence
This repository contains all the source code that is needed for the project : An Efficient Pipeline For Bloom’s Taxonomy Using Natural Language Processing and Deep Learning

Pipeline For NLP with Bloom's Taxonomy Using Improved Question Classification and Question Generation using Deep Learning This repository contains all

Rohan Mathur 9 Jul 17, 2021
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS)

This repository is an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (SV2TTS) with a vocoder that works in real-time. Feel free to check my the

Corentin Jemine 38.5k Jan 03, 2023
Implementation of Multistream Transformers in Pytorch

Multistream Transformers Implementation of Multistream Transformers in Pytorch. This repository deviates slightly from the paper, where instead of usi

Phil Wang 47 Jul 26, 2022
This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest

Rachford-Rice Contest This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest. Can you solve the Rachford-Rice problem for all t

13 Sep 20, 2022
Random-Word-Generator - Generates meaningful words from dictionary with given no. of letters and words.

Random Word Generator Generates meaningful words from dictionary with given no. of letters and words. This might be useful for generating short links

Mohammed Rabil 1 Jan 01, 2022
NL. The natural language programming language.

NL A Natural-Language programming language. Built using Codex. A few examples are inside the nl_projects directory. How it works Write any code in pur

2 Jan 17, 2022
(ACL 2022) The source code for the paper "Towards Abstractive Grounded Summarization of Podcast Transcripts"

Towards Abstractive Grounded Summarization of Podcast Transcripts We provide the source code for the paper "Towards Abstractive Grounded Summarization

10 Jul 01, 2022
Python bot created with Selenium that can guess the daily Wordle word correct 96.8% of the time.

Wordle_Bot Python bot created with Selenium that can guess the daily Wordle word correct 96.8% of the time. It will log onto the wordle website and en

Lucas Polidori 15 Dec 11, 2022
This converter will create the exact measure for your cappuccino recipe from the grandiose Rafaella Ballerini!

About CappuccinoJs This converter will create the exact measure for your cappuccino recipe from the grandiose Rafaella Ballerini! Este conversor criar

Arthur Ottoni Ribeiro 48 Nov 15, 2022
Code for Editing Factual Knowledge in Language Models

KnowledgeEditor Code for Editing Factual Knowledge in Language Models (https://arxiv.org/abs/2104.08164). @inproceedings{decao2021editing, title={Ed

Nicola De Cao 86 Nov 28, 2022
PyTorch source code of NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models"

This repository contains source code for NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models" (P

Alexandra Chronopoulou 89 Aug 12, 2022
Augmenty is an augmentation library based on spaCy for augmenting texts.

Augmenty: The cherry on top of your NLP pipeline Augmenty is an augmentation library based on spaCy for augmenting texts. Besides a wide array of high

Kenneth Enevoldsen 124 Dec 29, 2022
OpenAI CLIP text encoders for multiple languages!

Multilingual-CLIP OpenAI CLIP text encoders for any language Colab Notebook · Pre-trained Models · Report Bug Overview OpenAI recently released the pa

Fredrik Carlsson 481 Dec 30, 2022
Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products

Contains analysis of trends from Fitbit Dataset (source: Kaggle) to see how the trends can be applied to Bellabeat customers and Bellabeat products.

Leah Pathan Khan 2 Jan 12, 2022
Bpe algorithm can finetune tokenizer - Bpe algorithm can finetune tokenizer

"# bpe_algorithm_can_finetune_tokenizer" this is an implyment for https://github

张博 1 Feb 02, 2022
Kinky furry assitant based on GPT2

KinkyFurs-V0 Kinky furry assistant based on GPT2 How to run python3 V0.py then, open web browser and go to localhost:8080 Requirements: Flask trans

Sparki 1 Jun 11, 2022
Correctly generate plurals, ordinals, indefinite articles; convert numbers to words

NAME inflect.py - Correctly generate plurals, singular nouns, ordinals, indefinite articles; convert numbers to words. SYNOPSIS import inflect p = in

Jason R. Coombs 762 Dec 29, 2022
CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus

CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus CVSS is a massively multilingual-to-English speech-to-speech translation corpus, co

Google Research Datasets 118 Jan 06, 2023
DeepPavlov Tutorials

DeepPavlov tutorials DeepPavlov: Sentence Classification with Word Embeddings DeepPavlov: Transfer Learning with BERT. Classification, Tagging, QA, Ze

Neural Networks and Deep Learning lab, MIPT 28 Sep 13, 2022
Pipeline for fast building text classification TF-IDF + LogReg baselines.

Text Classification Baseline Pipeline for fast building text classification TF-IDF + LogReg baselines. Usage Instead of writing custom code for specif

Dani El-Ayyass 57 Dec 07, 2022