A multilingual approach to AllenNLP coreference resolution, along with a wrapper for spaCy.

Overview

Crosslingual Coreference

Coreference resolution is amazing, but the data required for training a model is very scarce. In our case, the available training data for non-English languages also proved to be poorly annotated. Crosslingual Coreference therefore relies on the assumption that a model trained on English data, combined with cross-lingual embeddings, should work for languages with similar sentence structures.

Install

pip install crosslingual-coreference

Quickstart

from crosslingual_coreference import Predictor

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)

print(predictor.predict(text)["resolved_text"])
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.

Models

As of now, there are four models available, "spanbert", "info_xlm", "xlm_roberta" and "minilm", which scored 83, 77, 74 and 74 on the OntoNotes Release 5.0 English data, respectively. A short model-selection example follows the list below.

  • The "minilm" model is the best quality speed trade-off for both mult-lingual and english texts.
  • The "info_xlm" model produces the best quality for multi-lingual texts.
  • The AllenNLP "spanbert" model produces the best quality for english texts.

Chunking/batching to resolve out-of-memory (OOM) errors

from crosslingual_coreference import Predictor

predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)
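
The chunked predictor is used exactly like the regular one; a minimal sketch, assuming the splitting and re-merging of chunks is handled internally by predict:

# 'predictor' is the chunked Predictor constructed above.
# A long synthetic document; without chunking this could exhaust memory.
long_text = " ".join(
    ["Do not forget about Momofuku Ando! He created instant noodles in Osaka."] * 500
)

print(predictor.predict(long_text)["resolved_text"][:200])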

Use spaCy pipeline

import spacy

import crosslingual_coreference

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
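
The cluster indices above are token offsets into the spaCy Doc. A small sketch for turning them back into text, assuming the inclusive [start, end] span format shown in the output above:

# Each cluster is a list of [start, end] token indices (inclusive, judging by
# the output above); slicing the Doc recovers the mention texts.
for cluster in doc._.coref_clusters:
    mentions = [doc[start : end + 1].text for start, end in cluster]
    print(mentions)
# The first cluster, for instance, contains 'Momofuku Ando' and 'He'.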

More Examples

Comments
  • Which language model is used for minilm?

    I am using the following code snippet for coreference resolution

    predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")
    

    While checking the source code below,

    "minilm": {
            "url": (
                "https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz"
            ),
            "f1_score_ontonotes": 74,
            "file_extension": ".tar.gz",
        },
    

    it seems that the model being used here is https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    Is this the same as one of the models I can see on https://huggingface.co/models, like https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384/tree/main, or some other Hugging Face model?

    opened by pradeepdev-1995 7
  • Error when using coref as a spaCy pipeline

    Hi all, while trying to run a spacy test

    import spacy
    import crosslingual_coreference
    
    text = """
        Do not forget about Momofuku Ando!
        He created instant noodles in Osaka.
        At that location, Nissin was founded.
        Many students survived by eating these noodles, but they don't even know him."""
    
    # use any model that has internal spacy embeddings
    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe(
        "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0})
    
    doc = nlp(text)
    
    print(doc._.coref_clusters)
    print(doc._.resolved_text)
    

    I encountered the following issue:

    [nltk_data] Downloading package omw-1.4 to
    [nltk_data]     /home/user/nltk_data...
    [nltk_data]   Package omw-1.4 is already up-to-date!
    Traceback (most recent call last):
      File "/home/user/test_coref/test.py", line 12, in <module>
        nlp.add_pipe(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 792, in add_pipe
        pipe_component = self.create_pipe(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 674, in create_pipe
        resolved = registry.resolve(cfg, validate=validate)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 746, in resolve
        resolved, _ = cls._make(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 795, in _make
        filled, _, resolved = cls._fill(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 867, in _fill
        getter_result = getter(*args, **kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/__init__.py", line 33, in make_crosslingual_coreference
        return SpacyPredictor(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictorSpacy.py", line 18, in __init__
        super().__init__(language, device, model_name, chunk_size, chunk_overlap)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 55, in __init__
        self.set_coref_model()
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 85, in set_coref_model
        self.predictor = Predictor.from_path(self.filename, language=self.language, cuda_device=self.device)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/predictors/predictor.py", line 366, in from_path
        load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 232, in load_archive
        dataset_reader, validation_dataset_reader = _load_dataset_readers(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 268, in _load_dataset_readers
        dataset_reader = DatasetReader.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
        return retyped_subclass.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 636, in from_params
        kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 206, in create_kwargs
        constructed_arg = pop_and_construct_arg(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 314, in pop_and_construct_arg
        return construct_arg(class_name, name, popped_params, annotation, default, **extras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 394, in construct_arg
        value_dict[key] = construct_arg(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 348, in construct_arg
        result = annotation.from_params(params=popped_params, **subextras)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
        return retyped_subclass.from_params(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 638, in from_params
        return constructor_to_call(**kwargs)  # type: ignore
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_mismatched_indexer.py", line 58, in __init__
        self._matched_indexer = PretrainedTransformerIndexer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 56, in __init__
        self._allennlp_tokenizer = PretrainedTransformerTokenizer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 72, in __init__
        self.tokenizer = cached_transformers.get_tokenizer(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/cached_transformers.py", line 204, in get_tokenizer
        tokenizer = transformers.AutoTokenizer.from_pretrained(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 546, in from_pretrained
        return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
        return cls._from_pretrained(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1923, in _from_pretrained
        tokenizer = cls(*init_inputs, **init_kwargs)
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 140, in __init__
        super().__init__(
      File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 110, in __init__
        fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    Exception: EOF while parsing a list at line 1 column 4920583
    

    Here's what I have installed (pulled by poetry add crosslingual-coreference or pip install crosslingual-coreference):

    (.venv) $ pip freeze
    aiohttp==3.8.1
    aiosignal==1.2.0
    allennlp==2.9.3
    allennlp-models==2.9.3
    async-timeout==4.0.2
    attrs==21.4.0
    base58==2.1.1
    blis==0.7.7
    boto3==1.23.5
    botocore==1.26.5
    cached-path==1.1.2
    cachetools==5.1.0
    catalogue==2.0.7
    certifi==2022.5.18.1
    charset-normalizer==2.0.12
    click==8.0.4
    conllu==4.4.1
    crosslingual-coreference==0.2.4
    cymem==2.0.6
    datasets==2.2.1
    dill==0.3.5.1
    docker-pycreds==0.4.0
    en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl
    en-core-web-trf @ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl
    fairscale==0.4.6
    filelock==3.6.0
    frozenlist==1.3.0
    fsspec==2022.5.0
    ftfy==6.1.1
    gitdb==4.0.9
    GitPython==3.1.27
    google-api-core==2.8.0
    google-auth==2.6.6
    google-cloud-core==2.3.0
    google-cloud-storage==2.3.0
    google-crc32c==1.3.0
    google-resumable-media==2.3.3
    googleapis-common-protos==1.56.1
    h5py==3.6.0
    huggingface-hub==0.5.1
    idna==3.3
    iniconfig==1.1.1
    Jinja2==3.1.2
    jmespath==1.0.0
    joblib==1.1.0
    jsonnet==0.18.0
    langcodes==3.3.0
    lmdb==1.3.0
    MarkupSafe==2.1.1
    more-itertools==8.13.0
    multidict==6.0.2
    multiprocess==0.70.12.2
    murmurhash==1.0.7
    nltk==3.7
    numpy==1.22.4
    packaging==21.3
    pandas==1.4.2
    pathtools==0.1.2
    pathy==0.6.1
    Pillow==9.1.1
    pluggy==1.0.0
    preshed==3.0.6
    promise==2.3
    protobuf==3.20.1
    psutil==5.9.1
    py==1.11.0
    py-rouge==1.1
    pyarrow==8.0.0
    pyasn1==0.4.8
    pyasn1-modules==0.2.8
    pydantic==1.8.2
    pyparsing==3.0.9
    pytest==7.1.2
    python-dateutil==2.8.2
    pytz==2022.1
    PyYAML==6.0
    regex==2022.4.24
    requests==2.27.1
    responses==0.18.0
    rsa==4.8
    s3transfer==0.5.2
    sacremoses==0.0.53
    scikit-learn==1.1.1
    scipy==1.6.1
    sentence-transformers==2.2.0
    sentencepiece==0.1.96
    sentry-sdk==1.5.12
    setproctitle==1.2.3
    shortuuid==1.0.9
    six==1.16.0
    smart-open==5.2.1
    smmap==5.0.0
    spacy==3.2.4
    spacy-alignments==0.8.5
    spacy-legacy==3.0.9
    spacy-loggers==1.0.2
    spacy-sentence-bert==0.1.2
    spacy-transformers==1.1.5
    srsly==2.4.3
    tensorboardX==2.5
    termcolor==1.1.0
    thinc==8.0.16
    threadpoolctl==3.1.0
    tokenizers==0.12.1
    tomli==2.0.1
    torch==1.10.2
    torchaudio==0.10.2
    torchvision==0.11.3
    tqdm==4.64.0
    transformers==4.17.0
    typer==0.4.1
    typing-extensions==4.2.0
    urllib3==1.26.9
    wandb==0.12.16
    wasabi==0.9.1
    wcwidth==0.2.5
    word2number==1.1
    xxhash==3.0.0
    yarl==1.7.2
    

    Do you have any recommendations? Is there an installation step missing?

    Thanks in advance!

    opened by alexander-belikov 4
  • Comparatively high initial prediction time for first predict() hit

    I am using the minilm model with language 'en_core_web_sm'. When comparing prediction times for predictor.predict(text), the first call always takes noticeably longer than the following calls. Suppose that, after creating a predictor object, I call predict as follows:

    predictor.predict(text)  ---> first call
    predictor.predict(text)  ---> second call
    predictor.predict(text)  ---> third call

    The time taken for the first call is noticeably higher (~0.2 s) than for the next prediction calls (~0.05 s). Could you please help me understand why this initial call takes longer?
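
    A common workaround (generic Python/PyTorch practice, not specific to this library) is to issue a throwaway warm-up call right after constructing the predictor, so that one-off costs are paid before latency matters. A minimal sketch:

    import time

    from crosslingual_coreference import Predictor

    predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")

    # Throwaway warm-up call: absorbs one-off costs such as lazy initialisation
    # and internal caching before any latency-sensitive requests are timed.
    predictor.predict("This is a short warm-up sentence.")

    text = "Momofuku Ando created instant noodles in Osaka. He founded Nissin there."
    start = time.perf_counter()
    predictor.predict(text)
    print(f"steady-state latency: {time.perf_counter() - start:.3f}s")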

    opened by nemeer 2
  • Why does this package need to install google cloud auth, storage, api etc?

    Hi,

    after installing the library I saw that google-api-core-2.10.1, google-auth-2.12.0, google-cloud-core-2.3.2 and google-cloud-storage-1.44.0 had been installed as well. In fact, these packages can be found in the poetry.lock file.

    Is there a reason (that I'm missing) why this library needs these packages?

    Thanks

    opened by GiacomoCherry 1
  • HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

    Python 3.8.13, spaCy 3.1.0, en_core_web_sm 3.1.0, crosslingual_coreference 0.2.8

    requests.exceptions.SSLError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))
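
    One hedged workaround (generic to requests-based downloads, not specific to this library): if the local certificate store is the problem, pointing the REQUESTS_CA_BUNDLE environment variable at certifi's bundle before constructing the predictor sometimes resolves CERTIFICATE_VERIFY_FAILED errors:

    import os

    import certifi
    from crosslingual_coreference import Predictor

    # Use certifi's CA bundle for TLS verification of the model download.
    os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()

    predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")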

    opened by jscoder1009 1
  • Retrieving cluster heads without replacing corefs

    I am interested in being able to extract the cluster heads with something like doc._.coref_cluster_heads, i.e. getting the cluster heads without producing the resolved text. It could potentially also be a separate function whose output feeds into replace_corefs.
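
    In the meantime, a rough sketch of the idea, using the existing doc._.coref_clusters attribute and assuming the inclusive [start, end] token format shown earlier (coref_cluster_heads here is a hypothetical helper, not part of the library; it simply treats the first mention of each cluster as its head):

    def coref_cluster_heads(doc):
        # Hypothetical helper: take the first mention of each cluster as its
        # "head" and return the corresponding spaCy spans.
        heads = []
        for cluster in doc._.coref_clusters:
            start, end = cluster[0]              # first mention of the cluster
            heads.append(doc[start : end + 1])   # indices are inclusive
        return heads

    # 'doc' is a Doc produced by the xx_coref pipeline shown above.
    print([span.text for span in coref_cluster_heads(doc)])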

    opened by MikeMikeMikeMike 1
  • [Errno 101] Network is unreachable

    Hello, when I try to run the code below

    predictor = Predictor(
        language="en_core_web_sm", device=1, model_name="info_xlm"
    )
    

    I get the following error:

    ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/infoxlm-base/cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff90cba1a00>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

    Is this URL still valid, and what should I use instead?

    opened by ttranslit 1
  • spaCy issues and suggestions

    @martin-kirilov It might be worth looking into including batching + training a model for Spanish/Italian. See this issue from spaCy.

    • batching
    • empty cluster issue (resolved)
    • additional model for pro-drop languages
    bug enhancement 
    opened by davidberenstein1957 0
  • feat: look into ONNX-enhanced transformer embeddings

    Creating embeddings takes roughly 50% of the inference time. allennlp/modules/token_embedders/pretrained_transformer_embedder.py holds the logic for creating these embeddings. Make sure we can call them in a faster way.

    enhancement 
    opened by davidberenstein1957 3
Releases (0.2.9)

Owner

Pandora Intelligence

Pandora Intelligence is an independent intelligence company, specializing in security risks.