A multi-lingual approach to AllenNLP CoReference Resolution along with a wrapper for spaCy.

Last update: Jan 04, 2023

Overview

Crosslingual Coreference

Coreference is amazing, but the data required for training a model is very scarce. In our case, the available training data for non-English languages also proved to be poorly annotated. Crosslingual Coreference therefore relies on the assumption that a model trained on English data with cross-lingual embeddings should also work for languages with similar sentence structures.

Install

pip install crosslingual-coreference

Quickstart

from crosslingual_coreference import Predictor

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)

print(predictor.predict(text)["resolved_text"])
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.

Models

As of now, there are four models available: "spanbert", "info_xlm", "xlm_roberta" and "minilm", which scored 83, 77, 74 and 74 on OntoNotes Release 5.0 English data, respectively.

The "minilm" model offers the best quality/speed trade-off for both multi-lingual and English texts.
The "info_xlm" model produces the best quality for multi-lingual texts.
The AllenNLP "spanbert" model produces the best quality for English texts.
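
The guidance above can be captured in a small lookup. This is only an illustrative sketch: `choose_model` and its parameters are hypothetical names that are not part of the package, and the scores are simply the OntoNotes figures quoted above.

```python
# OntoNotes Release 5.0 (English) scores as quoted in this README.
F1_ONTONOTES = {"spanbert": 83, "info_xlm": 77, "xlm_roberta": 74, "minilm": 74}


def choose_model(multilingual: bool, prefer_speed: bool = False) -> str:
    """Hypothetical helper: pick a model_name following the README's guidance."""
    if prefer_speed:
        # "minilm" is described as the best quality/speed trade-off,
        # for multi-lingual and English texts alike.
        return "minilm"
    # Otherwise favour quality: "info_xlm" for multi-lingual texts,
    # "spanbert" for English texts.
    return "info_xlm" if multilingual else "spanbert"


model_name = choose_model(multilingual=True)
print(model_name, F1_ONTONOTES[model_name])
```

The returned string is meant to be passed as the `model_name` argument of `Predictor`.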

Chunking/batching to resolve memory OOM errors

from crosslingual_coreference import Predictor

predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)
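
This README does not say how `chunk_size` and `chunk_overlap` are interpreted. The sketch below assumes that `chunk_size` is a character budget per chunk and `chunk_overlap` is a number of sentences shared between neighbouring chunks. The helper `chunk_sentences` is hypothetical and only illustrates the general idea of overlapping chunks; it is not the package's implementation.

```python
from typing import List


def chunk_sentences(
    sentences: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Group sentences into chunks of roughly chunk_size characters.

    Each new chunk starts with the last `chunk_overlap` sentences of the
    previous chunk, so mentions near a boundary keep some context.
    A single sentence longer than chunk_size still forms its own chunk.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    for sentence in sentences:
        current_len = sum(len(s) for s in current)
        if current and current_len + len(sentence) > chunk_size:
            chunks.append(current)
            # Carry the trailing sentences over as overlap.
            current = current[-chunk_overlap:] if chunk_overlap > 0 else []
        current.append(sentence)
    if current:
        chunks.append(current)
    return chunks


demo = ["aaaa", "bbbb", "cccc", "dddd"]
for chunk in chunk_sentences(demo, chunk_size=10, chunk_overlap=1):
    print(chunk)
```

Under these assumptions, a smaller `chunk_size` lowers peak memory per chunk, while a larger `chunk_overlap` keeps more context across boundaries at the cost of processing some sentences twice.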

Use spaCy pipeline

import spacy

import crosslingual_coreference

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
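
The cluster output above is a list of `[start, end]` token spans, and comparing it with the text suggests that `end` is inclusive (for example, `[4, 5]` covers "Momofuku Ando"). The sketch below turns those spans back into mention strings. The token list is written out by hand under the assumption that spaCy splits punctuation and "n't" into separate tokens, and `cluster_mentions` is a hypothetical helper, not part of the package.

```python
from typing import List

# Tokens written out by hand, assuming spaCy's English tokenization of the
# example text (punctuation and "n't" become separate tokens).
tokens = (
    "Do not forget about Momofuku Ando ! He created instant noodles in Osaka . "
    "At that location , Nissin was founded . Many students survived by eating "
    "these noodles , but they do n't even know him ."
).split()

# Cluster output copied from the example above.
clusters = [
    [[4, 5], [7, 7], [27, 27], [36, 36]],
    [[12, 12], [15, 16]],
    [[9, 10], [27, 28]],
    [[22, 23], [31, 31]],
]


def cluster_mentions(
    tokens: List[str], clusters: List[List[List[int]]]
) -> List[List[str]]:
    """Turn [start, end] token spans (end inclusive) into mention strings."""
    return [
        [" ".join(tokens[start : end + 1]) for start, end in cluster]
        for cluster in clusters
    ]


for mentions in cluster_mentions(tokens, clusters):
    print(mentions)
```

With a real spaCy `Doc`, the same lookup would be `doc[start : end + 1].text` for each span.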

More Examples

Comments

Which language model is used for minilm?
I am using the following code snippet for coreference resolution

predictor = Predictor(language="en_core_web_sm", device=-1, model_name="minilm")

While checking the below source code,

"minilm": {
    "url": (
        "https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz"
    ),
    "f1_score_ontonotes": 74,
    "file_extension": ".tar.gz",
},

it seems that the language model being used here is https://storage.googleapis.com/pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

Is this the same one that I can see in https://huggingface.co/models like https://huggingface.co/microsoft/Multilingual-MiniLM-L12-H384/tree/main or any other huggingface model?
opened by pradeepdev-1995 7

Error when using coref as a spaCy pipeline

Hi all, while trying to run a spacy test

import spacy
import crosslingual_coreference

text = """
    Do not forget about Momofuku Ando!
    He created instant noodles in Osaka.
    At that location, Nissin was founded.
    Many students survived by eating these noodles, but they don't even know him."""

# use any model that has internal spacy embeddings
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0})

doc = nlp(text)

print(doc._.coref_clusters)
print(doc._.resolved_text)

I encountered the following issue:

[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
Traceback (most recent call last):
  File "/home/user/test_coref/test.py", line 12, in <module>
    nlp.add_pipe(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 792, in add_pipe
    pipe_component = self.create_pipe(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/spacy/language.py", line 674, in create_pipe
    resolved = registry.resolve(cfg, validate=validate)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 746, in resolve
    resolved, _ = cls._make(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 795, in _make
    filled, _, resolved = cls._fill(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/thinc/config.py", line 867, in _fill
    getter_result = getter(*args, **kwargs)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/__init__.py", line 33, in make_crosslingual_coreference
    return SpacyPredictor(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictorSpacy.py", line 18, in __init__
    super().__init__(language, device, model_name, chunk_size, chunk_overlap)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 55, in __init__
    self.set_coref_model()
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/crosslingual_coreference/CrossLingualPredictor.py", line 85, in set_coref_model
    self.predictor = Predictor.from_path(self.filename, language=self.language, cuda_device=self.device)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/predictors/predictor.py", line 366, in from_path
    load_archive(archive_path, cuda_device=cuda_device, overrides=overrides),
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 232, in load_archive
    dataset_reader, validation_dataset_reader = _load_dataset_readers(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/models/archival.py", line 268, in _load_dataset_readers
    dataset_reader = DatasetReader.from_params(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
    return retyped_subclass.from_params(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 636, in from_params
    kwargs = create_kwargs(constructor_to_inspect, cls, params, **extras)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 206, in create_kwargs
    constructed_arg = pop_and_construct_arg(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 314, in pop_and_construct_arg
    return construct_arg(class_name, name, popped_params, annotation, default, **extras)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 394, in construct_arg
    value_dict[key] = construct_arg(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 348, in construct_arg
    result = annotation.from_params(params=popped_params, **subextras)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 604, in from_params
    return retyped_subclass.from_params(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/from_params.py", line 638, in from_params
    return constructor_to_call(**kwargs)  # type: ignore
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_mismatched_indexer.py", line 58, in __init__
    self._matched_indexer = PretrainedTransformerIndexer(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/token_indexers/pretrained_transformer_indexer.py", line 56, in __init__
    self._allennlp_tokenizer = PretrainedTransformerTokenizer(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/data/tokenizers/pretrained_transformer_tokenizer.py", line 72, in __init__
    self.tokenizer = cached_transformers.get_tokenizer(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/allennlp/common/cached_transformers.py", line 204, in get_tokenizer
    tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 546, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
    return cls._from_pretrained(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1923, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta_fast.py", line 140, in __init__
    super().__init__(
  File "/home/user/test_coref/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 110, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: EOF while parsing a list at line 1 column 4920583

Here's what I have installed (pulled by poetry add crosslingual-coreference or pip install crosslingual-coreference):

(.venv) $ pip freeze
aiohttp==3.8.1
aiosignal==1.2.0
allennlp==2.9.3
allennlp-models==2.9.3
async-timeout==4.0.2
attrs==21.4.0
base58==2.1.1
blis==0.7.7
boto3==1.23.5
botocore==1.26.5
cached-path==1.1.2
cachetools==5.1.0
catalogue==2.0.7
certifi==2022.5.18.1
charset-normalizer==2.0.12
click==8.0.4
conllu==4.4.1
crosslingual-coreference==0.2.4
cymem==2.0.6
datasets==2.2.1
dill==0.3.5.1
docker-pycreds==0.4.0
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl
en-core-web-trf @ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl
fairscale==0.4.6
filelock==3.6.0
frozenlist==1.3.0
fsspec==2022.5.0
ftfy==6.1.1
gitdb==4.0.9
GitPython==3.1.27
google-api-core==2.8.0
google-auth==2.6.6
google-cloud-core==2.3.0
google-cloud-storage==2.3.0
google-crc32c==1.3.0
google-resumable-media==2.3.3
googleapis-common-protos==1.56.1
h5py==3.6.0
huggingface-hub==0.5.1
idna==3.3
iniconfig==1.1.1
Jinja2==3.1.2
jmespath==1.0.0
joblib==1.1.0
jsonnet==0.18.0
langcodes==3.3.0
lmdb==1.3.0
MarkupSafe==2.1.1
more-itertools==8.13.0
multidict==6.0.2
multiprocess==0.70.12.2
murmurhash==1.0.7
nltk==3.7
numpy==1.22.4
packaging==21.3
pandas==1.4.2
pathtools==0.1.2
pathy==0.6.1
Pillow==9.1.1
pluggy==1.0.0
preshed==3.0.6
promise==2.3
protobuf==3.20.1
psutil==5.9.1
py==1.11.0
py-rouge==1.1
pyarrow==8.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pydantic==1.8.2
pyparsing==3.0.9
pytest==7.1.2
python-dateutil==2.8.2
pytz==2022.1
PyYAML==6.0
regex==2022.4.24
requests==2.27.1
responses==0.18.0
rsa==4.8
s3transfer==0.5.2
sacremoses==0.0.53
scikit-learn==1.1.1
scipy==1.6.1
sentence-transformers==2.2.0
sentencepiece==0.1.96
sentry-sdk==1.5.12
setproctitle==1.2.3
shortuuid==1.0.9
six==1.16.0
smart-open==5.2.1
smmap==5.0.0
spacy==3.2.4
spacy-alignments==0.8.5
spacy-legacy==3.0.9
spacy-loggers==1.0.2
spacy-sentence-bert==0.1.2
spacy-transformers==1.1.5
srsly==2.4.3
tensorboardX==2.5
termcolor==1.1.0
thinc==8.0.16
threadpoolctl==3.1.0
tokenizers==0.12.1
tomli==2.0.1
torch==1.10.2
torchaudio==0.10.2
torchvision==0.11.3
tqdm==4.64.0
transformers==4.17.0
typer==0.4.1
typing-extensions==4.2.0
urllib3==1.26.9
wandb==0.12.16
wasabi==0.9.1
wcwidth==0.2.5
word2number==1.1
xxhash==3.0.0
yarl==1.7.2

Do you have any recommendations? Is there an installation step missing?

Thanks in advance!

opened by alexander-belikov 4

Comparatively high initial prediction time for first predict() hit

I am using the minilm model with language 'en_core_web_sm'. When comparing prediction times for predictor.predict(text), the first call is always a bit slower than the following ones. Suppose that, after creating a predictor object, I call predict as follows:

predictor.predict(text) ---> first call
predictor.predict(text) ---> second call
predictor.predict(text) ---> third call

The time taken for the first call is comparatively higher (.2 sec) than for the next prediction calls (.05 sec). Could you please help me understand why this initial call takes longer?

opened by nemeer 2

Why does this package need to install google cloud auth, storage, api etc?

Hi,

after installing the library I saw google-api-core-2.10.1 google-auth-2.12.0 google-cloud-core-2.3.2 google-cloud-storage-1.44.0 have been installed as well. In fact these packages can be found in the poetry.lock file.

Is there a reason (I don't get) why this library needs these packages?

Thanks

opened by GiacomoCherry 1

HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz

Python 3.8.13
Spacy - 3.1.0
en_core_web_sm-3.1.0
crosslingual_coreference - 0.2.8

requests.exceptions.SSLError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /pandora-intelligence/models/crosslingual-coreference/minilm/model.tar.gz (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1131)')))

opened by jscoder1009 1

Retrieving cluster heads without replacing corefs

I am interested in being able to extract the cluster heads with something like doc._.coref_cluster_heads to get the cluster heads without getting the reconstituted text. It could be a separate function that also acts as input into replace_corefs potentially.

opened by MikeMikeMikeMike 1

[Errno 101] Network is unreachable

Hello, when I try to run the code below

predictor = Predictor(language="en_core_web_sm", device=1, model_name="info_xlm")

I get the following error:

ConnectionError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /microsoft/infoxlm-base/cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff90cba1a00>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

Is this url still valid and what should I use instead?
opened by ttranslit 1

spaCy issues and suggestions

@martin-kirilov It might be worth looking into including batching + training a model for Spanish/Italian. See this issue from spaCy.

batching

empty cluster issue (resolved)

additional model for pro-drop languages

bug enhancement
opened by davidberenstein1957 0

feat: look into ONNX enhanced transformer embeddings

Creating embeddings takes roughly 50% of the inference time. allennlp/modules/token_embedders/pretrained_transformer_embedder.py holds the logic for creating these embeddings. Make sure we can call them in a faster way.
enhancement

opened by davidberenstein1957 3

Releases (0.2.9)

0.2.9 (Sep 24, 2022)

0.2.6 (Jun 8, 2022)

0.2.5 (May 25, 2022)

Added cluster_heads to the implementation, available via doc._.cluster_heads and predict['cluster_heads'].

0.2.4 (May 10, 2022)

0.2.3 (May 5, 2022)

3x speedup combined with minilm vs. the initial 'default' info_xlm model.

0.2.2 (May 5, 2022)

0.2.1 (Apr 13, 2022)

check new automated release

v0.2.0 (Apr 3, 2022)

#3

v.0.1.5 (Mar 31, 2022)

Without finding clusters, the package would just return the text. Now it returns the prediction.

v0.1.4 (Mar 30, 2022)

Improve the dependencies to resolve specific issues with Typer v0.4.0 and Click v8.1.0 and a combination of AllenNLP and spaCy versions.

v0.1.3 (Mar 28, 2022)

The initial version of crosslingual-coreference.

Owner

Pandora Intelligence

Pandora Intelligence is an independent intelligence company, specialized in security risks.
