Official Stanford NLP Python Library for Many Human Languages

Overview

Stanza: A Python NLP Library for Many Human Languages

The Stanford NLP Group's official Python NLP library. It contains support for running various accurate natural language processing tools on 60+ languages and for accessing the Java Stanford CoreNLP software from Python. For detailed information please visit our official website.

🔥  A new collection of biomedical and clinical English model packages is now available, offering a seamless experience for syntactic analysis and named entity recognition (NER) on biomedical literature text and clinical notes. For more information, check out our Biomedical models documentation page.

References

If you use this library in your research, please cite our ACL 2020 Stanza system demo paper:

@inproceedings{qi2020stanza,
    title={Stanza: A {Python} Natural Language Processing Toolkit for Many Human Languages},
    author={Qi, Peng and Zhang, Yuhao and Zhang, Yuhui and Bolton, Jason and Manning, Christopher D.},
    booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
    year={2020}
}

If you use our biomedical and clinical models, please also cite our Stanza Biomedical Models description paper:

@article{zhang2020biomedical,
  title={Biomedical and Clinical English Model Packages in the Stanza Python NLP Library},
  author={Zhang, Yuhao and Zhang, Yuhui and Qi, Peng and Manning, Christopher D. and Langlotz, Curtis P.},
  journal={arXiv preprint arXiv:2007.14640},
  year={2020}
}

The PyTorch implementation of the neural pipeline in this repository is due to Peng Qi, Yuhao Zhang, and Yuhui Zhang, with help from Jason Bolton, Tim Dozat and John Bauer. Maintenance of this repo is currently led by John Bauer.

If you use the CoreNLP software through Stanza, please cite the CoreNLP software package and the respective modules as described here ("Citing Stanford CoreNLP in papers"). The CoreNLP client is mostly written by Arun Chaganty, and Jason Bolton spearheaded merging the two projects together.

Issues and Usage Q&A

To ask questions, report issues or request features 🤔 , please use the GitHub Issue Tracker. Before creating a new issue, please make sure to search for existing issues that may solve your problem, or visit the Frequently Asked Questions (FAQ) page on our website.

Contributing to Stanza

We welcome community contributions to Stanza in the form of bugfixes 🛠️ and enhancements 💡 ! If you want to contribute, please first read our contribution guidelines.

Installation

pip

Stanza supports Python 3.6 or later. We recommend that you install Stanza via pip, the Python package manager. To install, simply run:

pip install stanza

This will also resolve all of Stanza's dependencies, for instance PyTorch 1.3.0 or above.

If you have a previous version of stanza installed, upgrade with:

pip install stanza -U

Anaconda

To install Stanza via Anaconda, use the following conda command:

conda install -c stanfordnlp stanza

Note that installing Stanza via Anaconda does not currently work for Python 3.8; for Python 3.8, please install via pip.

From Source

Alternatively, you can install from the source of this git repository, which gives you more flexibility when developing on top of Stanza. For this option, run

git clone https://github.com/stanfordnlp/stanza.git
cd stanza
pip install -e .

Running Stanza

Getting Started with the neural pipeline

To run your first Stanza pipeline, simply follow these steps in your Python interactive interpreter:

>>> import stanza
>>> stanza.download('en')       # This downloads the English models for the neural pipeline
>>> nlp = stanza.Pipeline('en') # This sets up a default neural pipeline in English
>>> doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
>>> doc.sentences[0].print_dependencies()

The last command will print the words of the first sentence in the input string (or Document, as it is represented in Stanza), along with, for each word, the index of the word that governs it in the Universal Dependencies parse of that sentence (its "head") and the dependency relation between the two words. The output should look like:

('Barack', '4', 'nsubj:pass')
('Obama', '1', 'flat')
('was', '4', 'aux:pass')
('born', '0', 'root')
('in', '6', 'case')
('Hawaii', '4', 'obl')
('.', '4', 'punct')
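The triples above can be read back directly: each is (word, head index, dependency relation), where indices are 1-based word positions and a head of 0 marks the root. A minimal sketch, using the printed output rather than the Stanza API:

```python
# Each printed triple is (word, head_index, deprel); word positions are
# 1-based, and head 0 marks the root of the sentence.
parse = [
    ('Barack', 4, 'nsubj:pass'), ('Obama', 1, 'flat'), ('was', 4, 'aux:pass'),
    ('born', 0, 'root'), ('in', 6, 'case'), ('Hawaii', 4, 'obl'), ('.', 4, 'punct'),
]

root = next(word for word, head, _ in parse if head == 0)    # 'born'
dependents = [word for word, head, _ in parse if head == 4]  # governed by word 4
print(root, dependents)
```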

See our getting started guide for more details.

Accessing Java Stanford CoreNLP software

Aside from the neural pipeline, this package also includes an official wrapper for accessing the Java Stanford CoreNLP software with Python code.

There are a few initial setup steps.

  • Download Stanford CoreNLP and models for the language you wish to use
  • Put the model jars in the distribution folder
  • Tell the Python code where Stanford CoreNLP is located by setting the CORENLP_HOME environment variable (e.g., in *nix): export CORENLP_HOME=/path/to/stanford-corenlp-4.1.0

We provide comprehensive examples in our documentation that show how one can use CoreNLP through Stanza and extract various annotations from it.
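As a minimal sketch of the setup steps above (the path is a placeholder assumption, and the client call is shown commented out since it requires a local Java CoreNLP install):

```python
import os

# Equivalent to `export CORENLP_HOME=...` in the shell; the path below is a
# placeholder, point it at your own CoreNLP distribution folder.
os.environ.setdefault("CORENLP_HOME", "/path/to/stanford-corenlp-4.1.0")

# With CORENLP_HOME set, the client can then be used, e.g.:
#   from stanza.server import CoreNLPClient
#   with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos']) as client:
#       ann = client.annotate("Stanza can call CoreNLP.")
```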

Online Colab Notebooks

To get you started, we also provide interactive Jupyter notebooks in the demo folder. You can also open these notebooks and run them interactively on Google Colab. To view all available notebooks, follow these steps:

  • Go to the Google Colab website
  • Navigate to File -> Open notebook, and choose GitHub in the pop-up menu
  • Note that you do not need to give Colab access permission to your github account
  • Type stanfordnlp/stanza in the search bar, and press Enter

Trained Models for the Neural Pipeline

We currently provide models for all of the Universal Dependencies treebanks v2.5, as well as NER models for a few widely-spoken languages. You can find instructions for downloading and using these models here.

Batching To Maximize Pipeline Speed

To maximize speed, it is essential to run the pipeline on batches of documents. Running a for loop over one sentence at a time will be very slow. The best approach at this time is to concatenate documents together, with each document separated by a blank line (i.e., two line breaks \n\n). The tokenizer will recognize blank lines as sentence breaks. We are actively working on improving multi-document processing.
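A minimal sketch of this batching approach (the pipeline call itself is commented out, since it assumes downloaded English models):

```python
# Concatenate documents with a blank line ("\n\n") between them, so the
# tokenizer treats each boundary as a break, then run the pipeline once.
documents = [
    "Barack Obama was born in Hawaii.",
    "He was elected president in 2008.",
]
batched_text = "\n\n".join(documents)

# The single pipeline call over the whole batch:
#   import stanza
#   nlp = stanza.Pipeline('en')
#   doc = nlp(batched_text)
```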

Training your own neural pipelines

All neural modules in this library can be trained with your own data. The tokenizer, the multi-word token (MWT) expander, the POS/morphological features tagger, the lemmatizer and the dependency parser require CoNLL-U formatted data, while the NER model requires the BIOES format. Currently, we do not support model training via the Pipeline interface. Therefore, to train your own models, you need to clone this git repository and run training from the source.
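For illustration, a hedged sketch of the BIOES scheme the NER model expects (Begin/Inside/End of a multi-token entity, Single for a one-token entity, O for outside; the tokens and entity labels here are invented for the example):

```python
# Hypothetical BIOES-tagged sentence (tokens and labels invented for the
# example); each token carries a tag marking its position in an entity span.
bioes_sentence = [
    ("Barack", "B-PERSON"),   # Begins a multi-token entity
    ("Obama", "E-PERSON"),    # Ends that entity
    ("visited", "O"),         # Outside any entity
    ("Hawaii", "S-LOC"),      # Single-token entity
]

entity_tokens = [tok for tok, tag in bioes_sentence if tag != "O"]
print(entity_tokens)
```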

For detailed step-by-step guidance on how to train and evaluate your own models, please visit our training documentation.

LICENSE

Stanza is released under the Apache License, Version 2.0. See the LICENSE file for more details.

Comments
  • google.protobuf.message.DecodeError: Error parsing message


    Description: I think this is similar to a bug in the old python library, python-stanford-corenlp. I'm trying to copy the demo for the client here or here, but with my own texts. text2 works and text3 doesn't; the only difference between them is the very last word.

    The error I get is:

    Traceback (most recent call last):
      File "C:/gitProjects/patentmoto2/scratch4.py", line 23, in <module>
        ann = client.annotate(text)
      File "C:\gitProjects\patentmoto2\venv\lib\site-packages\stanfordnlp\server\client.py", line 403, in annotate
        parseFromDelimitedString(doc, r.content)
      File "C:\gitProjects\patentmoto2\venv\lib\site-packages\stanfordnlp\protobuf\__init__.py", line 18, in parseFromDelimitedString
        obj.ParseFromString(buf[offset+pos:offset+pos+size])
    google.protobuf.message.DecodeError: Error parsing message
    

    To Reproduce

    Steps to reproduce the behavior:

    
    print('---')
    print('input text')
    print('')
    
    text = "Chris Manning is a nice person. Chris wrote a simple sentence. He also gives oranges to people."
    text2 = "We claim:1. A photographic camera for three dimension photography comprising:a housing having an opening to the interior for light rays;means for immovably locating photosensitive material in communication with the interior of the housing at a location during a time for exposure;optical means in said housing for projecting light rays, which are received through said opening from a scene to be photographed, along an optical path to said location, said path having a first position therealong extending transversely to the direction of the path from a first side to a second side of the path, the optical means comprisinga lenticular screen extending across said path at a second position farther along said path from the first position and having, on one side, a plurality of elongated lenticular elements of width P which face in the direction from which the light rays are being projected and having an opposite side facing and positioned for contact with the surface of such located photosensitive material,the optical means being characterized in that it changes, by a predetermined distance Y, on such surface of the photosensitive material, the position of light rays which come from a substantially common point on such scene and which extend along said first and second sides of said path;means for blocking the received light rays at said first position;an aperture movable transversely across said path at said first position, from said first side to said second said, for exposing said light rays sequentially to the photosensitive material moving across said screen in a direction normal to the elongation of said lenticular elements; andmeans for so moving said aperture for a predetermined time for exposure while simultaneously and synchronously moving said screen, substantially throughout said predetermined time for exposure, in substantially the same direction as the light rays sequentially expose said photosensitive material and over a distance substantially 
equal to the sum of P + Y to thereby expose a substantially continuous unreversed image of the scene on the photosensitive material, said means for and doing this all day long and."
    text3 = "We claim:1. A photographic camera for three dimension photography comprising:a housing having an opening to the interior for light rays;means for immovably locating photosensitive material in communication with the interior of the housing at a location during a time for exposure;optical means in said housing for projecting light rays, which are received through said opening from a scene to be photographed, along an optical path to said location, said path having a first position therealong extending transversely to the direction of the path from a first side to a second side of the path, the optical means comprisinga lenticular screen extending across said path at a second position farther along said path from the first position and having, on one side, a plurality of elongated lenticular elements of width P which face in the direction from which the light rays are being projected and having an opposite side facing and positioned for contact with the surface of such located photosensitive material,the optical means being characterized in that it changes, by a predetermined distance Y, on such surface of the photosensitive material, the position of light rays which come from a substantially common point on such scene and which extend along said first and second sides of said path;means for blocking the received light rays at said first position;an aperture movable transversely across said path at said first position, from said first side to said second said, for exposing said light rays sequentially to the photosensitive material moving across said screen in a direction normal to the elongation of said lenticular elements; andmeans for so moving said aperture for a predetermined time for exposure while simultaneously and synchronously moving said screen, substantially throughout said predetermined time for exposure, in substantially the same direction as the light rays sequentially expose said photosensitive material and over a distance substantially 
equal to the sum of P + Y to thereby expose a substantially continuous unreversed image of the scene on the photosensitive material, said means for and doing this all day long and his."
    
    text = text3
    print(text)
    
    
    print('---')
    print('starting up Java Stanford CoreNLP Server...')
    
    
    with CoreNLPClient(endpoint="http://localhost:9000", annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
                       timeout=70000, memory='16G', threads=10, be_quiet=False) as client:
    
        ann = client.annotate(text)
    
    
        sentence = ann.sentence[0]
    
    
        print('---')
        print('constituency parse of first sentence')
        constituency_parse = sentence.parseTree
        print(constituency_parse)
    

    Expected behavior I expect it to finish. text=text2 succeeds, but text=text3 fails with the above error. The only difference between the texts is the last word 'his' (could really be anything I think).

    Environment:

    • OS: Windows 10
    • Python version: 3.7.4 (tags/v3.7.4:e09359112e, Jul 8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)]
    • CoreNLP 3.9.2
    • corenlp-protobuf==3.8.0
    • protobuf==3.10.0
    • stanfordnlp==0.2.0
    • torch==1.1.0

    Additional context: I've also gotten a timeout error for some sentences, but it's intermittent. I'm not sure if they're related, but this one is easier to reproduce.

    bug awaiting feedback 
    opened by legolego 41
  • FileNotFoundError: Could not find any treebank files which matched extern_data/ud2/ud-treebanks-v2.8/UD_English-TEST/*-ud-train.conllu


    Hi, a couple of questions that are related.

    I'm trying to train a new model for a new language, but I'm first trying the data included in the packages to know more about how Stanza works when training data.

    When I run the command

    python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-TEST

    the following error appears:

    (nlp) [email protected] oe_lemmatizer_stanza % python3 -m stanza.utils.datasets.prepare_tokenizer_treebank UD_English-TEST
    2022-06-27 16:45:52 INFO: Datasets program called with:
    /Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_tokenizer_treebank.py UD_English-TEST
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1136, in <module>
        main()
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1133, in main
        common.main(process_treebank, add_specific_args)
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/common.py", line 134, in main
        process_treebank(treebank, paths, args)
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/prepare_tokenizer_treebank.py", line 1116, in process_treebank
        train_conllu_file = common.find_treebank_dataset_file(treebank, udbase_dir, "train", "conllu", fail=True)
      File "/Users/dario/virtual-environments/nlp/lib/python3.10/site-packages/stanza/utils/datasets/common.py", line 37, in find_treebank_dataset_file
        raise FileNotFoundError("Could not find any treebank files which matched {}".format(filename))
    FileNotFoundError: Could not find any treebank files which matched extern_data/ud2/ud-treebanks-v2.8/UD_English-TEST/*-ud-train.conllu

    The path I am using is the exact one that comes with the package when cloning it from GitHub. My idea is to replace the files with my own. I have looked through closed issues about similar errors, but their solutions are not applicable to my problem.

    Also, I'm following the documentation for this at https://stanfordnlp.github.io/stanza/training.html#converting-ud-data, but no info is given about the train, test, and dev data. Is the script going to generate the dev and test splits, or do I need to generate them? I'm new to this, and the language I'm trying to add is not in the Universal Dependencies; I have found some datasets in .conll format, which I have converted to .conllu following the Stanza documentation.

    Any ideas?

    Thanks!

    question 
    opened by dmetola 40
  • "AnnotationException: Could not handle incoming annotation" Problem [QUESTION]

    Greetings,

    I am new to the CoreNLP environment and am trying to run the example code given in the documentation. However, I got two errors, as follows:

    First code:

    from stanza.server import CoreNLPClient
    with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'ner'],
                       timeout=30000, memory='2G', be_quiet=True) as client:
        anno = client.annotate(text)

    2020-12-30 16:40:53 INFO: Writing properties to tmp file: corenlp_server-a15136448b834f79.props 2020-12-30 16:40:53 INFO: Starting server with command: java -Xmx2G -cp C:\Users\fatih\stanza_corenlp* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 30000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-a15136448b834f79.props -annotators tokenize,ssplit,pos,ner -preload -outputFormat serialized

    Traceback (most recent call last):
    
      File "C:\Users\fatih\anaconda3\lib\site-packages\stanza\server\client.py", line 446, in _request
        r.raise_for_status()
      File "C:\Users\fatih\anaconda3\lib\site-packages\requests\models.py", line 941, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    HTTPError: 500 Server Error: Internal Server Error for url: http://localhost:9000/?properties=%7B%27annotators%27%3A+%27tokenize%2Cssplit%2Cpos%2Cner%27%2C+%27outputFormat%27%3A+%27serialized%27%7D&resetDefault=false
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "<ipython-input-6-2fbdcdb77b41>", line 6, in <module>
        anno = client.annotate(text)
      File "C:\Users\fatih\anaconda3\lib\site-packages\stanza\server\client.py", line 514, in annotate
        r = self._request(text.encode('utf-8'), request_properties, reset_default, **kwargs)
      File "C:\Users\fatih\anaconda3\lib\site-packages\stanza\server\client.py", line 452, in _request
        raise AnnotationException(r.text)
    AnnotationException: Could not handle incoming annotation
    

    What am I doing wrong? I'm on Windows, using Anaconda and Spyder.

    question 
    opened by fatihbozdag 38
  • How can I run multiple Stanza NER models in parallel?


    I want to run multiple Stanza NER models in parallel. How can I do so? I tried using torch multiprocessing, creating multiple processes with each process running one model, but it doesn't seem to go well.

    processes = []
    for i in range(4):  # No. of processes
        p = mp.Process(target=test, args=(model,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

    question fixed on dev 
    opened by mennatallah644 33
  • Dependency parsing in StanfordCoreNLP  and Stanza giving different result


    I did dependency parsing using StanfordCoreNLP using the code below

    from stanfordcorenlp import StanfordCoreNLP
    nlp = StanfordCoreNLP('stanford-corenlp-full-2018-10-05', lang='en')
    
    sentence = 'The clothes in the dressing room are gorgeous. Can I have one?'
    tree_str = nlp.parse(sentence)
    print(tree_str)
    

    And I got the output:

      (S
        (NP
          (NP (DT The) (NNS clothes))
          (PP (IN in)
            (NP (DT the) (VBG dressing) (NN room))))
        (VP (VBP are)
          (ADJP (JJ gorgeous)))
        (. .)))
    

    How can I get this same output in Stanza?

    import stanza
    from stanza.server import CoreNLPClient
    classpath='/stanford-corenlp-full-2020-04-20/*'
    client = CoreNLPClient(be_quite=False, classpath=classpath, annotators=['parse'], memory='4G', endpoint='http://localhost:8900')
    client.start()
    text = 'The clothes in the dressing room are gorgeous. Can I have one?'
    ann = client.annotate(text)
    sentence = ann.sentence[0]
    dependency_parse = sentence.basicDependencies
    print(dependency_parse)
    
    

    In Stanza, it appears I have to split the text into its component sentences myself. Is there something I am doing wrong?

    Please note that my objective is to extract noun phrases.

    question 
    opened by ajesujoba 31
  • PermanentlyFailedException: Timed out waiting for service to come alive. Part3


    Hi! I know this is similar to #52 and #91, but I am unable to understand how those were solved.

    When I run it on the commandline (Ubuntu : Ubuntu 16.04.6 LTS), it runs with success as below:

    java -Xmx16G -cp "/home/naive/Documents/shrikant/Dialogue_Implement/DST/stanford-corenlp-full-2018-10-05/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-34d0c1fe4d724a56.props -preload tokenize,ssplit,pos,lemma,ner
    
    [main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
    [main] INFO CoreNLP - setting default constituency parser
    [main] INFO CoreNLP - using SR parser: edu/stanford/nlp/models/srparser/englishSR.ser.gz
    [main] INFO CoreNLP -     Threads: 5
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
    [main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.6 sec].
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
    [main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
    [main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [1.2 sec].
    [main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [0.5 sec].
    [main] INFO edu.stanford.nlp.ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [0.7 sec].
    [main] INFO edu.stanford.nlp.time.JollyDayHolidays - Initializing JollyDayHoliday for SUTime from classpath edu/stanford/nlp/models/sutime/jollyday/Holidays_sutime.xml as sutime.binder.1.
    [main] INFO edu.stanford.nlp.time.TimeExpressionExtractorImpl - Using following SUTime rules: edu/stanford/nlp/models/sutime/defs.sutime.txt,edu/stanford/nlp/models/sutime/english.sutime.txt,edu/stanford/nlp/models/sutime/english.holidays.sutime.txt
    [main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 580704 unique entries out of 581863 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_caseless.tab, 0 TokensRegex patterns.
    [main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 4869 unique entries out of 4869 from edu/stanford/nlp/models/kbp/english/gazetteers/regexner_cased.tab, 0 TokensRegex patterns.
    [main] INFO edu.stanford.nlp.pipeline.TokensRegexNERAnnotator - ner.fine.regexner: Read 585573 unique entries from 2 files
    [main] INFO CoreNLP - Starting server...
    [main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
    
    

    But when I run it from a Python script, it fails with the error below:

    
    import os
    os.environ["CORENLP_HOME"] = '/home/naive/Documents/shrikant/Dialogue_Implement/DST/stanford-corenlp-full-2018-10-05'
    
    # Import client module
    from stanza.server import CoreNLPClient
    
    
    client = CoreNLPClient(be_quite=False, classpath='"/home/naive/Documents/shrikant/Dialogue_Implement/DST/stanford-corenlp-full-2018-10-05/*"', annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], memory='16G', endpoint='http://localhost:9000')
    print(client)
    
    client.start()
    #import time; time.sleep(10)
    
    text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
    print (text)
    document = client.annotate(text)
    print ('malviya')
    print(type(document))
    

    Error:

    <stanza.server.client.CoreNLPClient object at 0x7fd296e40d68>
    Starting server with command: java -Xmx4G -cp "/home/naive/Documents/shrikant/Dialogue_Implement/DST/stanford-corenlp-full-2018-10-05"/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-9a4ccb63339146d0.props -preload tokenize,ssplit,pos,lemma,ner
    Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity.
    
    Traceback (most recent call last):
      File "stanza_eng.py", line 18, in <module>
        document = client.annotate(text)
      File "/home/naive/.conda/envs/torch_gpu36/lib/python3.6/site-packages/stanza/server/client.py", line 431, in annotate
        r = self._request(text.encode('utf-8'), request_properties, **kwargs)
      File "/home/naive/.conda/envs/torch_gpu36/lib/python3.6/site-packages/stanza/server/client.py", line 342, in _request
        self.ensure_alive()
      File "/home/naive/.conda/envs/torch_gpu36/lib/python3.6/site-packages/stanza/server/client.py", line 161, in ensure_alive
        raise PermanentlyFailedException("Timed out waiting for service to come alive.")
    stanza.server.client.PermanentlyFailedException: Timed out waiting for service to come alive.
    
    

    Python 3.6.10
    asn1crypto==1.3.0
    certifi==2020.4.5.1
    cffi==1.14.0
    chardet==3.0.4
    cryptography==2.8
    embeddings==0.0.8
    gast==0.2.2
    idna==2.9
    numpy==1.18.2
    protobuf==3.11.3
    pycparser==2.20
    pyOpenSSL==19.1.0
    PySocks==1.7.1
    requests==2.23.0
    six==1.14.0
    stanza==1.0.0
    torch==1.4.0
    tqdm==4.44.1
    urllib3==1.25.8
    vocab==0.0.4

    I am unable to understand the issue here...

    awaiting feedback 
    opened by skmalviya 31
  • Users from China suffer from connection issue when downloading Stanza models


    Hi, there

    Could you help me trace this issue? Here is some info:

    • Network is okay without limitations
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import stanza
    
    if __name__ == '__main__':
        # https://github.com/stanfordnlp/stanza/blob/master/demo/Stanza_Beginners_Guide.ipynb
        # Note that you can use verbose=False to turn off all printed messages
        print("Downloading Chinese model...")
        stanza.download('zh', verbose=True)
    
        # Build a Chinese pipeline, with customized processor list and no logging, and force it to use CPU
        print("Building a Chinese pipeline...")
        zh_nlp = stanza.Pipeline('zh', processors='tokenize,lemma,pos,depparse', verbose=True, use_gpu=False)
    
    C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\Scripts\python.exe C:/Users/mystic/JetBrains/PycharmProjects/BuildRoleRelationship4Novel/learn_stanza.py
    Downloading Chinese model...
    Traceback (most recent call last):
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connection.py", line 159, in _new_conn
        conn = connection.create_connection(
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\util\connection.py", line 84, in create_connection
        raise err
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\util\connection.py", line 74, in create_connection
        sock.connect(sa)
    TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connectionpool.py", line 670, in urlopen
        httplib_response = self._make_request(
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connectionpool.py", line 381, in _make_request
        self._validate_conn(conn)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connectionpool.py", line 976, in _validate_conn
        conn.connect()
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connection.py", line 308, in connect
        conn = self._new_conn()
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connection.py", line 171, in _new_conn
        raise NewConnectionError(
    urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000001E5A5DE7220>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\adapters.py", line 439, in send
        resp = conn.urlopen(
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\connectionpool.py", line 724, in urlopen
        retries = retries.increment(
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\urllib3\util\retry.py", line 439, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /stanfordnlp/stanza-resources/master/resources_1.0.0.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001E5A5DE7220>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "C:/Users/mystic/JetBrains/PycharmProjects/BuildRoleRelationship4Novel/learn_stanza.py", line 9, in <module>
        stanza.download('zh', verbose=True)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\stanza\utils\resources.py", line 223, in download
        request_file(f'{DEFAULT_RESOURCES_URL}/resources_{__resources_version__}.json', os.path.join(dir, 'resources.json'))
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\stanza\utils\resources.py", line 83, in request_file
        download_file(url, path)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\stanza\utils\resources.py", line 66, in download_file
        r = requests.get(url, stream=True)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\api.py", line 76, in get
        return request('get', url, params=params, **kwargs)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\api.py", line 61, in request
        return session.request(method=method, url=url, **kwargs)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\sessions.py", line 530, in request
        resp = self.send(prep, **send_kwargs)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\sessions.py", line 643, in send
        r = adapter.send(request, **kwargs)
      File "C:\Users\mystic\.virtualenvs\BuildRoleRelationship4Novel\lib\site-packages\requests\adapters.py", line 516, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPSConnectionPool(host='raw.githubusercontent.com', port=443): Max retries exceeded with url: /stanfordnlp/stanza-resources/master/resources_1.0.0.json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001E5A5DE7220>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
    
    Process finished with exit code 1
    
    
    enhancement question 
    opened by pplmx 30
  • [QUESTION] Can I run Stanza inside Docker container?

    Can I run Stanza inside a Docker container? I created a container and installed all the dependencies, but when the interpreter reaches the call [word.lemma for sent in doc_stanza.sentences for word in sent.words], the program just freezes without errors.

    question stale 
    opened by malakhovks 29
  • MWT and Pretokenized Text for Italian

    Hello! I'm using Stanza for Italian and I'm trying to generate a pred file starting from a gold file. Unfortunately, if I start with pretokenized text the new pipeline doesn't read MWT tokens, so I can't keep the files aligned. I saw a similar question (#95), but I don't think the problem has been solved... Can anyone help me?

    question 
    opened by manuelfavaro90 28
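
    The setup the issue describes can be reproduced roughly as follows (a sketch, assuming the Italian models are already downloaded; the `tokenize_pretokenized` option is the documented way to feed pretokenized input, and bypassing the tokenizer is exactly why MWT expansion is skipped here):

```python
def build_pretokenized_pipeline(lang="it"):
    # Build a pipeline that accepts pretokenized input: one list of tokens
    # per sentence. In this mode the tokenizer is bypassed, which is why
    # MWT expansion does not happen -- the behavior this issue describes.
    import stanza  # assumes the Italian models are already downloaded
    return stanza.Pipeline(lang, processors="tokenize,mwt,pos",
                           tokenize_pretokenized=True)

# usage:
# nlp = build_pretokenized_pipeline()
# doc = nlp([["Si", "tratta", "di", "un", "esempio"]])
```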
  • ValueError: substring not found

    Describe the bug When using the Vietnamese POS tagger, this problem occurs. To Reproduce Steps to reproduce the behavior:

    1. read the sentence s;
    2. call nlp(s);
    3. 'ValueError: substring not found' then comes out.

    Environment (please complete the following information):

    • OS: CentOS
    • Python version: Python 3.6.8
    • Stanza version: 1.1.1

    Additional context

    bug fixed on dev 
    opened by pipiman 28
  • Is there an API to update existing NER models?

    I have found documentation to be able to train NER models from scratch, but is there an API that'd allow one to update an existing model locally, adding both fresh text and annotations or fresh labels, onto say i2b2 or radiology?

    question fixed on dev 
    opened by snehasaisneha 24
  • Error in retraining UD Arabic-PADT data

    I am trying to retrain the Arabic-PADT data with some corrections, but I get an error while preparing MWT. Tokenization trains just fine, but MWT, after starting, stops with an error.

    [email protected]:/data# python3 -m stanza.utils.datasets.prepare_mwt_treebank UD_Arabic-PADT
    2023-01-04 13:45:32 INFO: Datasets program called with:
    /usr/local/lib/python3.8/dist-packages/stanza/utils/datasets/prepare_mwt_treebank.py UD_Arabic-PADT
    Preparing data for UD_Arabic-PADT: ar_padt, ar
    Reading from /data/UD_Arabic-PADT/ar_padt-ud-train.conllu and writing to /tmp/tmpyiumbfdm/ar_padt.train.gold.conllu
    Reading from /data/UD_Arabic-PADT/ar_padt-ud-dev.conllu and writing to /tmp/tmpyiumbfdm/ar_padt.dev.gold.conllu
    Reading from /data/UD_Arabic-PADT/ar_padt-ud-test.conllu and writing to /tmp/tmpyiumbfdm/ar_padt.test.gold.conllu
    11766 unique MWTs found in data
    2480 unique MWTs found in data
    2426 unique MWTs found in data
    Traceback (most recent call last):
      File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/usr/local/lib/python3.8/dist-packages/stanza/utils/datasets/prepare_mwt_treebank.py", line 63, in <module>
        main()
      File "/usr/local/lib/python3.8/dist-packages/stanza/utils/datasets/prepare_mwt_treebank.py", line 60, in main
        common.main(process_treebank)
      File "/usr/local/lib/python3.8/dist-packages/stanza/utils/datasets/common.py", line 257, in main
        process_treebank(treebank, paths, args)
      File "/usr/local/lib/python3.8/dist-packages/stanza/utils/datasets/prepare_mwt_treebank.py", line 49, in process_treebank
        source_filename = prepare_tokenizer_treebank.mwt_name(tokenizer_dir, short_name, shard)
    AttributeError: module 'stanza.utils.datasets.prepare_tokenizer_treebank' has no attribute 'mwt_name'
    [email protected]:/data# python3 -m stanza.utils.datasets.prepare_mwt_treebank UD_Arabic-PADT
    

    I am running Python 3.8 under Ubuntu 20.04 in a docker container. Stanza is installed through pip.

    Any hint?

    Thank you,

    Giuliano

    bug duplicate fixed on dev 
    opened by lancioni 2
  • 1.4.0 is buggy when it comes to some dependency parsing tasks, however, 1.3.0 works correctly

    I am using the dependency parser and noticed 1.4.0 has bugs that do not exist in 1.3.0. Here is an example:

    If B is true and if C is false, perform D; else, perform E and perform F

    in 1.3.0, 'else' is correctly detected as a child of the 'perform' coming after it; however, in 1.4.0, it is detected as a child of the 'perform' before it.

    How can I force Stanza to load 1.3.0 instead of the latest version, so I can move forward with what I am doing now?

    bug 
    opened by apsyio 2
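
    The usual way to stay on an older release is to pin the package version (pip install "stanza==1.3.0"); each release then requests the resources file matching its own version, so the downloaded models match. A small sketch of a guard against accidental upgrades (the helper name is illustrative, not part of Stanza):

```python
def stanza_version_ok(installed, wanted="1.3."):
    # True when the installed version string is in the wanted series,
    # e.g. after `pip install "stanza==1.3.0"` use stanza.__version__ here.
    return installed.startswith(wanted)
```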
  • CUDA device-side assert is thrown unpredictably

    Describe the bug I'm using Stanza to do sentence splitting and other preprocessing as part of a machine translation pipeline. At random times, my server starts to throw errors for about half of the requests. The problem vanishes after the server is restarted. The error is always the same:

    File "/var/app/current/app/translator.py", line 24, in _split_sentences
      sents = self.nlp(text).sentences
    File "/var/app/venv/lib/python3.8/site-packages/stanza/pipeline/core.py", line 386, in __call__
      return self.process(doc, processors)
    File "/var/app/venv/lib/python3.8/site-packages/stanza/pipeline/core.py", line 382, in process
      doc = process(doc)
    File "/var/app/venv/lib/python3.8/site-packages/stanza/pipeline/tokenize_processor.py", line 87, in process
      _, _, _, document = output_predictions(None, self.trainer, batches, self.vocab, None,
    File "/var/app/venv/lib/python3.8/site-packages/stanza/models/tokenization/utils.py", line 273, in output_predictions
      pred = np.argmax(trainer.predict(batch), axis=2)
    File "/var/app/venv/lib/python3.8/site-packages/stanza/models/tokenization/trainer.py", line 66, in predict
      units = units.cuda()
    RuntimeError: CUDA error: device-side assert triggered
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    

    Most of the time there are no errors. Since the errors happen in production and at random times, I haven't been able to reproduce or debug them properly. I'm unsure how I should proceed.

    To Reproduce I don't know how to reproduce this, as it happens randomly.

    My code is something like this (non-relevant parts redacted):

    def __init__(self, source_lang: str, target_lang: str):
        self.nlp = stanza.Pipeline(lang=source_lang, processors="tokenize")
        # ...

    def _split_sentences(self, text: str):
        sents = self.nlp(text).sentences
        # other processing ...
    

    Only one stanza.Pipeline object is created by the server process.

    Expected behavior There should be no errors.

    Environment (please complete the following information): The server is an Amazon EC2 instance.

    • OS: Amazon Linux 2/3.3.16
    • Python version: Python 3.8 running on 64bit
    • Stanza version: 1.4.0
    bug 
    opened by fergusq 4
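
    A common mitigation (not from the Stanza authors; an assumption that only tokenization is needed, as in the report above) is to keep the tokenization pipeline on CPU, where sentence splitting is usually cheap, so the CUDA context is never touched:

```python
def build_cpu_tokenizer(lang):
    # Keep the tokenizer off the GPU entirely; `use_gpu=False` avoids
    # creating a CUDA context, sidestepping sporadic device-side asserts
    # at the cost of some speed.
    import stanza  # assumes the language's models are downloaded
    return stanza.Pipeline(lang=lang, processors="tokenize", use_gpu=False)

# usage:
# nlp = build_cpu_tokenizer("en")
# sentences = nlp("One sentence. Another one.").sentences
```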
  • Provide a list of bracket symbols?

    Could you kindly provide a list of the bracket symbols you use in the constituency module? I know it's from the Penn Treebank, but it's very hard to find a complete list. E.g., most online sources don't have NML. And I'm not sure what the separated dashes in the model output denote.

    opened by ningkko 1
  • [QUESTION] Models of Old Church Slavonic and encoding

    I am using Stanza to analyze Old Church Slavonic texts and extract lemmata and dependencies. Therefore, I wonder what resources (texts) were used to build the pretrained models, and how many. Is it possible to enhance lemmata manually, for example, if some changes are necessary?

    There is a problem with how to encode Old Church Slavonic words -- there is not only an alphabet to consider but also diacritic symbols. What approach do you use?

    question 
    opened by osherenko 1
Releases(v1.4.2)
  • v1.4.2(Sep 15, 2022)

    Stanza v1.4.2: Minor version bump to improve (python) dependencies

    • Pipeline cache in Multilingual is a single OrderedDict https://github.com/stanfordnlp/stanza/issues/1115#issuecomment-1239759362 https://github.com/stanfordnlp/stanza/commit/ba3f64d5f571b1dc70121551364fc89d103ca1cd

    • Don't require pytest for all installations unless needed for testing https://github.com/stanfordnlp/stanza/issues/1120 https://github.com/stanfordnlp/stanza/commit/8c1d9d80e2e12729f60f05b81e88e113fbdd3482

    • Hide SiLU and Mish imports if the version of torch installed doesn't have those nonlinearities https://github.com/stanfordnlp/stanza/issues/1120 https://github.com/stanfordnlp/stanza/commit/6a90ad4bacf923c88438da53219c48355b847ed3

    • Reorder & normalize installations in setup.py https://github.com/stanfordnlp/stanza/pull/1124

    Source code(tar.gz)
    Source code(zip)
  • v1.4.1(Sep 14, 2022)

    Stanza v1.4.1: Improvements to pos, conparse, and sentiment, jupyter visualization, and wider language coverage

    Overview

    We improve the quality of the POS, constituency, and sentiment models, add an integration to displaCy, and add new models for a variety of languages.

    New NER models

    • New Polish NER model based on NKJP from Karol Saputa and ryszardtuora https://github.com/stanfordnlp/stanza/issues/1070 https://github.com/stanfordnlp/stanza/pull/1110

    • Make GermEval2014 the default German NER model, including an optional Bert version https://github.com/stanfordnlp/stanza/issues/1018 https://github.com/stanfordnlp/stanza/pull/1022

    • Japanese conversion of GSD by Megagon https://github.com/stanfordnlp/stanza/pull/1038

    • Marathi NER dataset from L3Cube. Includes a Sentiment model as well https://github.com/stanfordnlp/stanza/pull/1043

    • Thai conversion of LST20 https://github.com/stanfordnlp/stanza/commit/555fc0342decad70f36f501a7ea1e29fa0c5b317

    • Kazakh conversion of KazNERD https://github.com/stanfordnlp/stanza/pull/1091/commits/de6cd25c2e5b936bc4ad2764b7b67751d0b862d7

    Other new models

    • Sentiment conversion of Tass2020 for Spanish https://github.com/stanfordnlp/stanza/pull/1104

    • VIT constituency dataset for Italian https://github.com/stanfordnlp/stanza/pull/1091/commits/149f1440dc32d47fbabcc498cfcd316e53aca0c6 ... and many subsequent updates

    • Combined UD models for Hebrew https://github.com/stanfordnlp/stanza/issues/1109 https://github.com/stanfordnlp/stanza/commit/e4fcf003feb984f535371fb91c9e380dd187fd12

    • For UD models with a small train dataset & larger test dataset, flip the datasets: UD_Buryat-BDT, UD_Kazakh-KTB, UD_Kurmanji-MG, UD_Ligurian-GLT, UD_Upper_Sorbian-UFAL https://github.com/stanfordnlp/stanza/issues/1030 https://github.com/stanfordnlp/stanza/commit/9618d60d63c49ec1bfff7416e3f1ad87300c7073

    • Spanish conparse model from multiple sources - AnCora, LDC-NW, LDC-DF https://github.com/stanfordnlp/stanza/commit/47740c6252a6717f12ef1fde875cf19fa1cd67cc

    Model improvements

    • Pretrained charlm integrated into POS. Gives a small to decent gain for most languages without much additional cost https://github.com/stanfordnlp/stanza/pull/1086

    • Pretrained charlm integrated into Sentiment. Improves English, others not so much https://github.com/stanfordnlp/stanza/pull/1025

    • LSTM and 2d maxpool as optional items in the Sentiment model, from the paper Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling https://github.com/stanfordnlp/stanza/pull/1098

    • First learn with AdaDelta, then with another optimizer in conparse training. Very helpful https://github.com/stanfordnlp/stanza/commit/b1d10d3bdd892c7f68d2da7f4ba68a6ae3087f52

    • Grad clipping in conparse training https://github.com/stanfordnlp/stanza/commit/365066add019096332bcba0da4a626f68b70d303

    Pipeline interface improvements

    • GPU memory savings: charlm reused between different processors in the same pipeline https://github.com/stanfordnlp/stanza/pull/1028

    • Word vectors not saved in the NER models. Saves bandwidth & disk space https://github.com/stanfordnlp/stanza/pull/1033

    • Functions to return tagsets for NER and conparse models https://github.com/stanfordnlp/stanza/issues/1066 https://github.com/stanfordnlp/stanza/pull/1073 https://github.com/stanfordnlp/stanza/commit/36b84db71f19e37b36119e2ec63f89d1e509acb0 https://github.com/stanfordnlp/stanza/commit/2db43c834bc8adbb8b096cf135f0fab8b8d886cb

    • displaCy integration with NER and dependency trees https://github.com/stanfordnlp/stanza/commit/20714137d81e5e63d2bcee420b22c4fd2a871306
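
    A sketch of how the new tagset accessors might be used on a loaded pipeline (the accessor name `get_known_tags` follows the linked commits but should be treated as an assumption):

```python
def ner_tagset(nlp):
    # Query the tags an NER model was trained with, per the release note
    # above; `get_known_tags` is assumed to be the accessor added here.
    return nlp.processors["ner"].get_known_tags()

# usage:
# nlp = stanza.Pipeline("en", processors="tokenize,ner")
# print(ner_tagset(nlp))
```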

    Bugfixes

    • Fix that it takes forever to tokenize a single long token (catastrophic backtracking in regex) TY to Sk Adnan Hassan (VT) and Zainab Aamir (Stony Brook) https://github.com/stanfordnlp/stanza/pull/1056

    • Starting a new corenlp client w/o server shouldn't wait for the server to be available TY to Mariano Crosetti https://github.com/stanfordnlp/stanza/issues/1059 https://github.com/stanfordnlp/stanza/pull/1061

    • Read raw glove word vectors (they have no header information) https://github.com/stanfordnlp/stanza/pull/1074

    • Ensure that illegal languages are not chosen by the LangID model https://github.com/stanfordnlp/stanza/issues/1076 https://github.com/stanfordnlp/stanza/pull/1077

    • Fix cache in Multilingual pipeline https://github.com/stanfordnlp/stanza/issues/1115 https://github.com/stanfordnlp/stanza/commit/cdf18d8b19c92b0cfbbf987e82b0080ea7b4db32

    • Fix loading of previously unseen languages in Multilingual pipeline https://github.com/stanfordnlp/stanza/issues/1101 https://github.com/stanfordnlp/stanza/commit/e551ebe60a4d818bc5ba8880dda741cc8bd1aed7

    • Fix that conparse would occasionally train to NaN early in the training https://github.com/stanfordnlp/stanza/commit/c4d785729e42ac90f298e0ef4ab487d14fa35591

    Improved training tools

    • W&B integration for all models: can be activated with --wandb flag in the training scripts https://github.com/stanfordnlp/stanza/pull/1040

    • New webpages for building charlm, NER, and Sentiment https://stanfordnlp.github.io/stanza/new_language_charlm.html https://stanfordnlp.github.io/stanza/new_language_ner.html https://stanfordnlp.github.io/stanza/new_language_sentiment.html

    • Script to download Oscar 2019 data for charlm from HF (requires datasets module) https://github.com/stanfordnlp/stanza/pull/1014

    • Unify sentiment training into a Python script, replacing the old shell script https://github.com/stanfordnlp/stanza/pull/1021 https://github.com/stanfordnlp/stanza/pull/1023

    • Convert sentiment to use .json inputs. In particular, this helps with languages with spaces in words such as Vietnamese https://github.com/stanfordnlp/stanza/pull/1024

    • Slightly faster charlm training https://github.com/stanfordnlp/stanza/pull/1026

    • Data conversion of WikiNER generalized for retraining / add new WikiNER models https://github.com/stanfordnlp/stanza/pull/1039

    • XPOS factory now determined at start of POS training. Makes addition of new languages easier https://github.com/stanfordnlp/stanza/pull/1082

    • Checkpointing and continued training for charlm, conparse, sentiment https://github.com/stanfordnlp/stanza/pull/1090 https://github.com/stanfordnlp/stanza/commit/0e6de808eacf14cd64622415eeaeeac2d60faab2 https://github.com/stanfordnlp/stanza/commit/e5793c9dd5359f7e8f4fe82bf318a2f8fd190f54

    • Option to write the results of a NER model to a file https://github.com/stanfordnlp/stanza/pull/1108

    • Add fake dependencies to a conllu formatted dataset for better integration with evaluation tools https://github.com/stanfordnlp/stanza/commit/6544ef3fa5e4f1b7f06dbcc5521fbf9b1264197a

    • Convert an AMT NER result to Stanza .json https://github.com/stanfordnlp/stanza/commit/cfa7e496ca7c7662478e03c5565e1b2b2c026fad

    • Add a ton of language codes, including 3 letter codes for languages we generally treat as 2 letters https://github.com/stanfordnlp/stanza/commit/5a5e9187f81bd76fcd84ad713b51215b64234986 https://github.com/stanfordnlp/stanza/commit/b32a98e477e9972737ad64deea0bda8d6cebb4ec and others

    Source code(tar.gz)
    Source code(zip)
  • v1.4.0(Apr 23, 2022)

    Stanza v1.4.0: Transformer integration to NER and conparse

    Overview

    As part of the new Stanza release, we integrate transformer inputs to the NER and conparse modules. In addition, we now support several additional languages for NER and conparse.

    Pipeline interface improvements

    • Download resources.json and models into temp dirs first to avoid race conditions between multiple processes https://github.com/stanfordnlp/stanza/issues/213 https://github.com/stanfordnlp/stanza/pull/1001

    • Download models for Pipelines automatically, without needing to call stanza.download(...) https://github.com/stanfordnlp/stanza/issues/486 https://github.com/stanfordnlp/stanza/pull/943

    • Add ability to turn off downloads https://github.com/stanfordnlp/stanza/commit/68455d895986357a2c1f496e52c4e59ee0feb165

    • Add a new interface where both processors and package can be set https://github.com/stanfordnlp/stanza/issues/917 https://github.com/stanfordnlp/stanza/commit/f37042924b7665bbaf006b02dcbf8904d71931a1

    • When using pretokenized tokens, get character offsets from text if available https://github.com/stanfordnlp/stanza/issues/967 https://github.com/stanfordnlp/stanza/pull/975

    • If Bert or other transformers are used, cache the models rather than loading multiple times https://github.com/stanfordnlp/stanza/pull/980

    • Allow for disabling processors on individual runs of a pipeline https://github.com/stanfordnlp/stanza/issues/945 https://github.com/stanfordnlp/stanza/pull/947
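
    The per-run processor selection above can be sketched like this: the pipeline is built once with everything loaded, and an individual call restricts which processors actually run (the wrapper name is illustrative):

```python
def tokenize_only(nlp, text):
    # Restrict a single call to the tokenizer even though the pipeline
    # has more processors loaded; other calls can still use them all.
    return nlp(text, processors=["tokenize"])

# usage:
# nlp = stanza.Pipeline("en")          # full pipeline, loaded once
# doc = tokenize_only(nlp, "Some text.")
```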

    Other general improvements

    • Add # text and # sent_id to conll output https://github.com/stanfordnlp/stanza/discussions/918 https://github.com/stanfordnlp/stanza/pull/983 https://github.com/stanfordnlp/stanza/pull/995

    • Add ner to the token conll output https://github.com/stanfordnlp/stanza/discussions/993 https://github.com/stanfordnlp/stanza/pull/996

    • Fix missing Slovak MWT model https://github.com/stanfordnlp/stanza/issues/971 https://github.com/stanfordnlp/stanza/commit/5aa19ec2e6bc610576bc12d226d6f247a21dbd75

    • Upgrades to EN, IT, and Indonesian models https://github.com/stanfordnlp/stanza/issues/1003 https://github.com/stanfordnlp/stanza/pull/1008 IT improvements with the help of @attardi and @msimi

    • Fix improper tokenization of Chinese text with leading whitespace https://github.com/stanfordnlp/stanza/issues/920 https://github.com/stanfordnlp/stanza/pull/924

    • Check if a CoreNLP model exists before downloading it (thank you @interNULL) https://github.com/stanfordnlp/stanza/pull/965

    • Convert the run_charlm script to python https://github.com/stanfordnlp/stanza/pull/942

    • Typing and lint fixes (thank you @asears) https://github.com/stanfordnlp/stanza/pull/833 https://github.com/stanfordnlp/stanza/pull/856

    • stanza-train examples now compatible with the python training scripts https://github.com/stanfordnlp/stanza/issues/896
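
    The `# text` and `# sent_id` comments mentioned above appear when serializing an annotated document to CoNLL-U, e.g. via the `CoNLL` utility (a sketch; assumes Stanza at this release or later is installed):

```python
def save_conllu(doc, path):
    # Serialize an annotated stanza Document to a CoNLL-U file; as of
    # this release the output includes "# text" and "# sent_id" comments.
    from stanza.utils.conll import CoNLL
    CoNLL.write_doc2conll(doc, path)
```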

    NER features

    • Bert integration (not by default, thank you @vythaihn) https://github.com/stanfordnlp/stanza/pull/976

    • Swedish model (thank you @EmilStenstrom) https://github.com/stanfordnlp/stanza/issues/912 https://github.com/stanfordnlp/stanza/pull/857

    • Persian model https://github.com/stanfordnlp/stanza/issues/797

    • Danish model https://github.com/stanfordnlp/stanza/pull/910/commits/3783cc494ee8c6b6d062c4d652a428a04a4ee839

    • Norwegian model (both NB and NN) https://github.com/stanfordnlp/stanza/pull/910/commits/31fa23e5239b10edca8ecea46e2114f9cc7b031d

    • Use updated Ukrainian data (thank you @gawy) https://github.com/stanfordnlp/stanza/pull/873

    • Myanmar model (thank you UCSY) https://github.com/stanfordnlp/stanza/pull/845

    • Training improvements for finetuning models https://github.com/stanfordnlp/stanza/issues/788 https://github.com/stanfordnlp/stanza/pull/791

    • Fix inconsistencies in B/S/I/E tags https://github.com/stanfordnlp/stanza/issues/928#issuecomment-1027987531 https://github.com/stanfordnlp/stanza/pull/961

    • Add an option for multiple NER models at the same time, merging the results together https://github.com/stanfordnlp/stanza/issues/928 https://github.com/stanfordnlp/stanza/pull/955
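
    Running several NER models at once might look like the following sketch; the `package` dict form and the two package names are assumptions for illustration, not guaranteed names:

```python
def multi_ner_pipeline(models=("ontonotes", "ncbi_disease")):
    # Load several NER models in one pipeline so their outputs are merged;
    # the package names here are illustrative placeholders.
    import stanza
    return stanza.Pipeline("en", processors="tokenize,ner",
                           package={"ner": list(models)})
```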

    Constituency parser

    • Dynamic oracle (improves accuracy a bit) https://github.com/stanfordnlp/stanza/pull/866

    • Missing tags now okay in the parser https://github.com/stanfordnlp/stanza/issues/862 https://github.com/stanfordnlp/stanza/commit/04dbf4f65e417a2ceb19897ab62c4cf293187c0b

    • bugfix of () not being escaped when output in a tree https://github.com/stanfordnlp/stanza/commit/eaf134ca699aca158dc6e706878037a20bc8cbd4

    • charlm integration by default https://github.com/stanfordnlp/stanza/pull/799

    • Bert integration (not the default model) (thank you @vythaihn and @hungbui0411) https://github.com/stanfordnlp/stanza/commit/05a0b04ee6dd701ca1c7c60197be62d4c13b17b6 https://github.com/stanfordnlp/stanza/commit/0bbe8d10f895560a2bf16f542d2e3586d5d45b7e

    • Preemptive bugfix for incompatible devices from @zhaochaocs https://github.com/stanfordnlp/stanza/issues/989 https://github.com/stanfordnlp/stanza/pull/1002

    • New models:

      • DA, based on Arboretum
      • IT, based on the Turin treebank
      • JA, based on ALT
      • PT, based on Cintil
      • TR, based on Starlang
      • ZH, based on CTB7

    Source code(tar.gz)
    Source code(zip)
  • v1.3.0(Oct 6, 2021)

    Overview

    Stanza 1.3.0 introduces a language id model, a constituency parser, a dictionary in the tokenizer, and some additional features and bugfixes.

    New features

    • Langid model and multilingual pipeline Based on "A reproduction of Apple's bi-directional LSTM models for language identification in short strings" by Toftrup et al. 2021 (https://github.com/stanfordnlp/stanza/commit/154b0e8e59d3276744ae0c8ea56dc226f777fba8)

    • Constituency parser Based on "In-Order Transition-based Constituent Parsing" by Jiangming Liu and Yue Zhang. Currently an en_wsj model is available, with more to come. (https://github.com/stanfordnlp/stanza/commit/90318023432d584c62986123ef414a1fa93683ca)

    • Evalb interface to CoreNLP Useful for evaluating the parser - requires CoreNLP 4.3.0 or later

    • Dictionary tokenizer feature Noticeably improved performance for ZH, VI, TH (https://github.com/stanfordnlp/stanza/pull/776)
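
    The langid model and multilingual pipeline can be used together roughly as follows (a sketch; assumes the langid and per-language models are downloaded, and that `process` accepts a list of raw strings):

```python
def detect_and_annotate(texts):
    # One object that runs language identification per document, then
    # routes each document to a cached pipeline for its detected language.
    from stanza.pipeline.multilingual import MultilingualPipeline
    nlp = MultilingualPipeline()
    return nlp.process(texts)

# usage:
# docs = detect_and_annotate(["Hello world.", "Bonjour le monde."])
```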

    Bugfixes / Reliability

    • HuggingFace integration No more git issues complaining about unavailable models! (Hopefully) (https://github.com/stanfordnlp/stanza/commit/f7af5049568f81a716106fee5403d339ca246f38)

    • Sentiment processor crashes on certain inputs (issue https://github.com/stanfordnlp/stanza/issues/804, fixed by https://github.com/stanfordnlp/stanza/commit/e232f67f3850a32a1b4f3a99e9eb4f5c5580c019)

    Source code(tar.gz)
    Source code(zip)
  • v1.2.3(Aug 9, 2021)

    Overview

    In anticipation of a larger release with some new features, we make a small update to fix some existing bugs and add two more NER models.

    Bugfixes

    • Sentiment models would crash on no text (issue https://github.com/stanfordnlp/stanza/issues/769, fixed by https://github.com/stanfordnlp/stanza/pull/781/commits/47889e3043c27f9c5abd9913016929f1857de7bf)

    • Java processes as a context were not properly closed (https://github.com/stanfordnlp/stanza/pull/781/commits/a39d2ff6801a23aa73add1f710d809a9c0a793b1)

    Interface improvements

    • Downloading tokenize now downloads mwt for languages which require it (issue https://github.com/stanfordnlp/stanza/issues/774, fixed by https://github.com/stanfordnlp/stanza/pull/777, from davidrft)

    • NER model can finetune and save to/from different filenames (https://github.com/stanfordnlp/stanza/pull/781/commits/0714a0134f0af6ef486b49ce934f894536e31d43)

    • NER model now displays a confusion matrix at the end of training (https://github.com/stanfordnlp/stanza/pull/781/commits/9bbd3f712f97cb2702a0852e1c353d4d54b4b33b)

    NER models

    • Afrikaans, trained in NCHLT (https://github.com/stanfordnlp/stanza/pull/781/commits/6f1f04b6d674691cf9932d780da436063ebd3381)

    • Italian, trained on a model from FBK (https://github.com/stanfordnlp/stanza/pull/781/commits/d9a361fd7f13105b68569fddeab650ea9bd04b7f)

    Source code(tar.gz)
    Source code(zip)
  • v1.2.2(Jul 15, 2021)

    Overview

    This release fixes a regression in NER results introduced in 1.2.1 while fixing a space-related bug in the VI models.

    Bugfixes

    • Fix Sentiment not loading correctly on Windows because of pickling issue (https://github.com/stanfordnlp/stanza/pull/742) (thanks to @BramVanroy)

    • Fix NER bulk process not filling out data structures as expected (https://github.com/stanfordnlp/stanza/issues/721) (https://github.com/stanfordnlp/stanza/pull/722)

    • Fix NER space issue causing a performance regression (https://github.com/stanfordnlp/stanza/issues/739) (https://github.com/stanfordnlp/stanza/pull/732)

    Interface improvements

    • Add an NER run script (https://github.com/stanfordnlp/stanza/pull/738)
    Source code(tar.gz)
    Source code(zip)
  • v1.2.1(Jun 17, 2021)

    Overview

    All models other than NER and Sentiment were retrained with the new UD 2.8 release. All of the updates include the data augmentation fixes applied in 1.2.0, along with new augmentations addressing tokenization and end-of-sentence issues. This release also features various enhancements, bug fixes, and performance improvements, along with 4 new NER models.

    Model improvements

    • Add Bulgarian, Finnish, Hungarian, Vietnamese NER models

      • The Bulgarian model is trained on BSNLP 2019 data.
      • The Finnish model is trained on the Turku NER data.
      • The Hungarian model is trained on a combination of the NYTK dataset and earlier business and criminal NER datasets.
      • The Vietnamese model is trained on the VLSP 2018 data.
      • Furthermore, the script for preparing the lang-uk NER data has been integrated (https://github.com/stanfordnlp/stanza/commit/c1f0bee1074997d9376adaec45dc00f813d00b38)
    • Use new word vectors for Armenian, including better coverage for the new Western Armenian dataset (https://github.com/stanfordnlp/stanza/pull/718/commits/d9e8301addc93450dc880b06cb665ad10d869242)

    • Add copy mechanism in the seq2seq model. This fixes some unusual Spanish multi-word token expansion errors and potentially improves lemmatization performance. (https://github.com/stanfordnlp/stanza/pull/692 https://github.com/stanfordnlp/stanza/issues/684)

    • Fix Spanish POS and depparse mishandling a missing leading ¿ (https://github.com/stanfordnlp/stanza/pull/699 https://github.com/stanfordnlp/stanza/issues/698)

    • Fix tokenization breaking when a newline splits a Chinese token (https://github.com/stanfordnlp/stanza/pull/632 https://github.com/stanfordnlp/stanza/issues/531)

    • Fix tokenization of parentheses in Chinese (https://github.com/stanfordnlp/stanza/commit/452d842ed596bb7807e604eeb2295fd4742b7e89)

    • Fix various issues with characters not present in UD training data such as ellipses characters or unicode apostrophe (https://github.com/stanfordnlp/stanza/pull/719/commits/db0555253f0a68c76cf50209387dd2ff37794197 https://github.com/stanfordnlp/stanza/pull/719/commits/f01a1420755e3e0d9f4d7c2895e0261e581f7413 https://github.com/stanfordnlp/stanza/pull/719/commits/85898c50f14daed75b96eed9cd6e9d6f86e2d197)

    • Fix a variety of issues with Vietnamese tokenization - remove language specific model improvement which got roughly 1% F1 but caused numerous hard-to-track issues (https://github.com/stanfordnlp/stanza/pull/719/commits/3ccb132e03ce28a9061ec17d2c0ae84cc2000548)

    • Fix spaces in the Vietnamese words not being found in the embedding used for POS and depparse (https://github.com/stanfordnlp/stanza/pull/719/commits/197212269bc33b66759855a5addb99d1f465e4f4)

    • Include UD_English-GUMReddit in the GUM models (https://github.com/stanfordnlp/stanza/pull/719/commits/9e6367cb9bdd635d579fd8d389cb4d5fa121c413)

    • Add Pronouns & PUD to the mixed English models (various data improvements made this more appealing) (https://github.com/stanfordnlp/stanza/pull/719/commits/f74bef7b2ed171bf9c027ae4dfd3a10272040a46)

    Interface enhancements

    • Add ability to pass a Document to the pipeline in pretokenized mode (https://github.com/stanfordnlp/stanza/commit/f88cd8c2f84aedeaec34a11b4bc27573657a66e2 https://github.com/stanfordnlp/stanza/issues/696)

    • Track comments when reading and writing conll files (https://github.com/stanfordnlp/stanza/pull/676 originally from @danielhers in https://github.com/stanfordnlp/stanza/pull/155)

    • Add a proxy parameter for downloads to pass through to the requests module (https://github.com/stanfordnlp/stanza/pull/638)

    • Add sent_idx to tokens (https://github.com/stanfordnlp/stanza/commit/ee6135c538e24ff37d08b86f34668ccb223c49e1)
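
    The download proxy support passes straight through to the requests module's `proxies` argument; a sketch (the proxy address is an illustrative placeholder):

```python
def download_via_proxy(lang, proxy_url):
    # Route model downloads through a proxy; the dict is handed to
    # requests' `proxies` argument, per the release note above.
    # proxy_url like "http://proxy.example.com:8080" is a placeholder.
    import stanza
    stanza.download(lang, proxies={"http": proxy_url, "https": proxy_url})
```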

    Bugfixes

    • Fix Windows encoding issues when reading conll documents from @yanirmr (b40379eaf229e7ffc7580def57ee1fad46080261 https://github.com/stanfordnlp/stanza/pull/695)

    • Fix tokenization breaking when the second batch is exactly eval_length (https://github.com/stanfordnlp/stanza/commit/726368644d7b1019825f915fabcfe1e4528e068e https://github.com/stanfordnlp/stanza/issues/634 https://github.com/stanfordnlp/stanza/issues/631)

    Efficiency improvements

    • Bulk process for tokenization - greatly speeds up the use case of many small docs (https://github.com/stanfordnlp/stanza/pull/719/commits/5d2d39ec822c65cb5f60d547357ad8b821683e3c)

    • Optimize MWT usage in pipeline & fix MWT bulk_process (https://github.com/stanfordnlp/stanza/pull/642 https://github.com/stanfordnlp/stanza/pull/643 https://github.com/stanfordnlp/stanza/pull/644)

    CoreNLP integration

    • Add a UD Enhancer tool which interfaces with CoreNLP's generic enhancer (https://github.com/stanfordnlp/stanza/pull/675)

    • Add an interface to CoreNLP tokensregex using stanza tokenization (https://github.com/stanfordnlp/stanza/pull/659)

  • v1.2.0(Jan 29, 2021)

    Overview

    All models other than NER and Sentiment were retrained with the new UD 2.7 release. Quite a few of them have data augmentation fixes for problems which arise in common use rather than when running an evaluation task. This release also features various enhancements, bug fixes, and performance improvements.

    New features and enhancements

    • Models trained on combined datasets in English and Italian. The default models for English are now a combination of EWT and GUM. The default models for Italian now combine ISDT, VIT, Twittiro, PosTWITA, and a custom dataset including MWT tokens.

    • NER Transfer Learning. Allows users to fine-tune all or part of the parameters of trained NER models on a new dataset for transfer learning (#351, thanks to @gawy for the contribution)

    • Multi-document support. The Stanza Pipeline now supports multi-Document input! To process multiple documents without having to worry about document boundaries, simply pass a list of Stanza Document objects into the Pipeline. (https://github.com/stanfordnlp/stanza/issues/70 https://github.com/stanfordnlp/stanza/pull/577)

    • Added API links from token to sentence. It is now easier to access Stanza data objects from related ones. To access the sentence containing a token or a word, simply use token.sent or word.sent. (https://github.com/stanfordnlp/stanza/issues/533 https://github.com/stanfordnlp/stanza/pull/554)

    • New external tokenizer for Thai with PyThaiNLP. Try it out with, for example, stanza.Pipeline(lang='th', processors={'tokenize': 'pythainlp'}, package=None). (https://github.com/stanfordnlp/stanza/pull/567)

    • Faster tokenization. We have improved how the data pipeline works internally to reduce redundant data wrangling, and significantly sped up the tokenization of long texts. If you have a really long line of text, you could experience up to a 10x speedup or more without changing anything. (#522)

    • Added a method for getting all the supported languages from the resources file. Wondering what languages Stanza supports and want to determine it programmatically? Wonder no more! Try stanza.resources.common.list_available_languages(). (https://github.com/stanfordnlp/stanza/issues/511 https://github.com/stanfordnlp/stanza/commit/fa52f8562f20ab56807b35ba204d6f9ca60b47ab)

    • Load mwt automagically if a model needs it. Multi-word token expansion is one of the most common things to miss from your Pipeline instantiation, and remembering to include it is a pain -- until now. (https://github.com/stanfordnlp/stanza/pull/516 https://github.com/stanfordnlp/stanza/issues/515 and many others)

    • Vietnamese sentiment model based on VSFC. This is now part of the default language package for Vietnamese that you get from stanza.download("vi"). Enjoy!

    • More informative errors for missing models. Stanza now throws helpful exceptions with informative messages when you are missing models (https://github.com/stanfordnlp/stanza/pull/437 https://github.com/stanfordnlp/stanza/issues/430 ... https://github.com/stanfordnlp/stanza/issues/324 https://github.com/stanfordnlp/stanza/pull/438 ... https://github.com/stanfordnlp/stanza/issues/529 https://github.com/stanfordnlp/stanza/commit/953966539c955951d01e3d6b4561fab02a1f546c ... https://github.com/stanfordnlp/stanza/issues/575 https://github.com/stanfordnlp/stanza/pull/578)

    Bugfixes

    • Fixed NER documentation for German to correctly point to the GermEval 2014 model for download. (https://github.com/stanfordnlp/stanza/commit/4ee9f12be5911bb600d2f162b1684cb4686c391e https://github.com/stanfordnlp/stanza/issues/559)

    • External tokenization library integration respects no_ssplit, so you can use external tokenizers without disrupting your preferred sentence segmentation, just like Stanza tokenizers. (https://github.com/stanfordnlp/stanza/issues/523 https://github.com/stanfordnlp/stanza/pull/556)

    • Telugu lemmatizer and tokenizer improvements. Telugu models are now set to use the identity lemmatizer by default, and the tokenizer has been retrained to separate sentence-final punctuation (https://github.com/stanfordnlp/stanza/issues/524 https://github.com/stanfordnlp/stanza/commit/ba0aec30e6e691155bc0226e4cdbb829cb3489df)

    • Spanish model would not tokenize foo,bar. Now fixed (https://github.com/stanfordnlp/stanza/issues/528 https://github.com/stanfordnlp/stanza/commit/123d5029303a04185c5574b76fbed27cb992cadd)

    • Arabic model would not tokenize asdf . Now fixed (https://github.com/stanfordnlp/stanza/issues/545 https://github.com/stanfordnlp/stanza/commit/03b7ceacf73870b2a15b46479677f4914ea48745)

    • Various tokenization models would split URLs and/or emails. Now URLs and emails are robustly handled with regexes. (https://github.com/stanfordnlp/stanza/issues/539 https://github.com/stanfordnlp/stanza/pull/588)

    • Various parser and POS models would deterministically label the final word "punct". Resolved via data augmentation (https://github.com/stanfordnlp/stanza/issues/471 https://github.com/stanfordnlp/stanza/issues/488 https://github.com/stanfordnlp/stanza/pull/491)

    • Norwegian tokenizers retrained to separate final punctuation. The fix is an upstream data fix (https://github.com/stanfordnlp/stanza/issues/305 https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal/pull/5)

    • Bugfix for CoNLL evaluation. Fixes an error in data conversion from Python Document objects to CoNLL format. (https://github.com/stanfordnlp/stanza/pull/484 https://github.com/stanfordnlp/stanza/issues/483, thanks @m0re4u)

    • Less randomness in sentiment results. Fixes prediction fluctuation in sentiment prediction. (https://github.com/stanfordnlp/stanza/issues/458 https://github.com/stanfordnlp/stanza/commit/274474c3b0e4155ab6e221146ac347ca433f81a6)

    • Bugfix which should make it easier to use in Jupyter / Colab. This fixes the issue where Jupyter notebooks (and by extension Colab) don't like it when you use sys.stderr as the stderr of popen (https://github.com/stanfordnlp/stanza/pull/434 https://github.com/stanfordnlp/stanza/issues/431)

    • Misc fixes for training, concurrency, and edge cases in basic Pipeline usage

      • Fix for mwt training (https://github.com/stanfordnlp/stanza/pull/446)
      • Fix for race condition in seq2seq models (https://github.com/stanfordnlp/stanza/pull/463 https://github.com/stanfordnlp/stanza/issues/462)
      • Fix for race condition in CRF (https://github.com/stanfordnlp/stanza/pull/566 https://github.com/stanfordnlp/stanza/issues/561)
      • Fix for empty text in pipeline (https://github.com/stanfordnlp/stanza/pull/475 https://github.com/stanfordnlp/stanza/issues/474)
      • Fix for resources not freed when downloading (https://github.com/stanfordnlp/stanza/issues/502 https://github.com/stanfordnlp/stanza/pull/503)
      • Fix for vietnamese pipeline not working (https://github.com/stanfordnlp/stanza/issues/531 https://github.com/stanfordnlp/stanza/pull/535)
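    The URL/email bugfix above relies on regex-based handling of special tokens. The following is a simplified, hypothetical illustration of the idea only; Stanza's actual patterns and tokenizer integration differ:

```python
import re

# Hypothetical patterns for illustration only.
URL_RE = r"https?://[^\s]+"
EMAIL_RE = r"[\w.+-]+@[\w-]+\.[\w.-]+"
SPECIAL = re.compile(f"({URL_RE}|{EMAIL_RE})")

def rough_tokenize(text):
    """Split on whitespace, but keep URL/email matches as single tokens."""
    tokens = []
    for chunk in text.split():
        if SPECIAL.fullmatch(chunk):
            tokens.append(chunk)  # never split inside URLs or emails
        else:
            # otherwise separate words from punctuation
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens

print(rough_tokenize("Email me@example.com or visit https://stanfordnlp.github.io"))
```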

    BREAKING CHANGES

    • Renamed stanza.models.tokenize -> stanza.models.tokenization (https://github.com/stanfordnlp/stanza/pull/452). This stops the tokenize directory from shadowing a built-in library.
  • v1.1.1(Aug 13, 2020)

    Overview

    This release features support for extending the capability of the Stanza pipeline with customized processors, a new sentiment analysis tool, improvements to the CoreNLPClient functionality, new models for a few languages (including Thai, which is supported for the first time in Stanza), new biomedical and clinical English packages, alternative servers for downloading resource files, and various improvements and bugfixes.

    New Features and Enhancements

    • New Sentiment Analysis Models for English, German, Chinese: The default Stanza pipelines for English, German and Chinese now include sentiment analysis models. The released models are based on a convolutional neural network architecture, and predict three-way sentiment labels (negative/neutral/positive). For more information and details on the datasets used to train these models and their performance, please visit the Stanza website.

    • New Biomedical and Clinical English Model Packages: Stanza now features syntactic analysis and named entity recognition functionality for English biomedical literature text and clinical notes. These newly introduced packages include 2 biomedical syntactic analysis pipelines, 8 biomedical NER models, 1 clinical syntactic pipeline, and 2 clinical NER models. For detailed information on how to download and use these pipelines, please visit Stanza's biomedical models page.

    • Support for Adding User Customized Processors via Python Decorators: Stanza now supports adding customized processors or processor variants (i.e., an alternative of existing processors) into existing pipelines. The name and implementation of the added customized processors or processor variants can be specified via @register_processor or @register_processor_variant decorators. See Stanza website for more information and examples (see custom Processors and Processor variants). (PR https://github.com/stanfordnlp/stanza/pull/322)

    • Support for Editable Properties For Data Objects: We have made it easier to extend the functionality of the Stanza neural pipeline by adding new annotations to Stanza's data objects (e.g., Document, Sentence, Token, etc). Aside from the annotation they already support, additional annotation can be easily attached through data_object.add_property(). See our documentation for more information and examples. (PR https://github.com/stanfordnlp/stanza/pull/323)

    • Support for Automated CoreNLP Installation and CoreNLP Model Download: CoreNLP can now be easily downloaded in Stanza with stanza.install_corenlp(dir='path/to/corenlp/installation'); CoreNLP models can now be downloaded with stanza.download_corenlp_models(model='english', version='4.1.0', dir='path/to/corenlp/installation'). For more details please see the Stanza website. (PR https://github.com/stanfordnlp/stanza/pull/363)

    • Japanese Pipeline Supports SudachiPy as External Tokenizer: You can now use the SudachiPy library as the tokenizer in a Stanza Japanese pipeline. Turn this on when building a pipeline with nlp = stanza.Pipeline('ja', processors={'tokenize': 'sudachipy'}). Note that this requires a separate installation of the SudachiPy library via pip. (PR https://github.com/stanfordnlp/stanza/pull/365)

    • New Alternative Server for Stable Download of Resource Files: Users in certain areas of the world that do not have stable access to GitHub servers can now download models from alternative Stanford server by specifying a new resources_url argument. For example, stanza.download(lang='en', resources_url='stanford') will now download the resource file and English pipeline from Stanford servers. (Issue https://github.com/stanfordnlp/stanza/issues/331, PR https://github.com/stanfordnlp/stanza/pull/356)

    • CoreNLPClient Supports New Multiprocessing-friendly Mechanism to Start the CoreNLP Server: The CoreNLPClient now supports new Enum values with better semantics for its start_server argument, giving finer-grained control over how the server is launched. This includes a new option, StartServer.TRY_START, which launches a CoreNLP server if one isn't already running, but doesn't fail if one has already been launched. This option makes it easier to use CoreNLPClient in a multiprocessing environment. Boolean values are still supported for backward compatibility, but we recommend StartServer.FORCE_START and StartServer.DONT_START for better readability. (PR https://github.com/stanfordnlp/stanza/pull/302)

    • New Semgrex Interface in CoreNLP Client for Dependency Parses of Arbitrary Languages: Stanford CoreNLP has a module which allows searches over dependency graphs using a regex-like language. Previously, this was only usable for languages for which CoreNLP already supported dependency trees. This release expands it to dependency graphs for any language. (Issue https://github.com/stanfordnlp/stanza/issues/399, PR https://github.com/stanfordnlp/stanza/pull/392)

    • New Tokenizer for Thai Language: The available UD data for Thai is quite small. The authors of pythainlp helped provide us two tokenization datasets, Orchid and Inter-BEST. Future work will include POS, NER, and Sentiment. (Issue https://github.com/stanfordnlp/stanza/issues/148)

    • Support for Serialization of Document Objects: Now you can serialize and deserialize the entire document by running serialized_string = doc.to_serialized() and doc = Document.from_serialized(serialized_string). The serialized string can be decoded into Python objects by running objs = pickle.loads(serialized_string). (Issue https://github.com/stanfordnlp/stanza/issues/361, PR https://github.com/stanfordnlp/stanza/pull/366)

    • Improved Tokenization Speed: Previously, the tokenizer was the slowest member of the neural pipeline, several times slower than any of the other processors. This release brings it in line with the others. The speedup is from improving the text processing before the data is passed to the GPU. (Relevant commits: https://github.com/stanfordnlp/stanza/commit/546ed13563c3530b414d64b5a815c0919ab0513a, https://github.com/stanfordnlp/stanza/commit/8e2076c6a0bc8890a54d9ed6931817b1536ae33c, https://github.com/stanfordnlp/stanza/commit/7f5be823a587c6d1bec63d47cd22818c838901e7, etc.)

    • User provided Ukrainian NER model: We now have a model built from the lang-uk NER dataset, provided by a user for redistribution.

    Breaking Interface Changes

    • Token.id is Tuple and Word.id is Integer: The id attribute for a token will now return a tuple of integers to represent the indices of the token (or a singleton tuple in the case of a single-word token), and the id for a word will now return an integer to represent the word index. Previously both attributes were encoded as strings and required manual conversion for downstream processing. This change brings more convenient handling of these attributes. (Issue: https://github.com/stanfordnlp/stanza/issues/211, PR: https://github.com/stanfordnlp/stanza/pull/357)

    • Changed Default Pipeline Packages for Several Languages for Improved Robustness: Languages that have changed default packages include: Polish (default is now PDB model, from previous LFG, https://github.com/stanfordnlp/stanza/issues/220), Korean (default is now GSD, from previous Kaist, https://github.com/stanfordnlp/stanza/issues/276), Lithuanian (default is now ALKSNIS, from previous HSE, https://github.com/stanfordnlp/stanza/issues/415).

    • CoreNLP 4.1.0 is required: CoreNLPClient requires CoreNLP 4.1.0 or a later version. The client expects recent modifications that were made to the CoreNLP server.

    • Properties Cache removed from CoreNLP client: The properties_cache has been removed from CoreNLPClient and the CoreNLPClient's annotate() method no longer has a properties_key argument. Python dictionaries with custom request properties should be directly supplied to annotate() via the properties argument.
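    The Token.id change above can be illustrated with a small migration helper for code that stored the old string ids. This helper is hypothetical and not part of Stanza; it simply shows the old-to-new mapping:

```python
def parse_token_id(old_id: str) -> tuple:
    """Convert a pre-1.1.1 string token id to the new tuple representation.

    "3"   -> (3,)    single-word token
    "3-4" -> (3, 4)  multi-word token spanning words 3 through 4
    """
    return tuple(int(part) for part in old_id.split("-"))

print(parse_token_id("3"), parse_token_id("3-4"))
```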

    Bugfixes and Other Improvements

    • Fixed Logging Behavior: This is mainly for fixing the issue that Stanza will override the global logging setting in Python and influence downstream logging behaviors. (Issue https://github.com/stanfordnlp/stanza/issues/278, PR https://github.com/stanfordnlp/stanza/pull/290)

    • Compatibility Fix for PyTorch v1.6.0: We've updated several processors to adapt to new API changes in PyTorch v1.6.0. (Issues https://github.com/stanfordnlp/stanza/issues/412 https://github.com/stanfordnlp/stanza/issues/417, PR https://github.com/stanfordnlp/stanza/pull/406)

    • Improved Batching for Long Sentences in Dependency Parser: This is mainly for fixing an issue where long sentences will cause an out of GPU memory issue in the dependency parser. (Issue https://github.com/stanfordnlp/stanza/issues/387)

    • Improved neural tokenizer robustness to whitespaces: the neural tokenizer is now more robust to the presence of multiple consecutive whitespace characters (PR https://github.com/stanfordnlp/stanza/pull/380)

    • Resolved properties issue when switching languages with requests to CoreNLP server: An issue with default properties has been resolved. Users can now switch between CoreNLP-supported languages and get the expected properties for each language by default.
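    The logging fix described above corresponds to a general pattern: a library should configure only its own named logger and leave the root logger to the application. A minimal illustration using Python's standard logging module (not Stanza's actual code):

```python
import logging

# Library-friendly pattern: touch only the library's own named logger.
lib_logger = logging.getLogger("stanza")
lib_logger.setLevel(logging.WARNING)

root_before = logging.getLogger().level
lib_logger.setLevel(logging.DEBUG)  # adjusting the library logger...
# ...leaves the root logger, and thus downstream applications, untouched
assert logging.getLogger().level == root_before
```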

  • v1.0.1(Apr 27, 2020)

    Overview

    This is a maintenance release of Stanza. It features new support for jieba as Chinese tokenizer, faster lemmatizer implementation, improved compatibility with CoreNLP v4.0.0, and many more!

    Enhancements

    • Supporting jieba library as Chinese tokenizer. The Stanza (simplified and traditional) Chinese pipelines now support using the jieba Chinese word segmentation library as tokenizer. Turn on this feature in a pipeline with nlp = stanza.Pipeline('zh', processors={'tokenize': 'jieba'}), or by specifying the argument tokenize_with_jieba=True.

    • Setting resource directory with environment variable. You can now override the default model location $HOME/stanza_resources by setting the environment variable STANZA_RESOURCES_DIR (https://github.com/stanfordnlp/stanza/issues/227). The new directory will then be used to store and look up model files. Thanks to @dhpollack for implementing this feature.

    • Faster lemmatizer implementation. The lemmatizer implementation has been improved to be about 3x faster on CPU and 5x faster on GPU (https://github.com/stanfordnlp/stanza/issues/249). Thanks to @mahdiman for identifying the original issue.

    • Improved compatibility with CoreNLP 4.0.0. The client is now fully compatible with the latest v4.0.0 release of the CoreNLP package.
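    The environment-variable override can be set from the shell (export STANZA_RESOURCES_DIR=/data/stanza) or from Python before any models are downloaded. A minimal sketch of the resolution logic as described above; the helper function is illustrative, only the variable name and default path come from the release notes:

```python
import os

def resources_dir() -> str:
    """Resolve the model directory: STANZA_RESOURCES_DIR wins, else ~/stanza_resources."""
    return os.environ.get(
        "STANZA_RESOURCES_DIR",
        os.path.join(os.path.expanduser("~"), "stanza_resources"),
    )

os.environ["STANZA_RESOURCES_DIR"] = "/data/stanza"
print(resources_dir())  # -> /data/stanza
```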

    Bugfixes

    • Correct character offsets in NER outputs from pre-tokenized text. We fixed an issue where the NER outputs from pre-tokenized text may be off-by-one (https://github.com/stanfordnlp/stanza/issues/229). Thanks to @RyanElliott10 for reporting the issue.

    • Correct Vietnamese tokenization on sentences beginning with punctuation. We fixed an issue where the Vietnamese tokenizer may throw an AssertionError on sentences that begin with punctuation (https://github.com/stanfordnlp/stanza/issues/217). Thanks to @aryamccarthy for reporting this issue.

    • Correct pytorch version requirement. Stanza now requires pytorch>=1.3.0 to avoid a runtime error raised by pytorch (https://github.com/stanfordnlp/stanza/issues/231). Thanks to @Vodkazy for reporting this.

    Known Model Issues & Solutions

    • Default Korean Kaist tokenizer failing on punctuation. The default Korean Kaist model is reported to have issues with separating punctuation during tokenization (https://github.com/stanfordnlp/stanza/issues/276). Switching to the Korean GSD model may solve this issue.

    • Default Polish LFG POS tagger incorrectly labeling last word in sentence as PUNCT. The default Polish model trained on the LFG treebank may incorrectly tag the last word in a sentence as PUNCT (https://github.com/stanfordnlp/stanza/issues/220). This issue may be solved by switching to the Polish PDB model.

  • v1.0.0(Mar 17, 2020)

    Overview

    This is the first major release of Stanza (previously known as StanfordNLP), a software package to process many human languages. The main features of this release are

    • Multi-lingual named entity recognition support. Stanza supports named entity recognition in 8 languages (and 12 datasets): Arabic, Chinese, Dutch, English, French, German, Russian, and Spanish. The most comprehensive NER model for each language is now part of that language's default model download, along with other models trained on the largest datasets available.
    • Accurate neural network models. Stanza features highly accurate data-driven neural network models for a wide collection of natural language processing tasks, including tokenization, sentence segmentation, part-of-speech tagging, morphological feature tagging, dependency parsing, and named entity recognition.
    • State-of-the-art pretrained models freely available. Stanza features a few hundred pretrained models for 60+ languages, all freely available and easily downloadable from native Python code. Most of these models achieve state-of-the-art (or competitive) performance on these tasks.
    • Expanded language support. Stanza now supports more than 60 human languages, representing a wide range of language families.
    • Easy-to-use native Python interface. We've improved the usability of the interface to maximize transparency. Now intermediate processing results are more easily viewed and accessed as native Python objects.
    • Anaconda support. Stanza now officially supports installation from Anaconda. You can install Stanza through the Stanford NLP Group's Anaconda channel: conda install -c stanfordnlp stanza.
    • Improved documentation. We have improved our documentation to include comprehensive coverage of the basic and advanced functionalities supported by Stanza.
    • Improved CoreNLP support in Python. We have improved the robustness and efficiency of the CoreNLPClient to access the Java CoreNLP software from Python code. It is also forward compatible with the next major release of CoreNLP.

    Enhancements and Bugfixes

    This release also contains many enhancements and bugfixes:

    • [Enhancement] Improved lemmatization support with proper conditioning on POS tags (#143). Thanks to @nljubesi for the report!
    • [Enhancement] Get the text corresponding to sentences in the document. Access it through sentence.text. (#80)
    • [Enhancement] Improved logging. Stanza now uses Python's logging for all procedural logging, which can be controlled globally either through logging_level or a verbose shortcut. See this page for more information. (#81)
    • [Enhancement] Allow the user to use the Stanza tokenizer with their own sentence split, which might be useful for applications like machine translation. Simply set tokenize_no_ssplit to True at pipeline instantiation. (#108)
    • [Enhancement] Support running the dependency parser only given tokenized, sentence segmented, and POS/morphological feature tagged data. Simply set depparse_pretagged to True at pipeline instantiation. (#141) Thanks @mrapacz for the contribution!
    • [Enhancement] Added spaCy as an option for tokenizing (and sentence segmenting) English text for efficiency. See this documentation page for a quick example.
    • [Enhancement] Add character offsets to tokens, sentences, and spans.
    • [Bugfix] Correctly decide whether to load pretrained embedding files given training flags. (#120)
    • [Bugfix] Google proto buffers reporting errors for long input when using the CoreNLPClient. (#154)
    • [Bugfix] Remove deprecation warnings from newer versions of PyTorch. (#162)
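    The character-offset enhancement above can be pictured as scanning the raw text with a moving cursor. A simplified illustration of the idea (not Stanza's implementation, which records offsets during tokenization):

```python
def char_offsets(text, tokens):
    """Return (start, end) character offsets for each token, in order."""
    offsets, cursor = [], 0
    for tok in tokens:
        start = text.index(tok, cursor)  # search from the current position
        end = start + len(tok)
        offsets.append((start, end))
        cursor = end
    return offsets

print(char_offsets("Stanza parses text.", ["Stanza", "parses", "text", "."]))
```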

    Breaking Changes

    Note that if your code was developed on a previous version of the package, there are potentially many breaking changes in this release. The most notable changes are in the Document objects, which contain all the annotations for the raw text or document fed into the Stanza pipeline. The underlying implementation of Document and all related data objects has broken away from using the CoNLL-U format as its internal representation, for more flexibility and efficiency in accessing their attributes, although it remains compatible with CoNLL-U to maintain ease of conversion between the two. Moreover, many properties have been renamed for clarity and sometimes aliased for ease of access. Please see our documentation page about these data objects for more information.

  • v0.2.0(May 16, 2019)

    This release features major improvements on memory efficiency and speed of the neural network pipeline in stanfordnlp and various bugfixes. These features include:

    • The downloadable pretrained neural network models are now substantially smaller in size (due to the use of smaller pretrained vocabularies) with comparable performance. Notably, the default English model is now ~9x smaller in size, German ~11x, French ~6x and Chinese ~4x. As a result, memory efficiency of the neural pipelines for most languages is substantially improved.

    • Substantial speedup of the neural lemmatizer via reduced neural sequence-to-sequence operations.

    • The neural network pipeline can now take in a Python list of strings representing pre-tokenized text. (https://github.com/stanfordnlp/stanfordnlp/issues/58)

    • A requirements checking framework is now added in the neural pipeline, ensuring the proper processors are specified for a given pipeline configuration. The pipeline will now raise an exception when a requirement is not satisfied. (https://github.com/stanfordnlp/stanfordnlp/issues/42)

    • Bugfix related to alignment between tokens and words post the multi-word expansion processor. (https://github.com/stanfordnlp/stanfordnlp/issues/71)

    • More options are added for customizing the Stanford CoreNLP server at start time, including specifying properties for the default pipeline, and setting all server options such as username/password. For more details on the different options, please check out the client documentation page.

    • CoreNLPClient instance can now be created with CoreNLP default language properties as:

    client = CoreNLPClient(properties='chinese')
    
    • Alternatively, a properties file can now be used during the creation of a CoreNLPClient:
    client = CoreNLPClient(properties='/path/to/corenlp.props')
    
    • All specified CoreNLP annotators are now preloaded by default when a CoreNLPClient instance is created. (https://github.com/stanfordnlp/stanfordnlp/issues/56)
  • v0.1.2(Feb 26, 2019)

    This is a maintenance release of stanfordnlp. This release features:

    • Allowing the tokenizer to treat the incoming document as pretokenized with space separated words in newline separated sentences. Set tokenize_pretokenized to True when building the pipeline to skip the neural tokenizer, and run all downstream components with your own tokenized text. (#24, #34)
    • Speedup in the POS/Feats tagger in evaluation (up to 2 orders of magnitude). (#18)
    • Various minor fixes and documentation improvements
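    In pretokenized mode, the input format is simply space-separated tokens with one sentence per line, as described above. A short sketch of preparing such input; the commented Pipeline call assumes stanfordnlp/stanza is installed and an English model has been downloaded:

```python
# One sentence per line, tokens separated by spaces.
sentences = [["This", "is", "a", "test", "."], ["Second", "sentence", "."]]
pretokenized = "\n".join(" ".join(toks) for toks in sentences)
print(pretokenized)

# With the library installed, downstream components run on exactly these tokens:
# import stanfordnlp
# nlp = stanfordnlp.Pipeline(tokenize_pretokenized=True)
# doc = nlp(pretokenized)
```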

    We would also like to thank the following community members for their contributions: @lwolfsonkin for code improvements and @0xflotus for documentation improvements. And thanks to everyone who raised issues and helped improve stanfordnlp!

  • v0.1.0(Jan 30, 2019)

    The initial release of StanfordNLP. StanfordNLP is the combination of the software package used by the Stanford team in the CoNLL 2018 Shared Task on Universal Dependency Parsing, and the group’s official Python interface to the Stanford CoreNLP software. This package is built with highly accurate neural network components that enable efficient training and evaluation with your own annotated data. The modules are built on top of PyTorch (v1.0.0).

    StanfordNLP features:

    • Native Python implementation requiring minimal effort to set up;
    • Full neural network pipeline for robust text analytics, including tokenization, multi-word token (MWT) expansion, lemmatization, part-of-speech (POS) and morphological feature tagging, and dependency parsing;
    • Pretrained neural models supporting 53 (human) languages featured in 73 treebanks;
    • A stable, officially maintained Python interface to CoreNLP.

8 Oct 13, 2022
A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to achieve the ultimate TTS.

A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings. This project grows with the research community, aiming to ach

Keon Lee 237 Jan 02, 2023
In this Notebook I've build some machine-learning and deep-learning to classify corona virus tweets, in both multi class classification and binary classification.

Hello, This Notebook Contains Example of Corona Virus Tweets Multi Class Classification. - Classes is: Extremely Positive, Positive, Extremely Negativ

Khaled Tofailieh 3 Dec 06, 2022
Idea is to build a model which will take keywords as inputs and generate sentences as outputs.

keytotext Idea is to build a model which will take keywords as inputs and generate sentences as outputs. Potential use case can include: Marketing Sea

Gagan Bhatia 364 Jan 03, 2023
Source code of paper "BP-Transformer: Modelling Long-Range Context via Binary Partitioning"

BP-Transformer This repo contains the code for our paper BP-Transformer: Modeling Long-Range Context via Binary Partition Zihao Ye, Qipeng Guo, Quan G

Zihao Ye 119 Nov 14, 2022
Text vectorization tool to outperform TFIDF for classification tasks

WHAT: Supervised text vectorization tool Textvec is a text vectorization tool, with the aim to implement all the "classic" text vectorization NLP meth

186 Dec 29, 2022
A simple word search made in python

Word Search Puzzle A simple word search made in python Usage $ python3 main.py -h usage: main.py [-h] [-c] [-f FILE] Generates a word s

Magoninho 16 Mar 10, 2022
🕹 An esoteric language designed so that the program looks like the transcript of a Pokémon battle

PokéBattle is an esoteric language designed so that the program looks like the transcript of a Pokémon battle. Original inspiration and specification

Eduardo Correia 9 Jan 11, 2022