Neural Network Models for Joint POS Tagging and Dependency Parsing

Implementations of joint models for POS tagging and dependency parsing, as described in my papers:

[1] Dat Quoc Nguyen and Karin Verspoor. 2018. An improved neural network model for joint POS tagging and dependency parsing. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 81-91. (jPTDP v2.0)
[2] Dat Quoc Nguyen, Mark Dras and Mark Johnson. 2017. A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 134-142. (jPTDP v1.0)

This GitHub project currently supports jPTDP v2.0, while v1.0 can be found in the release section. Please cite paper [1] when jPTDP is used to produce published results or is incorporated into other software. I would greatly appreciate your bug reports, comments and suggestions about jPTDP. As a free open-source implementation, jPTDP is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

Installation

jPTDP requires the following software packages:

Python 2.7

DyNet v2.0

$ virtualenv -p python2.7 .DyNet
$ source .DyNet/bin/activate
$ pip install cython numpy
$ pip install dynet==2.0.3

Once you have installed the prerequisite packages above, clone or download (and then unzip) jPTDP. The next sections show how to train a new joint model for POS tagging and dependency parsing, and then how to use a pre-trained model.

NOTE: jPTDP has also been ported to run with Python 3.4+ by Santiago Castro. The pre-trained models provided in the last section do not work with this ported version (see a discussion), so you may want to retrain jPTDP if you use it.

Train a joint model

Let SOURCE_DIR denote the source code directory. Like the files train.conllu and dev.conllu in folder SOURCE_DIR/sample, or the treebanks in the Universal Dependencies (UD) project, the training and development files must follow the 10-column data format. For training, jPTDP uses only columns 1 (ID), 2 (FORM), 4 (coarse-grained POS tags, UPOSTAG), 7 (HEAD) and 8 (DEPREL).
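To make the column selection concrete, here is a minimal sketch of reading the five columns listed above from 10-column lines. It is an illustration only, not jPTDP's own reader: the names Token and read_conllu are hypothetical, and the handling of comment lines, multiword-token ranges and underscore heads is an assumption based on the usual UD conventions.

```python
from collections import namedtuple

# Hypothetical container for the five columns jPTDP is described as using.
Token = namedtuple("Token", ["id", "form", "upos", "head", "deprel"])


def read_conllu(lines):
    """Yield one list of Token per sentence from an iterable of text lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Assumption: comment lines carry no token information.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError("expected 10 tab-separated columns: %r" % line)
        if not cols[0].isdigit():
            # Assumption: skip multiword-token ranges (1-2) and empty nodes (1.1).
            continue
        # HEAD may be "_" in unannotated input, so store None in that case.
        head = int(cols[6]) if cols[6].isdigit() else None
        # Columns 1, 2, 4, 7 and 8 are indices 0, 1, 3, 6 and 7.
        sentence.append(Token(int(cols[0]), cols[1], cols[3], head, cols[7]))
    if sentence:
        yield sentence


sample = [
    "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_",
    "2\tcat\tcat\tNOUN\tNN\t_\t3\tnsubj\t_\t_",
    "3\tsleeps\tsleep\tVERB\tVBZ\t_\t0\troot\t_\t_",
    "",
]
sentences = list(read_conllu(sample))
```

With a real file, the same function can be fed an open file object, for example read_conllu(open("sample/train.conllu")).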

To train a joint model for POS tagging and dependency parsing, run:

SOURCE_DIR$ python jPTDP.py --dynet-seed 123456789 [--dynet-mem <int>] [--epochs <int>] [--lstmdims <int>] [--lstmlayers <int>] [--hidden <int>] [--wembedding <int>] [--cembedding <int>] [--pembedding <int>] [--prevectors <path-to-pre-trained-word-embedding-file>] [--model <String>] [--params <String>] --outdir <path-to-output-directory> --train <path-to-train-file>  --dev <path-to-dev-file>

where hyper-parameters in [] are optional:

--dynet-mem: Specify DyNet memory in MB.
--epochs: Specify number of training epochs. Default value is 30.
--lstmdims: Specify number of BiLSTM dimensions. Default value is 128.
--lstmlayers: Specify number of BiLSTM layers. Default value is 2.
--hidden: Specify size of MLP hidden layer. Default value is 100.
--wembedding: Specify size of word embeddings. Default value is 100.
--cembedding: Specify size of character embeddings. Default value is 50.
--pembedding: Specify size of POS tag embeddings. Default value is 100.
--prevectors: Specify path to the pre-trained word embedding file for initialization. Default value is "None" (i.e. word embeddings are randomly initialized).
--model: Specify a name for model parameters file. Default value is "model".
--params: Specify a name for model hyper-parameters file. Default value is "model.params".
--outdir: Specify path to directory where the trained model will be saved.
--train: Specify path to the training data file.
--dev: Specify path to the development data file.

For example:

SOURCE_DIR$ python jPTDP.py --dynet-seed 123456789 --dynet-mem 1000 --epochs 30 --lstmdims 128 --lstmlayers 2 --hidden 100 --wembedding 100 --cembedding 50 --pembedding 100  --model trialmodel --params trialmodel.params --outdir sample/ --train sample/train.conllu --dev sample/dev.conllu

will produce model files trialmodel and trialmodel.params in folder SOURCE_DIR/sample.
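If you launch several training runs from a script, the command line above can be assembled programmatically. The sketch below is a hypothetical helper (build_train_command is not part of jPTDP) that only uses the flags documented in this section; it builds the argument list and does not start training itself.

```python
def build_train_command(train, dev, outdir, seed=123456789, **options):
    """Return the jPTDP training command as a list of arguments.

    Keyword options map to the optional flags documented above, with
    underscores turned into hyphens (dynet_mem=1000 -> --dynet-mem 1000).
    """
    cmd = ["python", "jPTDP.py", "--dynet-seed", str(seed)]
    # Sort the option names so the resulting command is reproducible.
    for name in sorted(options):
        cmd += ["--" + name.replace("_", "-"), str(options[name])]
    cmd += ["--outdir", outdir, "--train", train, "--dev", dev]
    return cmd


cmd = build_train_command(
    "sample/train.conllu",
    "sample/dev.conllu",
    "sample/",
    dynet_mem=1000,
    epochs=30,
    model="trialmodel",
    params="trialmodel.params",
)
print(" ".join(cmd))
```

The resulting list could then be passed to subprocess.call(cmd, cwd=SOURCE_DIR), where SOURCE_DIR is the path to your jPTDP checkout.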

If you would like to use the fine-grained language-specific POS tags in the 5th column instead of the coarse-grained POS tags in the 4th column, you should use swapper.py in folder SOURCE_DIR/utils to swap contents in the 4th and 5th columns:

SOURCE_DIR$ python utils/swapper.py <path-to-train-(and-dev)-file>

For example:

SOURCE_DIR$ python utils/swapper.py sample/train.conllu
SOURCE_DIR$ python utils/swapper.py sample/dev.conllu

will generate two new files for training: train.conllu.ux2xu and dev.conllu.ux2xu in folder SOURCE_DIR/sample.
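The column swap itself is simple. The following is a minimal sketch of the idea, not the actual code of swapper.py: the function name swap_pos_columns is hypothetical, and leaving blank lines and comment lines untouched is an assumption.

```python
def swap_pos_columns(line):
    """Swap the 4th (UPOSTAG) and 5th (XPOSTAG) columns of one 10-column line."""
    line = line.rstrip("\n")
    cols = line.split("\t")
    # Assumption: blank lines, comments and malformed lines pass through as-is.
    if line.startswith("#") or len(cols) != 10:
        return line
    cols[3], cols[4] = cols[4], cols[3]
    return "\t".join(cols)


original = "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_"
swapped = swap_pos_columns(original)
```

Applying the function twice returns the original line, which makes it easy to check that no other column was modified.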

Utilize a pre-trained model

Suppose you want to use a pre-trained model to annotate a corpus in which each line is a tokenized/word-segmented sentence. Use converter.py in folder SOURCE_DIR/utils to convert this corpus into the 10-column data format:

SOURCE_DIR$ python utils/converter.py <file-path>

For example:

SOURCE_DIR$ python utils/converter.py sample/test

will generate in folder SOURCE_DIR/sample a file named test.conllu which can be used later as input to the pre-trained model.
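As a rough picture of what such a conversion involves, the sketch below turns one tokenized sentence into 10-column rows. It is not the code of converter.py: the function name sentence_to_conllu is hypothetical, and filling every column other than ID and FORM with an underscore is an assumption, so the real converter may write different placeholder values. Use converter.py itself to prepare actual input.

```python
def sentence_to_conllu(sentence):
    """Convert one whitespace-tokenized sentence into 10-column rows."""
    rows = []
    for index, word in enumerate(sentence.split(), start=1):
        # ID and FORM are known; the other eight columns are placeholders
        # (assumption: "_" is used for every unknown field).
        rows.append("\t".join([str(index), word] + ["_"] * 8))
    # Sentences in the 10-column format are separated by a blank line.
    return "\n".join(rows) + "\n"


block = sentence_to_conllu("The cat sleeps")
rows = block.splitlines()
```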

To use a pre-trained model for POS tagging and dependency parsing, run:

SOURCE_DIR$ python jPTDP.py --predict --model <path-to-model-parameters-file> --params <path-to-model-hyper-parameters-file> --test <path-to-10-column-input-file> --outdir <path-to-output-directory> --output <String>

--model: Specify path to model parameters file.
--params: Specify path to model hyper-parameters file.
--test: Specify path to 10-column input file.
--outdir: Specify path to directory where output file will be saved.
--output: Specify name of the output file.

For example:

SOURCE_DIR$ python jPTDP.py --predict --model sample/trialmodel --params sample/trialmodel.params --test sample/test.conllu --outdir sample/ --output test.conllu.pred
SOURCE_DIR$ python jPTDP.py --predict --model sample/trialmodel --params sample/trialmodel.params --test sample/dev.conllu --outdir sample/ --output dev.conllu.pred

will produce output files test.conllu.pred and dev.conllu.pred in folder SOURCE_DIR/sample.
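When gold annotations are available (as for dev.conllu), the predicted file can be compared against them. The sketch below computes POS accuracy, UAS and LAS over all tokens. It is an illustrative assumption rather than the evaluation used for the published results: the function names are hypothetical, punctuation is not excluded, and both inputs are assumed to contain the same tokens in the same order, so the numbers may differ from those of an official evaluation script.

```python
def read_rows(lines):
    """Collect (UPOSTAG, HEAD, DEPREL) for every token line."""
    rows = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Keep only regular token lines with 10 columns and an integer ID.
        if len(cols) == 10 and cols[0].isdigit():
            rows.append((cols[3], cols[6], cols[7]))
    return rows


def score(gold_lines, pred_lines):
    """Return POS accuracy, UAS and LAS as percentages."""
    gold = read_rows(gold_lines)
    pred = read_rows(pred_lines)
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and predicted files must align token by token")
    total = float(len(gold))
    pairs = list(zip(gold, pred))
    pos = sum(1 for g, p in pairs if g[0] == p[0])
    # UAS: the head is correct. LAS: head and dependency label are correct.
    uas = sum(1 for g, p in pairs if g[1] == p[1])
    las = sum(1 for g, p in pairs if g[1] == p[1] and g[2] == p[2])
    return {"POS": 100 * pos / total, "UAS": 100 * uas / total, "LAS": 100 * las / total}


gold = ["1\tcats\tcat\tNOUN\tNNS\t_\t2\tnsubj\t_\t_",
        "2\tsleep\tsleep\tVERB\tVBP\t_\t0\troot\t_\t_"]
pred = ["1\tcats\tcat\tNOUN\tNNS\t_\t2\tobj\t_\t_",
        "2\tsleep\tsleep\tVERB\tVBP\t_\t0\troot\t_\t_"]
result = score(gold, pred)
```

For real files, pass open file objects, for example score(open("sample/dev.conllu"), open("sample/dev.conllu.pred")).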

Pre-trained models

Pre-trained jPTDP v2.0 models, trained on the English WSJ Penn treebank, GENIA and UD v2.2 treebanks, can be found HERE. Results on the test sets (as detailed in paper [1]) are as follows:

Treebank	Model name	POS	UAS	LAS
English WSJ Penn treebank	model256	97.97	94.51	92.87
English WSJ Penn treebank	model	97.88	94.25	92.58

model256 and model denote the pre-trained models which use 256- and 128-dimensional LSTM hidden states, respectively, i.e. model256 is more accurate but slower.

Treebank	Code	UPOS	UAS	LAS
UD_Afrikaans-AfriBooms	af_afribooms	95.73	82.57	78.89
UD_Ancient_Greek-PROIEL	grc_proiel	96.05	77.57	72.84
UD_Ancient_Greek-Perseus	grc_perseus	88.95	65.09	58.35
UD_Arabic-PADT	ar_padt	96.33	86.08	80.97
UD_Basque-BDT	eu_bdt	93.62	79.86	75.07
UD_Bulgarian-BTB	bg_btb	98.07	91.47	87.69
UD_Catalan-AnCora	ca_ancora	98.46	90.78	88.40
UD_Chinese-GSD	zh_gsd	93.26	82.50	77.51
UD_Croatian-SET	hr_set	97.42	88.74	83.62
UD_Czech-CAC	cs_cac	98.87	89.85	87.13
UD_Czech-FicTree	cs_fictree	97.98	88.94	85.64
UD_Czech-PDT	cs_pdt	98.74	89.64	87.04
UD_Czech-PUD	cs_pud	96.71	87.62	82.28
UD_Danish-DDT	da_ddt	96.18	82.17	78.88
UD_Dutch-Alpino	nl_alpino	95.62	86.34	82.37
UD_Dutch-LassySmall	nl_lassysmall	95.21	86.46	82.14
UD_English-EWT	en_ewt	95.48	87.55	84.71
UD_English-GUM	en_gum	94.10	84.88	80.45
UD_English-LinES	en_lines	95.55	80.34	75.40
UD_English-PUD	en_pud	95.25	87.49	84.25
UD_Estonian-EDT	et_edt	96.87	85.45	82.13
UD_Finnish-FTB	fi_ftb	94.53	86.10	82.45
UD_Finnish-PUD	fi_pud	96.44	87.54	84.60
UD_Finnish-TDT	fi_tdt	96.12	86.07	82.92
UD_French-GSD	fr_gsd	97.11	89.45	86.43
UD_French-Sequoia	fr_sequoia	97.92	89.71	87.43
UD_French-Spoken	fr_spoken	94.25	79.80	73.45
UD_Galician-CTG	gl_ctg	97.12	85.09	81.93
UD_Galician-TreeGal	gl_treegal	93.66	77.71	71.63
UD_German-GSD	de_gsd	94.07	81.45	76.68
UD_Gothic-PROIEL	got_proiel	93.45	79.80	71.85
UD_Greek-GDT	el_gdt	96.59	87.52	84.64
UD_Hebrew-HTB	he_htb	96.24	87.65	82.64
UD_Hindi-HDTB	hi_hdtb	96.94	93.25	89.83
UD_Hungarian-Szeged	hu_szeged	92.07	76.18	69.75
UD_Indonesian-GSD	id_gsd	93.29	84.64	77.71
UD_Irish-IDT	ga_idt	89.74	75.72	65.78
UD_Italian-ISDT	it_isdt	98.01	92.33	90.20
UD_Italian-PoSTWITA	it_postwita	95.41	84.20	79.11
UD_Japanese-GSD	ja_gsd	97.27	94.21	92.02
UD_Japanese-Modern	ja_modern	70.53	66.88	49.51
UD_Korean-GSD	ko_gsd	93.35	81.32	76.58
UD_Korean-Kaist	ko_kaist	93.53	83.59	80.74
UD_Latin-ITTB	la_ittb	98.12	82.99	79.96
UD_Latin-PROIEL	la_proiel	95.54	74.95	69.76
UD_Latin-Perseus	la_perseus	82.36	57.21	46.28
UD_Latvian-LVTB	lv_lvtb	93.53	81.06	76.13
UD_North_Sami-Giella	sme_giella	87.48	65.79	58.09
UD_Norwegian-Bokmaal	no_bokmaal	97.73	89.83	87.57
UD_Norwegian-Nynorsk	no_nynorsk	97.33	89.73	87.29
UD_Norwegian-NynorskLIA	no_nynorsklia	85.22	64.14	54.31
UD_Old_Church_Slavonic-PROIEL	cu_proiel	93.69	80.59	73.93
UD_Old_French-SRCMF	fro_srcmf	95.12	86.65	81.15
UD_Persian-Seraji	fa_seraji	96.66	88.07	84.07
UD_Polish-LFG	pl_lfg	98.22	95.29	93.10
UD_Polish-SZ	pl_sz	97.05	90.98	87.66
UD_Portuguese-Bosque	pt_bosque	96.76	88.67	85.71
UD_Romanian-RRT	ro_rrt	97.43	88.74	83.54
UD_Russian-SynTagRus	ru_syntagrus	98.51	91.00	88.91
UD_Russian-Taiga	ru_taiga	85.49	65.52	56.33
UD_Serbian-SET	sr_set	97.40	89.32	85.03
UD_Slovak-SNK	sk_snk	95.18	85.88	81.89
UD_Slovenian-SSJ	sl_ssj	97.79	88.26	86.10
UD_Slovenian-SST	sl_sst	89.50	66.14	58.13
UD_Spanish-AnCora	es_ancora	98.57	90.30	87.98
UD_Swedish-LinES	sv_lines	95.51	83.60	78.97
UD_Swedish-PUD	sv_pud	92.10	79.53	74.53
UD_Swedish-Talbanken	sv_talbanken	96.55	86.53	83.01
UD_Turkish-IMST	tr_imst	92.93	70.53	62.55
UD_Ukrainian-IU	uk_iu	95.24	83.47	79.38
UD_Urdu-UDTB	ur_udtb	93.35	86.74	80.44
UD_Uyghur-UDT	ug_udt	87.63	76.14	63.37
UD_Vietnamese-VTB	vi_vtb	87.63	67.72	58.27

Comments

Low POS in WSJ

Hi, I tested on the WSJ dataset with model256 and only got an accuracy of about 95.5%. How can I get the 97.97 accuracy reported in the paper? I used the parameters set in the code and made no changes.

opened by ava-YangL

learner.py Word dropout

Seems in lines 252-259 of learner.py, you still consider the character embeddings while the word is potentially dropped. Not sure if this makes sense.

opened by TheElephantInTheRoom

Named Entity Recognition tool ?!

Greetings, Sir. This is great work and a very powerful POS tool. I wanted to ask whether you have developed a "named entity recognition" or, as they name it, "chunking" tool to go with this POS tool. I need it in my experiments.
Thanks in advance.

opened by Raki22

Low UAS and LAS scores

I have tried using your parser to test with EWT English treebank, and surprisingly UAS and LAS scores are low, around 87.50 and 84.53. I have used conll2017 shared task pretrained word embeddings. Do you think this is normal or am I doing something wrong?

opened by Eugen2525

trainer.update

The trainer.update here doesn't make sense.

This was trainer.update_epoch() in the original code base of bist-parser, but since the port from DyNet v1.1 to DyNet v2, the update_epoch function is deprecated. The purpose of calling update_epoch was to update the learning_rate, which, as far as I know, does not happen when calling trainer.update.

opened by TheElephantInTheRoom


Releases(v1.0)

v1.0(Feb 28, 2018)

Owner

Dat Quoc Nguyen
