PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English

PASTRIE

CC BY-SA 4.0

Official release of the corpus described in the paper:

Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, and Nathan Schneider (2020). PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English. Proceedings of the 14th Linguistic Annotation Workshop.


Overview

PASTRIE is a corpus of English data from Reddit annotated with preposition supersenses from the SNACS inventory.

While the data in PASTRIE is in English, it was produced by presumed speakers of four L1s:

  • English
  • French
  • German
  • Spanish

For details on how L1s were identified, see section 3.1 of Rabinovich et al. (2018).

Annotation Example

Below is an example sentence from the corpus, where annotation targets are bolded and preposition supersenses are annotated with the notation SceneRole↝Function. Together, a scene role and function are known as a construal.


Data Formats

PASTRIE is released in the following formats. We expect that most projects will be best served by one of the JSON formats.

  • .conllulex: the 19-column CoNLL-U-Lex format originally used for STREUSLE.
  • .json: a JSON representation of the CoNLL-U-Lex that does not require a CoNLL-U-Lex parser.
  • .govobj.json: an extended version of the JSON representation that contains information about each preposition's syntactic parent and object.
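
As a rough sketch of working with the JSON format, the snippet below tallies construals over single-word expressions. It assumes the STREUSLE-style layout, in which each sentence is an object with a `swes` mapping whose entries carry `lexlemma`, `lexcat`, `ss`, `ss2`, and `toknums`. The miniature sample and the helper name `count_construals` are illustrative and not part of the release.

```python
import json
from collections import Counter

def count_construals(sentences):
    """Count (scene role, function) pairs over single-word expressions."""
    counts = Counter()
    for sent in sentences:
        # Assumption: "swes" maps a token offset (as a string) to an expression record.
        for swe in sent.get("swes", {}).values():
            if swe.get("ss") is not None:
                counts[(swe["ss"], swe["ss2"])] += 1
    return counts

# Hypothetical miniature input, shaped like the assumed JSON layout.
sample = json.loads("""
[{"sent_id": "example-01",
  "swes": {
    "1":  {"lexlemma": "my", "lexcat": "PRON.POSS",
           "ss": "p.SocialRel", "ss2": "p.Gestalt", "toknums": [1]},
    "2":  {"lexlemma": "grandma", "lexcat": "N",
           "ss": null, "ss2": null, "toknums": [2]},
    "10": {"lexlemma": "over", "lexcat": "P",
           "ss": "p.Locus", "ss2": "p.Locus", "toknums": [10]}
  }}]
""")

construals = count_construals(sample)
for (scene_role, function), n in construals.most_common():
    print(scene_role, function, n)
```

For a real file, the sample would be replaced by `json.load` on the released .json file.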

PASTRIE mostly follows STREUSLE with respect to the data format and SNACS annotation practice. Primary differences in the annotations are:

  • Lemmas, part-of-speech tags, and syntactic dependencies aim to follow the UD standard in both cases. They are gold in STREUSLE, versus automatic with some manual corrections in PASTRIE.
    • PASTRIE does not group together base+clitic combinations, whereas STREUSLE does (multiword tokens—where a single orthographic word contains multiple syntactic words).
    • PASTRIE does not regularly specify SpaceAfter=No to indicate alignment between the tokens and the raw text.
    • In PASTRIE, the raw text string accompanying the sentence may contain two or more consecutive spaces.
    • PASTRIE lacks enhanced dependencies.
  • Multiword expression annotations in PASTRIE are limited to expressions containing a preposition. Depending on the syntactic head, the expression may or may not have a SNACS supersense.
    • Verbal multiword expressions in PASTRIE are not subtyped in the lexcat; they all have a lexcat of V.
  • Noun and verb expressions in PASTRIE do not have supersense labels.
Comments
  • Misc. annotation errors and/or conversion script bugs

    There are some annotations which I'm fairly sure are incorrect and are choking up the JSON conversion script. (These errors occur using the unmodified versions of all scripts taken straight from STREUSLE.) One or two might also be indicative of a bug in the conllulex2json.py file.

    1. vs mistagged as a noun--should be prep

    AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

    2. ditto

    AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

    3. Script complains about "to" in this snippet at ID=23. Not immediately clear to me what the issue is--perhaps that "to" is labeled ADP/IN? For its xpos I think it ought to be TO, not sure about its upos. Snippet:
    13      shit    shit    NOUN    NN      _       16      obl:npmod       _       _       _       _       _       _       _       _       _       _       _
    14      this    this    PRON    DT      _       16      nsubj   _       _       _       _       _       _       _       _       _       _       _
    15      can     can     AUX     MD      _       16      aux     _       _       _       _       _       _       _       _       _       _       _
    16      end     end     VERB    VB      _       4       parataxis       _       _       _       _       _       _       _       _       _       _       _
    17      right   right   ADV     RB      _       18      advmod  _       _       _       _       _       _       _       _       _       _       _
    18      now     now     ADV     RB      _       16      advmod  _       _       _       _       _       _       _       _       _       _       _
    19      if      if      SCONJ   IN      _       21      mark    _       _       _       _       _       _       _       _       _       _       _
    20      I       I       PRON    PRP     _       21      nsubj   _       _       _       _       _       _       _       _       _       _       _
    21      want    want    VERB    VBP     _       16      advcl   _       _       _       _       _       _       _       _       _       _       _
    22      it      it      PRON    PRP     _       21      obj     _       _       _       _       _       _       _       _       _       _       _
    23      to      to      ADP     IN      _       21      obl     _       _       _       _       _       `i      `i      _       _       _       _
    24      .       .       PUNCT   .       _       4       punct   _       _       _       _       _       _       _       _       _       _       _
    

    Error:

    AssertionError: ('french-a17a4340-f9c0-8fef-fa1b-1bf13879399b-02', {'lexlemma': 'to', 'lexcat': 'INF', 'ss': 'i', 'ss2': 'i', 'toknums': [23]}, {'#': 23, 'word': 'to', 'lemma': 'to', 'upos': 'ADP', 'xpos': 'IN', 'feats': None, 'head': 21, 'deprel': 'obl', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-INF-`i'})

    Relevant span of code:

                if validate_pos and upos!=lc and (upos,lc) not in {('NOUN','N'),('PROPN','N'),('VERB','V'),
                    ('ADP','P'),('ADV','P'),('SCONJ','P'),
                    ('ADP','DISC'),('ADV','DISC'),('SCONJ','DISC'),
                    ('PART','POSS')}:
                    # most often, the single-word lexcat should match its upos
                    # check a list of exceptions
                    mismatchOK = False
                    if xpos=='TO' and lc.startswith('INF'):
                        mismatchOK = True
                    elif (xpos=='TO')!=lc.startswith('INF'):
                        assert upos in ['SCONJ', "ADP"] and swe['lexlemma']=='for',(sent['sent_id'],swe,tok)
                        mismatchOK = True
    
    4. Originator as function:

    (in french-c02823ec-60bd-adce-7327-01337eb9d1c8-02) AssertionError: ('p.Originator should never be function', {'lexlemma': 'you', 'lexcat': 'PRON.POSS', 'ss': 'p.Originator', 'ss2': 'p.Originator', 'toknums': [1]})

    5. lexcat DISC with ADJ:

    AssertionError: In spanish-a25e8289-e04a-f5af-ce56-ead9faca65b1-02, single-word expression 'like' has lexcat DISC, which is incompatible with its upos ADJ

    6. "her" tagged with Possessor is incorrectly parsed as iobj and tagged as PRP instead of PRP$. Relevant snippet:
    1       My      my      PRON    PRP$    _       2       nmod:poss       _       _       _       _       _       SocialRel       Gestalt _       _       _       _
    2       grandma grandma NOUN    NN      _       3       nsubj   _       _       _       _       _       _       _       _       _       _       _
    3       had     have    VERB    VBD     _       0       root    _       _       _       _       _       _       _       _       _       _       _
    4       her     she     PRON    PRP     _       3       iobj    _       _       _       _       _       Possessor       Possessor       _       _       _       _
    5       super   super   ADV     RB      _       6       advmod  _       _       _       _       _       _       _       _       _       _       _
    6       thick   thick   ADJ     JJ      _       8       amod    _       _       _       _       _       _       _       _       _       _       _
    7       floor   floor   NOUN    NN      _       8       compound        _       _       _       _       _       _       _       _       _       _       _
    8       mats    mat     NOUN    NNS     _       3       obj     _       _       _       _       _       _       _       _       _       _       _
    9       *       *       PUNCT   NFP     _       8       punct   _       _       _       _       _       _       _       _       _       _       _
    10      over    over    ADP     IN      _       13      case    _       _       _       _       _       Locus   Locus   _       _       _       _
    11      *       *       PUNCT   NFP     _       13      punct   _       _       _       _       _       _       _       _       _       _       _
    12      the     the     DET     DT      _       13      det     _       _       _       _       _       _       _       _       _       _       _
    13      accelerator     accelerator     NOUN    NN      _       3       obl     _       _       _       _       _       _       _       _       _       _       _
    14      ,       ,       PUNCT   ,       _       3       punct   _       _       _       _       _       _       _       _       _       _       _
    

    Error:

    AssertionError: In spanish-ebba3c73-2431-c216-8f4d-d469ee8d5564-01, single-word expression 'her' has lexcat P, which is incompatible with its upos PRON

    7. "NA" is misannotated--this is NA as in North America, i.e. a PROPN/NP, but it's lemmatized as "no", and its tags are weird.

    AssertionError: ('german-35000895-1d78-c18a-01ed-f7410b9c0581-01', {'lexlemma': 'no', 'lexcat': 'ADV', 'ss': None, 'ss2': None, 'toknums': [5]}, {'#': 5, 'word': 'NA', 'lemma': 'no', 'upos': 'PART', 'xpos': 'TO', 'feats': None, 'head': 6, 'deprel': 'mark', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-ADV'})
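
To make the failure on the token "to" (ID 23) reported above concrete, here is a simplified restatement of the two branches quoted from the conversion script. The function name `lexcat_mismatch_ok` is hypothetical, and where the script raises an AssertionError this sketch returns False instead. It covers only the quoted span, not the full validation logic.

```python
def lexcat_mismatch_ok(upos, xpos, lexcat, lexlemma):
    """Simplified restatement of the two quoted branches (not the full script)."""
    if xpos == "TO" and lexcat.startswith("INF"):
        # Infinitive marker tagged TO: mismatch with upos is accepted.
        return True
    if (xpos == "TO") != lexcat.startswith("INF"):
        # Exactly one of the two holds: only infinitival "for" tagged
        # SCONJ or ADP passes. The original script asserts here.
        return upos in ("SCONJ", "ADP") and lexlemma == "for"
    return None  # other exceptions lie outside the quoted span

# Token 23 as annotated: upos ADP, xpos IN, lexcat INF, lexlemma "to".
as_annotated = lexcat_mismatch_ok("ADP", "IN", "INF", "to")
# The same token if its xpos were TO, as suggested in the report.
with_to_xpos = lexcat_mismatch_ok("ADP", "TO", "INF", "to")
print(as_annotated, with_to_xpos)
```

Under these assumptions, the annotated token fails because its xpos is IN while its lexcat is INF and its lemma is not "for"; changing the xpos to TO would take the first branch.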

    opened by lgessler 6
  • Prepositional supersense annotations on non-preposition targets

    Is it OK for a verb-headed SMWE to have a prepositional supersense? The validator complains about it. Offending SMWE:

    21	give	give	VERB	VB	_	10	conj	_	_	2:1	_	give up on	p.Theme	p.Theme	_	_	_	_
    22	up	up	ADP	RP	_	21	compound:prt	_	_	2:2	_	_	_	_	_	_	_	_
    23	on	on	ADP	IN	_	24	case	_	_	2:3	_	_	_	_	_	_	_	_
    
    opened by lgessler 5
  • Prepositions unannotated for supersense

    Token 6:

    # sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02
    # text = Once you have the hang of it it s pretty fast ( and does n't eat your clutch ) .
    1	Once	once	SCONJ	IN	_	3	mark	_	_	_	_	_	_	_	_	_	_	_
    2	you	you	PRON	PRP	_	3	nsubj	_	_	_	_	_	_	_	_	_	_	_
    3	have	have	VERB	VBP	_	11	advcl	_	_	_	_	_	_	_	_	_	_	_
    4	the	the	DET	DT	_	5	det	_	_	_	_	_	_	_	_	_	_	_
    5	hang	hang	NOUN	NN	_	3	obj	_	_	_	_	_	_	_	_	_	_	_
    6	of	of	ADP	IN	_	7	case	_	_	_	_	_	_	_	_	_	_	_
    7	it	it	PRON	PRP	_	5	nmod	_	_	_	_	_	_	_	_	_	_	_
    8	it	it	PRON	PRP	_	11	nsubj	_	_	_	_	_	_	_	_	_	_	_
    9	s	be	AUX	VBZ	_	11	cop	_	_	_	_	_	_	_	_	_	_	_
    10	pretty	pretty	ADV	RB	_	11	advmod	_	_	_	_	_	_	_	_	_	_	_
    11	fast	fast	ADJ	JJ	_	0	root	_	_	_	_	_	_	_	_	_	_	_
    12	(	(	PUNCT	-LRB-	_	16	punct	_	_	_	_	_	_	_	_	_	_	_
    13	and	and	CCONJ	CC	_	16	cc	_	_	_	_	_	_	_	_	_	_	_
    14	does	do	AUX	VBZ	_	16	aux	_	_	_	_	_	_	_	_	_	_	_
    15	n't	not	PART	RB	_	16	advmod	_	_	_	_	_	_	_	_	_	_	_
    16	eat	eat	VERB	VB	_	11	conj	_	_	_	_	_	_	_	_	_	_	_
    17	your	you	PRON	PRP$	_	18	nmod:poss	_	_	_	_	_	Possessor	Possessor	_	_	_	_
    18	clutch	clutch	NOUN	NN	_	16	obj	_	_	_	_	_	_	_	_	_	_	_
    19	)	)	PUNCT	-RRB-	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
    20	.	.	PUNCT	.	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
    

    I assumed that all preps were supposed to be annotated, but perhaps not?
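
A check like this could be scripted. The sketch below flags ADP tokens with an empty supersense column in tab-separated .conllulex rows. Column positions are read off the rows shown above (1-based: UPOS is column 4, the strong MWE column is 11, the supersense is column 14). The helper names `make_row` and `unannotated_adpositions` are hypothetical.

```python
def unannotated_adpositions(conllulex_lines):
    """Yield (token id, word) for ADP tokens with no supersense and no MWE membership."""
    for line in conllulex_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 19:
            continue  # not a 19-column token row
        upos, smwe, ss = cols[3], cols[10], cols[13]
        # Tokens inside a strong MWE carry "_" too (the label sits on the
        # first token of the expression), so they are skipped here.
        if upos == "ADP" and smwe == "_" and ss == "_":
            yield cols[0], cols[1]

def make_row(tok_id, word, lemma, upos, xpos, head, deprel, smwe="_", ss="_"):
    """Build a 19-column row for demonstration purposes."""
    return "\t".join([tok_id, word, lemma, upos, xpos, "_", head, deprel,
                      "_", "_", smwe, "_", "_", ss, ss, "_", "_", "_", "_"])

rows = [
    "# sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02",
    make_row("6", "of", "of", "ADP", "IN", "7", "case"),
    make_row("17", "your", "you", "PRON", "PRP$", "18", "nmod:poss", ss="Possessor"),
    make_row("22", "up", "up", "ADP", "RP", "21", "compound:prt", smwe="2:2"),
]
flagged = list(unannotated_adpositions(rows))
print(flagged)
```

Whether a flagged token is a genuine omission still needs a manual look, since some ADP-tagged tokens may be intentionally left unlabeled.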

    opened by lgessler 3
  • Apostrophes removed in preprocessing?

    Looking through the data, there are a LOT of sentences where clitics are tokenized off but lack an apostrophe. Is that just the genre or did they get lost in preprocessing?

    opened by nschneid 2
  • Dataset requested

    Hi all,

    I would like to request the PASTRIE dataset accompanying the paper "PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English".

    Thanks for your reply.

    opened by fj-morales 2
  • SNACS supersense tags should start with "p."

    For compatibility with STREUSLE, it should be p.Locus, p.Theme, etc.

    Special labels like `i `d `c `$ ?? should not start with p.. In fact, the backtick labels from annotation are not represented as such in STREUSLE—they are reflected in the LEXCAT column of the data.
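
A minimal sketch of the renaming described here, assuming the special labels are exactly the five listed above. The function name `add_snacs_prefix` is hypothetical and this is not the project's actual conversion code.

```python
# Special labels named in the text; these must not receive the "p." prefix.
SPECIAL_LABELS = {"`i", "`d", "`c", "`$", "??"}

def add_snacs_prefix(label):
    """Return the label with a "p." prefix unless it is special, empty, or already prefixed."""
    if label is None or label in SPECIAL_LABELS or label.startswith("p."):
        return label
    return "p." + label

converted = [add_snacs_prefix(x) for x in ["Locus", "Theme", "`i", "??", "p.Gestalt", None]]
print(converted)
```

Moving the backtick labels into the LEXCAT column, as STREUSLE does, would be a separate step not shown here.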

    opened by nschneid 0
  • Questionable adpositional MWEs

    • in_male_term — from "in male terms"; should be in_term (at most)
    • in_the_first_place
    • in_my_hand — from "in my hands"; should be in_hand (at most)
    • for_quite_some_time — just Duration for, weak MWE?
    • at_all_time — from what should have been "at all times". OK?
    • on_a_smaller_scale — omit adjective?
    • withouth — typo
    • see_as — "seeing as" (deverbal MWE acting like a preposition)
    opened by nschneid 0
  • Some undersegmentation of sentences

    Despite manual editing there are still places where a long sentence ought to be split up (esp. where it consists of a blockquoted sentence with > followed by a response). Looking for multiple consecutive spaces in the raw text uncovers some of these (as well as some discourse appendages like emoticons, which should probably remain in the same UD sentence).

    It would be nice to write a script to help clean these up—the tricky part is updating offsets in each parse.
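
As a starting point, the search for multiple consecutive spaces could look like the sketch below. It assumes the `# sent_id = ` and `# text = ` comment lines seen in the .conllulex excerpts. The sample lines and the name `sentences_with_gaps` are made up for illustration, and the harder step of updating offsets is not attempted.

```python
import re

MULTISPACE = re.compile(r" {2,}")  # two or more consecutive spaces

def sentences_with_gaps(conllulex_lines):
    """Yield (sent_id, text) for sentences whose raw text has consecutive spaces."""
    sent_id = None
    for line in conllulex_lines:
        line = line.rstrip("\n")
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
        elif line.startswith("# text = "):
            text = line[len("# text = "):]
            if MULTISPACE.search(text):
                yield sent_id, text

# Hypothetical sample: the second sentence mixes a blockquote and a reply.
lines = [
    "# sent_id = demo-01",
    "# text = A single clean sentence .",
    "# sent_id = demo-02",
    "# text = > quoted line  my reply",
]
candidates = list(sentences_with_gaps(lines))
print(candidates)
```

Each hit is only a candidate for splitting; emoticons and similar appendages will also match and should usually stay in the same sentence.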

    opened by nschneid 0
Releases(v2.0.1)
  • v2.0.1(Nov 21, 2021)

  • v2.0(Oct 22, 2021)

    • Switch to full .conllulex format following STREUSLE
      • add lexcats (#3), morphological features, newdoc directives
    • Scripts for validation and format conversion
    • Clean up various annotation issues, including:
      • restore apostrophes and fix other conversion problems (#6, #9)
      • include pretokenized raw text (#12)
  • v1.0.1(Dec 14, 2020)

    • Added .json file format
    • Switched lemmatization and POS tagging from StanfordNLP 0.2.0 to Stanza 1.1.1
    • Corrected rare encoding issue from v1.0
Owner
NERT @ Georgetown