PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English

Last update: Dec 02, 2021

PASTRIE

Official release of the corpus described in the paper:

Michael Kranzlein, Emma Manning, Siyao Peng, Shira Wein, Aryaman Arora, and Nathan Schneider (2020). PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English. Proceedings of the 14th Linguistic Annotation Workshop.

Overview

PASTRIE is a corpus of English data from Reddit annotated with preposition supersenses from the SNACS inventory.

While the data in PASTRIE is in English, it was produced by presumed speakers of four L1s:

English
French
German
Spanish

For details on how L1s were identified, see section 3.1 of Rabinovich et al. (2018).

Annotation Example

Below is an example sentence from the corpus, where annotation targets are bolded and preposition supersenses are annotated with the notation SceneRole↝Function. Together, a scene role and function are known as a construal.

Data Formats

PASTRIE is released in the following formats. We expect that most projects will be best served by one of the JSON formats.

.conllulex: the 19-column CoNLL-U-Lex format originally used for STREUSLE.
.json: a JSON representation of the CoNLL-U-Lex that does not require a CoNLL-U-Lex parser.
.govobj.json: an extended version of the JSON representation that contains information about each preposition's syntactic parent and object.
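
As a rough illustration of working with the JSON representation, the sketch below tallies construals over single-word expressions. It assumes the layout used by STREUSLE-style JSON: a list of sentence objects, each with a `sent_id` and a `swes` mapping whose entries carry `lexlemma`, `lexcat`, `ss`, `ss2`, and `toknums`. The inline sample and the function name `count_construals` are made up for this example and are not corpus data.

```python
import json
from collections import Counter

# Illustrative stand-in for the contents of a .json file (NOT real corpus data).
# Assumption: a list of sentences, each with "sent_id" and a "swes" mapping
# of single-word expressions keyed by token number.
SAMPLE = json.loads("""
[
  {"sent_id": "example-01",
   "swes": {
     "1": {"lexlemma": "my", "lexcat": "PRON.POSS",
           "ss": "p.SocialRel", "ss2": "p.Gestalt", "toknums": [1]},
     "2": {"lexlemma": "grandma", "lexcat": "N",
           "ss": null, "ss2": null, "toknums": [2]},
     "3": {"lexlemma": "over", "lexcat": "P",
           "ss": "p.Locus", "ss2": "p.Locus", "toknums": [3]}
   }}
]
""")

def count_construals(sentences):
    """Tally (scene role, function) pairs over single-word expressions."""
    counts = Counter()
    for sent in sentences:
        for swe in sent.get("swes", {}).values():
            if swe.get("ss"):  # skip expressions without a supersense
                counts[(swe["ss"], swe["ss2"])] += 1
    return counts

construals = count_construals(SAMPLE)
for (role, function), n in sorted(construals.items()):
    print(f"{role} ~> {function}: {n}")
```

For a real file, `SAMPLE` would be replaced by the result of `json.load` on the opened .json file, provided the layout assumption above holds.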

PASTRIE mostly follows STREUSLE with respect to the data format and SNACS annotation practice. Primary differences in the annotations are:

- Lemmas, part-of-speech tags, and syntactic dependencies aim to follow the UD standard in both cases. They are gold in STREUSLE, versus automatic with some manual corrections in PASTRIE.
  - PASTRIE does not group together base+clitic combinations, whereas STREUSLE does (multiword tokens, where a single orthographic word contains multiple syntactic words).
  - PASTRIE does not regularly specify SpaceAfter=No to indicate alignment between the tokens and the raw text.
  - In PASTRIE, the raw text string accompanying the sentence may contain two or more consecutive spaces.
  - PASTRIE lacks enhanced dependencies.
- Multiword expression annotations in PASTRIE are limited to expressions containing a preposition. Depending on the syntactic head, the expression may or may not have a SNACS supersense.
  - Verbal multiword expressions in PASTRIE are not subtyped in the lexcat; they all have a lexcat of V.
- Noun and verb expressions in PASTRIE do not have supersense labels.

Comments

Misc. annotation errors and/or conversion script bugs

There are some annotations which I'm fairly sure are incorrect and are choking up the JSON conversion script. (These errors occur using the unmodified versions of all scripts taken straight from STREUSLE.) One or two might also be indicative of a bug in the conllulex2json.py file.

vs mistagged as a noun--should be prep

AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

ditto

AssertionError: ('french-fad32caf-e595-e3cb-07bf-aaea891e53cb-02', {'lexlemma': 'versus', 'lexcat': 'CCONJ', 'ss': 'c', 'ss2': 'c', 'toknums': [3]}, {'#': 3, 'word': 'vs', 'lemma': 'versus', 'upos': 'NOUN', 'xpos': 'NN', 'feats': None, 'head': 8, 'deprel': 'nsubj', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-CCONJ-`c'})

Script complains about "to" in this snippet at ID=23. Not immediately clear to me what the issue is--perhaps that "to" is labeled ADP/IN? For its xpos I think it ought to be TO, not sure about its upos. Snippet:

13      shit    shit    NOUN    NN      _       16      obl:npmod       _       _       _       _       _       _       _       _       _       _       _
14      this    this    PRON    DT      _       16      nsubj   _       _       _       _       _       _       _       _       _       _       _
15      can     can     AUX     MD      _       16      aux     _       _       _       _       _       _       _       _       _       _       _
16      end     end     VERB    VB      _       4       parataxis       _       _       _       _       _       _       _       _       _       _       _
17      right   right   ADV     RB      _       18      advmod  _       _       _       _       _       _       _       _       _       _       _
18      now     now     ADV     RB      _       16      advmod  _       _       _       _       _       _       _       _       _       _       _
19      if      if      SCONJ   IN      _       21      mark    _       _       _       _       _       _       _       _       _       _       _
20      I       I       PRON    PRP     _       21      nsubj   _       _       _       _       _       _       _       _       _       _       _
21      want    want    VERB    VBP     _       16      advcl   _       _       _       _       _       _       _       _       _       _       _
22      it      it      PRON    PRP     _       21      obj     _       _       _       _       _       _       _       _       _       _       _
23      to      to      ADP     IN      _       21      obl     _       _       _       _       _       `i      `i      _       _       _       _
24      .       .       PUNCT   .       _       4       punct   _       _       _       _       _       _       _       _       _       _       _

Error:

AssertionError: ('french-a17a4340-f9c0-8fef-fa1b-1bf13879399b-02', {'lexlemma': 'to', 'lexcat': 'INF', 'ss': 'i', 'ss2': 'i', 'toknums': [23]}, {'#': 23, 'word': 'to', 'lemma': 'to', 'upos': 'ADP', 'xpos': 'IN', 'feats': None, 'head': 21, 'deprel': 'obl', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-INF-`i'})

Relevant span of code:

            if validate_pos and upos!=lc and (upos,lc) not in {('NOUN','N'),('PROPN','N'),('VERB','V'),
                ('ADP','P'),('ADV','P'),('SCONJ','P'),
                ('ADP','DISC'),('ADV','DISC'),('SCONJ','DISC'),
                ('PART','POSS')}:
                # most often, the single-word lexcat should match its upos
                # check a list of exceptions
                mismatchOK = False
                if xpos=='TO' and lc.startswith('INF'):
                    mismatchOK = True
                elif (xpos=='TO')!=lc.startswith('INF'):
                    assert upos in ['SCONJ', "ADP"] and swe['lexlemma']=='for',(sent['sent_id'],swe,tok)
                    mismatchOK = True
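
To see why the quoted span rejects this token, here is a simplified, standalone re-statement of the same branching, written only for tracing. The function name `explain_check` and its return strings are invented for this sketch; it ignores `validate_pos` and any checks that come after the quoted span in the real script.

```python
# (upos, lexcat) pairs that the quoted code accepts without further checks.
ALLOWED = {("NOUN", "N"), ("PROPN", "N"), ("VERB", "V"),
           ("ADP", "P"), ("ADV", "P"), ("SCONJ", "P"),
           ("ADP", "DISC"), ("ADV", "DISC"), ("SCONJ", "DISC"),
           ("PART", "POSS")}

def explain_check(upos, xpos, lexcat, lexlemma):
    """Describe which branch of the quoted check a token would take."""
    if upos == lexcat or (upos, lexcat) in ALLOWED:
        return "ok: upos and lexcat are compatible"
    is_to = (xpos == "TO")
    is_inf = lexcat.startswith("INF")
    if is_to and is_inf:
        return "ok: infinitival 'to' with xpos TO"
    if is_to != is_inf:
        # The quoted code asserts that only lexlemma 'for' may reach here.
        if upos in ("SCONJ", "ADP") and lexlemma == "for":
            return "ok: infinitival 'for'"
        return "AssertionError: xpos and lexcat disagree about INF status"
    return "falls through to checks outside the quoted span"

# Token 23 from the snippet: upos ADP, xpos IN, lexcat INF, lexlemma 'to'.
result = explain_check("ADP", "IN", "INF", "to")
print(result)
```

Under this reading, the token fails because its lexcat is INF while its xpos is IN rather than TO, and its lemma is not "for". With xpos TO the same token would take the first exception branch.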

Originator as function:

(in french-c02823ec-60bd-adce-7327-01337eb9d1c8-02) AssertionError: ('p.Originator should never be function', {'lexlemma': 'you', 'lexcat': 'PRON.POSS', 'ss': 'p.Originator', 'ss2': 'p.Originator', 'toknums': [1]})

lexcat DISC with ADJ:

AssertionError: In spanish-a25e8289-e04a-f5af-ce56-ead9faca65b1-02, single-word expression 'like' has lexcat DISC, which is incompatible with its upos ADJ

"her" tagged with Possessor is incorrectly parsed as iobj and tagged as PRP instead of PRP$. Relevant snippet:

1       My      my      PRON    PRP$    _       2       nmod:poss       _       _       _       _       _       SocialRel       Gestalt _       _       _       _
2       grandma grandma NOUN    NN      _       3       nsubj   _       _       _       _       _       _       _       _       _       _       _
3       had     have    VERB    VBD     _       0       root    _       _       _       _       _       _       _       _       _       _       _
4       her     she     PRON    PRP     _       3       iobj    _       _       _       _       _       Possessor       Possessor       _       _       _       _
5       super   super   ADV     RB      _       6       advmod  _       _       _       _       _       _       _       _       _       _       _
6       thick   thick   ADJ     JJ      _       8       amod    _       _       _       _       _       _       _       _       _       _       _
7       floor   floor   NOUN    NN      _       8       compound        _       _       _       _       _       _       _       _       _       _       _
8       mats    mat     NOUN    NNS     _       3       obj     _       _       _       _       _       _       _       _       _       _       _
9       *       *       PUNCT   NFP     _       8       punct   _       _       _       _       _       _       _       _       _       _       _
10      over    over    ADP     IN      _       13      case    _       _       _       _       _       Locus   Locus   _       _       _       _
11      *       *       PUNCT   NFP     _       13      punct   _       _       _       _       _       _       _       _       _       _       _
12      the     the     DET     DT      _       13      det     _       _       _       _       _       _       _       _       _       _       _
13      accelerator     accelerator     NOUN    NN      _       3       obl     _       _       _       _       _       _       _       _       _       _       _
14      ,       ,       PUNCT   ,       _       3       punct   _       _       _       _       _       _       _       _       _       _       _

Error:

AssertionError: In spanish-ebba3c73-2431-c216-8f4d-d469ee8d5564-01, single-word expression 'her' has lexcat P, which is incompatible with its upos PRON

"NA" is misannotated--this is NA as in North America, i.e. a PROPN/NP, but it's lemmatized as "no", and its tags are weird.

AssertionError: ('german-35000895-1d78-c18a-01ed-f7410b9c0581-01', {'lexlemma': 'no', 'lexcat': 'ADV', 'ss': None, 'ss2': None, 'toknums': [5]}, {'#': 5, 'word': 'NA', 'lemma': 'no', 'upos': 'PART', 'xpos': 'TO', 'feats': None, 'head': 6, 'deprel': 'mark', 'edeps': None, 'misc': None, 'smwe': None, 'wmwe': None, 'lextag': 'O-ADV'})

opened by lgessler 6

Prepositional supersense annotations on non-preposition targets

Is it OK for a verb-headed SMWE to have a prepositional supersense? The validator complains about it. Offending SMWE:

21	give	give	VERB	VB	_	10	conj	_	_	2:1	_	give up on	p.Theme	p.Theme	_	_	_	_
22	up	up	ADP	RP	_	21	compound:prt	_	_	2:2	_	_	_	_	_	_	_	_
23	on	on	ADP	IN	_	24	case	_	_	2:3	_	_	_	_	_	_	_	_

opened by lgessler 5

Prepositions unannotated for supersense

Token 6:

# sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02
# text = Once you have the hang of it it s pretty fast ( and does n't eat your clutch ) .
1	Once	once	SCONJ	IN	_	3	mark	_	_	_	_	_	_	_	_	_	_	_
2	you	you	PRON	PRP	_	3	nsubj	_	_	_	_	_	_	_	_	_	_	_
3	have	have	VERB	VBP	_	11	advcl	_	_	_	_	_	_	_	_	_	_	_
4	the	the	DET	DT	_	5	det	_	_	_	_	_	_	_	_	_	_	_
5	hang	hang	NOUN	NN	_	3	obj	_	_	_	_	_	_	_	_	_	_	_
6	of	of	ADP	IN	_	7	case	_	_	_	_	_	_	_	_	_	_	_
7	it	it	PRON	PRP	_	5	nmod	_	_	_	_	_	_	_	_	_	_	_
8	it	it	PRON	PRP	_	11	nsubj	_	_	_	_	_	_	_	_	_	_	_
9	s	be	AUX	VBZ	_	11	cop	_	_	_	_	_	_	_	_	_	_	_
10	pretty	pretty	ADV	RB	_	11	advmod	_	_	_	_	_	_	_	_	_	_	_
11	fast	fast	ADJ	JJ	_	0	root	_	_	_	_	_	_	_	_	_	_	_
12	(	(	PUNCT	-LRB-	_	16	punct	_	_	_	_	_	_	_	_	_	_	_
13	and	and	CCONJ	CC	_	16	cc	_	_	_	_	_	_	_	_	_	_	_
14	does	do	AUX	VBZ	_	16	aux	_	_	_	_	_	_	_	_	_	_	_
15	n't	not	PART	RB	_	16	advmod	_	_	_	_	_	_	_	_	_	_	_
16	eat	eat	VERB	VB	_	11	conj	_	_	_	_	_	_	_	_	_	_	_
17	your	you	PRON	PRP$	_	18	nmod:poss	_	_	_	_	_	Possessor	Possessor	_	_	_	_
18	clutch	clutch	NOUN	NN	_	16	obj	_	_	_	_	_	_	_	_	_	_	_
19	)	)	PUNCT	-RRB-	_	11	punct	_	_	_	_	_	_	_	_	_	_	_
20	.	.	PUNCT	.	_	11	punct	_	_	_	_	_	_	_	_	_	_	_

I assumed that all preps were supposed to be annotated, but perhaps not?
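
A quick way to list other candidates like token 6 is to scan for ADP tokens whose supersense column is empty. The sketch below assumes tab-separated 19-column lines with UPOS, SMWE, and the first supersense at 0-based positions 3, 10, and 13, as in the snippet above. Tokens that belong to a strong MWE are skipped, on the assumption that the supersense may sit on another token of the expression. `make_row` and `unannotated_adpositions` are names invented for this example.

```python
# Column positions as they appear in the snippet above (0-based).
UPOS, SMWE, SS = 3, 10, 13

def make_row(num, form, lemma, upos, xpos, head, deprel, smwe="_", ss="_"):
    """Build one 19-column line; a helper for this example only."""
    cols = [num, form, lemma, upos, xpos, "_", head, deprel] + ["_"] * 11
    cols[SMWE] = smwe
    cols[SS] = cols[SS + 1] = ss
    return "\t".join(cols)

SAMPLE = "\n".join([
    "# sent_id = french-f57dd6ab-5263-4c8a-e360-8ec683e6a37a-02",
    make_row("5", "hang", "hang", "NOUN", "NN", "3", "obj"),
    make_row("6", "of", "of", "ADP", "IN", "7", "case"),
    make_row("17", "your", "you", "PRON", "PRP$", "18", "nmod:poss",
             ss="Possessor"),
])

def unannotated_adpositions(conllulex_text):
    """Return (token number, word) for ADP tokens lacking a supersense."""
    hits = []
    for line in conllulex_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if cols[UPOS] == "ADP" and cols[SMWE] == "_" and cols[SS] == "_":
            hits.append((cols[0], cols[1]))
    return hits

hits = unannotated_adpositions(SAMPLE)
print(hits)  # [('6', 'of')]
```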

opened by lgessler 3

Apostrophes removed in preprocessing?

Looking through the data, there are a LOT of sentences where clitics are tokenized off but lack an apostrophe. Is that just the genre or did they get lost in preprocessing?

opened by nschneid 2

Dataset requested

Hi all,

I would like to request the PASTRIE dataset accompanying the paper "PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English".

Thanks for your reply.

opened by fj-morales 2

SNACS supersense tags should start with "p."

For compatibility with STREUSLE, it should be p.Locus, p.Theme, etc.

Special labels like `i `d `c `$ ?? should not start with p.. In fact, the backtick labels from annotation are not represented as such in STREUSLE—they are reflected in the LEXCAT column of the data.
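
A minimal sketch of the renaming described here, assuming labels arrive as plain strings: supersense names gain the `p.` prefix, while backtick labels and `??` are left untouched. Moving the backtick labels into the LEXCAT column, as STREUSLE does, is not attempted. The function name `normalize_label` is hypothetical.

```python
# Labels that should never receive the "p." prefix.
SPECIAL_PREFIXES = ("`", "??")

def normalize_label(label):
    """Add the 'p.' prefix to a bare supersense label, if it needs one."""
    if label is None or label == "_":
        return label  # no annotation on this token
    if label.startswith(SPECIAL_PREFIXES) or label.startswith("p."):
        return label  # special label, or already prefixed
    return "p." + label

for raw in ["Locus", "Theme", "p.Theme", "`i", "`$", "??"]:
    print(raw, "->", normalize_label(raw))
```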

opened by nschneid 0

Questionable adpositional MWEs
in_male_term — from "in male terms"; should be in_term (at most)

in_the_first_place

in_my_hand — from "in my hands"; should be in_hand (at most)

for_quite_some_time — just Duration for, weak MWE?

at_all_time — from what should have been "at all times". OK?

on_a_smaller_scale — omit adjective?

withouth — typo

see_as — "seeing as" (deverbal MWE acting like a preposition)
opened by nschneid 0

Some undersegmentation of sentences

Despite manual editing there are still places where a long sentence ought to be split up (esp. where it consists of a blockquoted sentence with > followed by a response). Looking for multiple consecutive spaces in the raw text uncovers some of these (as well as some discourse appendages like emoticons, which should probably remain in the same UD sentence).

It would be nice to write a script to help clean these up—the tricky part is updating offsets in each parse.
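
The first step, finding candidates, could be sketched as follows. It only flags sentences whose `# text = ` line contains two or more consecutive spaces; it does not split sentences or update any offsets. The sample text and sentence IDs are invented, and `flag_multispace_sentences` is a hypothetical name.

```python
import re

MULTISPACE = re.compile(r" {2,}")  # two or more consecutive spaces

def flag_multispace_sentences(conllulex_text):
    """Return (sent_id, text) for sentences whose raw text has a space run."""
    flagged = []
    sent_id = None
    for line in conllulex_text.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):]
        elif line.startswith("# text = "):
            text = line[len("# text = "):]
            if MULTISPACE.search(text):
                flagged.append((sent_id, text))
    return flagged

# Invented sample: the first sentence mimics a blockquote plus a response.
SAMPLE = """\
# sent_id = example-01
# text = > some quoted sentence .  And then a response .
# sent_id = example-02
# text = A sentence with single spaces only .
"""

flagged = flag_multispace_sentences(SAMPLE)
for sid, text in flagged:
    print(sid, repr(text))
```

Each flagged sentence would still need manual review, since some of the matches are discourse appendages that should stay in the same sentence.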

opened by nschneid 0

Releases (v2.0.1)

v2.0.1 (Nov 21, 2021)

Fixes 3 erroneous sentence IDs (along with beefed-up sentence ID validation in scripts). (#16)

v2.0 (Oct 22, 2021)

Switch to full .conllulex format following STREUSLE

add lexcats (#3), morphological features, newdoc directives

Scripts for validation and format conversion

Clean up various annotation issues, including:

restoring apostrophes and fixing other conversion problems (#6, #9)

include pretokenized raw text (#12)

v1.0.1 (Dec 14, 2020)

Added .json file format

Switched lemmatization and POS tagging from StanfordNLP 0.2.0 to Stanza 1.1.1

Corrected rare encoding issue from v1.0

v1.0 (Dec 12, 2020)


Owner

NERT @ Georgetown
