Conditional probing: measuring usable information beyond a baseline

Overview

Figure: three probe setups. The first uses both baseline and representation as input, achieving 100% accuracy; the second uses just the baseline, achieving 75%; the third uses just the representation, achieving 25%.

conditional-probing

Codebase for easy specification of (conditional) (V-information) probing experiments.

Highlights:

  • Conditional probing: measure only the aspects of a property that aren't explainable by the baseline of your choice.
  • Train a probe using one or many layers from one or many models as input representations; loading and concatenation of representations performed automatically.
  • Integration with huggingface for specifying representation.
  • Heuristic subword-token-to-token alignment of Tenney et al., 2019 performed per-model.
  • Sort of smart caching of tokenized datasets and subword token alignment matrices to hdf5 files.
  • Modular design of probes, training regimen, representation models, and reporting.
  • Change out classes specifying probes or representations directly through YAML configs instead of if statements in code.

Written for the paper Conditional probing: measuring usable information beyond a baseline (EMNLP 2021).

Installing and getting started

  1. Clone the repository.

     git clone https://github.com/john-hewitt/vinfo-probing-internal/
     cd vinfo-probing-internal
    
  2. [Optional] Construct a virtual environment for this project. Only python3 is supported.

     conda create --name sp-env
     conda activate sp-env
    
  3. Install the required packages.

     conda install --file requirements.txt
    
  4. Run your first experiment using a provided config file. This experiment trains and reports a part-of-speech probe on layer 5 of the roberta-base model.

     python vinfo/experiment.py example/roberta768-upos-layer5-example-cpu.yaml
    
  5. Take a look at the config file, example/roberta768-upos-layer5-example.yaml. It states that the results and probe parameters are saved to example/, a directory that would've been created if it hadn't already existed. If your experiment ran without error, you should see the following files in that directory:

     dev.v_entropy
     train.v_entropy
     dev.label_acc
     train.label_acc
     params
    

    The v_entropy files store a single float: the variational entropy as estimated on the {dev,train} set. The label_acc files store a single float: the part-of-speech tagging accuracies on the {dev,train} set. The params file stores the probe parameters.

  6. Make a minimal change to the config file, say replacing the roberta-base model with another model, specified by its huggingface identifier string.
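The v_entropy files from step 5 are the building block for the conditional quantity this codebase is about. Here is a minimal sketch, assuming you have run one experiment with only the baseline as input and another with baseline plus representation, each writing to its own results directory. The helper names and the two-directory layout are hypothetical and not part of the codebase; the subtraction follows the paper's definition of conditional V-information as the V-entropy given the baseline minus the V-entropy given both.

```python
from pathlib import Path


def read_metric(path):
    """Read a results file that stores a single float."""
    return float(Path(path).read_text().strip())


def conditional_v_information(baseline_dir, combined_dir, split="dev"):
    """V-entropy with the baseline alone minus V-entropy with baseline + representation.

    Both directories are assumed to hold a `<split>.v_entropy` file
    written by a finished experiment.
    """
    h_baseline = read_metric(Path(baseline_dir) / f"{split}.v_entropy")
    h_combined = read_metric(Path(combined_dir) / f"{split}.v_entropy")
    return h_baseline - h_combined
```

A positive value means the representation made the property easier to predict than the baseline alone did.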

YAML-centric Design

This codebase revolves around the yaml configuration files that specify experiment settings. Intended to minimize the amount of experiment logic code needed to swap out new Probe, Loss, or Model classes when extending the repository, all python classes defined in the codebase are actually constructed with the yaml loading process.

This is mostly documented in the pyyaml docs, but briefly, consider the following config snippet:

cache: &id_cache !WholeDatasetCache
  train_path: &idtrainpath example/data/en_ewt-ud-sample/en_ewt-ud-train.conllu 
  dev_path: &iddevpath example/data/en_ewt-ud-sample/en_ewt-ud-dev.conllu
  test_path: &idtestpath example/data/en_ewt-ud-sample/en_ewt-ud-test.conllu 

When the yaml config is loaded, it results in a dictionary with the key cache. The fun magic part is that !WholeDatasetCache references the code in cache.py, wherein the WholeDatasetCache class has the class attribute yaml_tag = !WholeDatasetCache. The train_path, dev_path, and test_path entries are the arguments to this class's __init__ function. Because of this, the value stored at key cache is an instance of WholeDatasetCache, constructed during yaml loading with the arguments provided.

All experiment objects -- Probes, Models, Datasets, are constructed during yaml initialization in the same way. Because of this, the logic for running an experiment -- in experiment.py -- is short.
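To make the tag mechanism concrete, here is a self-contained sketch using PyYAML. The WholeDatasetCache below is a toy stand-in, and the constructor function and ConfigLoader subclass are assumptions made for illustration; the repository's own registration code (see utils.py) may well differ.

```python
import yaml


class WholeDatasetCache:
    """Toy stand-in for the class of the same name in cache.py."""

    yaml_tag = "!WholeDatasetCache"

    def __init__(self, train_path, dev_path, test_path):
        # Only assign arguments; nothing here depends on other objects.
        self.train_path = train_path
        self.dev_path = dev_path
        self.test_path = test_path


def construct_from_mapping(loader, node):
    """Build the object by passing the yaml mapping to __init__ as kwargs."""
    kwargs = loader.construct_mapping(node, deep=True)
    return WholeDatasetCache(**kwargs)


class ConfigLoader(yaml.SafeLoader):
    """Private loader so the tag registration does not leak globally."""


ConfigLoader.add_constructor(WholeDatasetCache.yaml_tag, construct_from_mapping)

CONFIG_TEXT = """
cache: &id_cache !WholeDatasetCache
  train_path: &idtrainpath example/data/en_ewt-ud-sample/en_ewt-ud-train.conllu
  dev_path: &iddevpath example/data/en_ewt-ud-sample/en_ewt-ud-dev.conllu
  test_path: &idtestpath example/data/en_ewt-ud-sample/en_ewt-ud-test.conllu
"""

config = yaml.load(CONFIG_TEXT, Loader=ConfigLoader)
print(type(config["cache"]).__name__)  # WholeDatasetCache
```

Swapping in a different class is then a matter of changing the tag in the config, with no branching in the experiment code.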

Some yaml basics

If you're not familiar with yaml, it's worthwhile to take a peek at the documentation. We make frequent use of the referencing feature of yaml -- the ability to give an object in the .yaml config file an identifier, and place the same object elsewhere in the config file by referencing the identifier.

Making the label looks like:

input_fields: &id_input_fields
  - id
  - form

where the ampersand in &id_input_fields indicates the registration of an identifier; this object can then be placed elsewhere in the config through

fields: *id_input_fields

where the asterisk in *id_input_fields indicates the reference of the object.
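If you want to see what the reference does once loaded, this small PyYAML check may help. It is just an illustration with a two-key config made up for the purpose; the point is that the alias gives you the same Python object, not a copy.

```python
import yaml

CONFIG_TEXT = """
input_fields: &id_input_fields
  - id
  - form
fields: *id_input_fields
"""

config = yaml.safe_load(CONFIG_TEXT)

# The alias resolves to the very same list that the anchor labeled.
same_object = config["fields"] is config["input_fields"]
print(config["fields"], same_object)
```

This sharing is why several objects in a config can all hold the one input_fields list or the one cache object.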

Limiting logic in __init__ due to yaml use

While the yaml object construction design decision makes it transparent which objects will be used in the course of a given experiment (instead of if/else/case statements that grow with the codebase scope), it adds a somewhat annoying consideration when writing code for these classes.

Stated briefly, all you can do in the __init__ functions of your classes is assign arguments as instance variables, like self.thing = thing; you cannot run any code that relies on thing being an already-constructed object.

In more depth, the yaml loading process doesn't provide a guarantee on what order objects will be constructed. But we refer to objects (like the input_fields list) in constructing other objects, through yaml object reference. (Since, say, the dataset classes need to know what the input_fields list is.) So, when going through yaml loading, we do call __init__ functions (see utils.py), but we are just passing around references and doing simple computation that doesn't depend on other yaml-constructed objects.

This means, somewhat unfortunately, that setup-style functionality, like checking the validity of cache files for the dataset classes, has to be run at some time other than __init__. In practice, we insert a check-for-setup condition into the functions that need the setup to have been run.
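The pattern described above looks roughly like the following. Both classes here are hypothetical and exist only to show the shape of the idea: __init__ stores references, and the first method that needs real data triggers the setup.

```python
class ToyCache:
    """Hypothetical cache that counts how often it is asked to load."""

    def __init__(self):
        self.load_calls = 0

    def load_sentences(self):
        self.load_calls += 1
        return [["The", "cat"], ["Hello"]]


class LazySetupDataset:
    """Hypothetical dataset class illustrating deferred setup."""

    def __init__(self, cache, input_fields):
        # Only assign arguments; cache may not be fully constructed yet.
        self.cache = cache
        self.input_fields = input_fields
        self._is_setup = False
        self._sentences = None

    def _ensure_setup(self):
        """Run the expensive setup once, the first time it is needed."""
        if self._is_setup:
            return
        # Safe to use self.cache here: yaml loading has finished by now.
        self._sentences = self.cache.load_sentences()
        self._is_setup = True

    def __len__(self):
        self._ensure_setup()
        return len(self._sentences)
```

Constructing LazySetupDataset never touches the cache; calling len() on it does, and only the first time.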

This toolkit is intended to be easily extensible, and allow for quick swapping of experimental components. As such, the code is split into an arguably reasonable class layout, wherein one can write a new Probe or new Loss class somewhat easily.

Code layout and config runthrough

In this section we walk through the example configuration file and explain the classes associated with each component. Each of these subsections refers to an object constructed during yaml loading, which is a "top-level" object, available in the loaded yaml config.

Input-fields

Input-fields, for conll-formatted files, provides string labels for the columns of the file.

input_fields: &id_input_fields
  - id
  - form
  - lemma
  - upos
  - ptb_pos
  - feats
  - dep_head
  - dep_rel
  - None
  - misc

These identifiers will be used to pull the data of a column in the AnnotationDataset class; we'll go over this when we get to the dataset part of the config.
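As a rough picture of how a field name picks out a column, consider the sketch below. It assumes tab-separated columns and uses a made-up example line; column_for_task is a hypothetical helper, not the AnnotationDataset code.

```python
# Same labels as the input_fields list in the config; yaml loads the
# bare word None as the string "None".
INPUT_FIELDS = ["id", "form", "lemma", "upos", "ptb_pos", "feats",
                "dep_head", "dep_rel", "None", "misc"]


def column_for_task(conll_line, task_name, input_fields=INPUT_FIELDS):
    """Return the value in the column whose label equals task_name."""
    columns = conll_line.rstrip("\n").split("\t")
    if len(columns) != len(input_fields):
        raise ValueError("line has %d columns, expected %d"
                         % (len(columns), len(input_fields)))
    return columns[input_fields.index(task_name)]


example_line = "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_\n"
print(column_for_task(example_line, "ptb_pos"))  # DT
```

Changing the task from ptb_pos to upos or dep_rel just changes which column is read.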

cache

The cache object does some simple filesystem timestamp checking, and non-foolproof lock checking, to determine whether cache files for each dataset should be read from, or written to. This is crucial for running many experiments with Huggingface transformers models, since the tokenization and alignment of subword tokens to corpus tokens takes more time than running the experiment itself once loaded.

cache: &id_cache !WholeDatasetCache
  train_path: &idtrainpath scripts/ontonotes_scripts/train.ontonotes.withdep.conll
  dev_path: &iddevpath scripts/ontonotes_scripts/dev.ontonotes.withdep.conll
  test_path: &idtestpath scripts/ontonotes_scripts/test.ontonotes.withdep.conll

Note that we make reference ids for both the WholeDatasetCache object itself and for the {train,dev,test} file paths, so we can use these later.
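For intuition about the timestamp check, here is a minimal sketch of one way to decide between reading and writing a cache file. The function name and the exact rule (cache at least as new as the data file) are assumptions for illustration; the real WholeDatasetCache also does lock checking, which is left out here.

```python
import os


def cache_is_fresh(data_path, cache_path):
    """True if cache_path exists and is at least as new as data_path."""
    if not os.path.exists(cache_path):
        return False
    return os.path.getmtime(cache_path) >= os.path.getmtime(data_path)


def choose_cache_action(data_path, cache_path):
    """Return "read" to reuse the cache file, "write" to (re)build it."""
    return "read" if cache_is_fresh(data_path, cache_path) else "write"
```

Under this rule, touching a data file forces its cache to be rebuilt on the next run.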

disk_reader

The Reader objects are written to handle the oddities of a given filetype. The OntonotesReader object, for example, reads conll files, turning lines into sentences (given the input_fields object, above), while the SST2Reader object knows how to read label\TABtokenized_sentence data, as given by the SST2 task of the GLUE benchmark.

disk_reader: !OntonotesReader &id_disk_reader
  args:
    device: cpu
  train_path: *idtrainpath 
  dev_path: *iddevpath 
  test_path: *idtestpath 

The args bit here is sort of a vestigial part of earlier code design; its only member, the device, is used whenever PyTorch objects are involved, to put tensors on the right device. Note how it references the dataset filepaths that were registered in the cache part of the config.
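To give a feel for what a Reader does with its filetype, here is a tiny sketch for the label\TABtokenized_sentence layout mentioned above. It assumes tokens are separated by single spaces and that blank lines can be skipped; read_sst2_lines is a hypothetical helper, not the SST2Reader class.

```python
def read_sst2_lines(lines):
    """Yield (label, tokens) pairs from label<TAB>tokenized_sentence lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, sentence = line.split("\t", 1)
        yield label, sentence.split(" ")


examples = list(read_sst2_lines(["1\ta great movie\n", "\n", "0\tdull\n"]))
```

A conll reader does the same job with a different unit: it groups consecutive lines into sentences instead of treating each line as one example.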

dataset

The ListDataset object is always the top-level object of the dataset key; its job is to gather together output labels, and all of the input types, concatenate together the input, and yield minibatches for training and evaluation.

dataset: !ListDataset
  args:
    device: cpu
  data_loader: *id_disk_reader
  output_dataset: !AnnotationDataset
    args:
      device: cpu
    task: !TokenClassificationTask
      args:
        device: cpu
      task_name: ptb_pos
      input_fields: *id_input_fields
  input_datasets:
    - !HuggingfaceData
      args:
        device: cpu
        #model_string: &model1string google/bert_uncased_L-2_H-128_A-2
      model_string: &model1string google/bert_uncased_L-4_H-128_A-2
      cache: *id_cache
  batch_size: 5 

It is given the Reader from above (under the data_loader key) so it can read data from disk. It has a single specified Dataset for its output, here an AnnotationDataset. The AnnotationDataset given here takes in a Task object -- here a TokenClassificationTask -- to provide the labels for the output task. The TokenClassificationTask provides a label, using the task_name to pick out a column from the conll input file, as labeled by the input_fields list.

The input_datasets argument is a list of Dataset objects. All of these datasets' representations are bundled together by the ListDataset. Here, we only have one element in the list, a HuggingfaceData object, which runs the huggingface model specified by the model_string, but we could add a representation by adding another entry to the list. The HuggingfaceData tokens and subword-to-corpus token alignment matrices will be read or written according to the cache given.

The Dataset generates (subword) tokens and alignment matrices, or label indices -- whatever a model needs as input.

Note that tasks like part-of-speech and dependency label, which have independent token-level labels, are easily exchangeable in the TokenClassificationTask. But to run a task like named entity recognition, with its specialized specification of entity-level annotation (and evaluation, later), specialized classes are needed, like NERClassificationTask.

model

For each dataset in input_datasets, a corresponding Model class takes the raw tokens provided by that Dataset and turns them into a representation. So, a HuggingfaceData above corresponds to a HuggingfaceModel here.

model: !ListModel
  args: 
    device: cpu
  models:
    - !HuggingfaceModel
        args: 
          device: cpu
        model_string: *model1string
        trainable: False
        index: 1

The HuggingfaceModel class runs the transformer model, and provides the representations of the layer at index index. The trainable flag specifies whether to backpropagate gradients through the model and update its weights during training.

probe

The Probe classes turn the representations given by Model classes into the logits of a distribution over the labels of the output task.

probe: !OneWordLinearLabelProbe
  args:
    device: cpu
  model_dim: 128
  label_space_size: 50

Somewhat unfortunately, it needs to be explicitly told what input and output dimensionality to expect.
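Why the two sizes matter is easiest to see with a toy version. The sketch below assumes the probe is a single affine map from a model_dim vector to label_space_size logits, and it is written in plain Python instead of PyTorch to stay self-contained; both function names are hypothetical.

```python
import random


def make_linear_probe(model_dim, label_space_size, seed=0):
    """Create a weight matrix (labels x model_dim) and a zero bias."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(model_dim)]
               for _ in range(label_space_size)]
    bias = [0.0] * label_space_size
    return weights, bias


def probe_logits(representation, weights, bias):
    """Map one token's representation to one logit per label."""
    if len(representation) != len(weights[0]):
        raise ValueError("representation has %d dims, probe expects %d"
                         % (len(representation), len(weights[0])))
    return [sum(w * x for w, x in zip(row, representation)) + b
            for row, b in zip(weights, bias)]


weights, bias = make_linear_probe(model_dim=128, label_space_size=50)
logits = probe_logits([0.0] * 128, weights, bias)
```

If you swap in a model with a different hidden size, model_dim in the config has to change with it, or the shapes will not line up.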

regimen

The regimen specifies a training procedure, with learning rate decay, loss, etc. Most of this is hard-coded right now to sane defaults.

regimen: !ProbeRegimen
  args:
    device: cpu
  max_epochs: 50
  params_path: params
  reporting_root: &id_reporting_root example/pos-bert-base.yaml.results
  eval_dev_every: 10

There's only one trainer as of now. By convention, I put results directories at the config path plus .results, as in example/pos-bert-base.yaml.results above. The params_path is relative to reporting_root.

reporter

The reporter class takes predictions at the end of training, and reports evaluation metrics.

reporter: !IndependentLabelReporter
  args:
    device: cpu
  reporting_root: *id_reporting_root
  reporting_methods:
    - label_accuracy
    - v_entropy

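For each of the strings in reporting_methods, a reporter function (which is specified by a hard-coded map from reporting string to function) is run on the data. The result of each metric is written to a file under reporting_root named for the split and the metric, like the dev.v_entropy and train.label_acc files from the getting-started example.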

Note that some reporters and metrics are specialized to a task. For example, SST2 has its own SST2Reporter (though it's really just a sentence-level classification reporter) and named entity recognition has its own NERReporter, which calls the Stanza library's NER evaluation script.
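The string-to-function map used by the reporters can be pictured as below. This is a sketch under assumptions, not the reporter code: predictions are taken to be (label probabilities, gold label index) pairs, v_entropy is computed as the mean negative log-probability of the gold label in nats, and the file suffixes are chosen to match the filenames shown in the getting-started section.

```python
import math
import os


def label_accuracy(predictions):
    """Fraction of examples whose highest-probability label is the gold one."""
    correct = sum(
        1 for probs, gold in predictions
        if max(range(len(probs)), key=probs.__getitem__) == gold
    )
    return correct / len(predictions)


def v_entropy(predictions):
    """Mean negative log-probability (nats) assigned to the gold label."""
    total = sum(math.log(probs[gold]) for probs, gold in predictions)
    return -total / len(predictions)


# Hard-coded map from reporting string to (function, filename suffix).
REPORTING_METHODS = {
    "label_accuracy": (label_accuracy, "label_acc"),
    "v_entropy": (v_entropy, "v_entropy"),
}


def report(predictions, split, reporting_root, reporting_methods):
    """Run each requested metric and write it to <root>/<split>.<suffix>."""
    os.makedirs(reporting_root, exist_ok=True)
    for name in reporting_methods:
        function, suffix = REPORTING_METHODS[name]
        path = os.path.join(reporting_root, "%s.%s" % (split, suffix))
        with open(path, "w") as f:
            f.write("%s\n" % function(predictions))
```

Adding a metric under this layout means writing one function and adding one entry to the map.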

Config recipes

Replicating the EMNLP 2021 paper

Take a look at our CodaLab executable paper for the exact bash scripts we ran to reproduce all the numbers in the paper. The configs that govern each of the experiments are under

    configs/codalab/round1/{task_name}/{roberta768,elmo}/layer-*.yaml

where task_name is one of ptb_pos, upos, dep_rel, named_entities, sst2.

Named Entity Recognition config recipe

For an example of an NER config (e.g., using span-based eval), see

    configs/round1/named_entities/roberta768/layer0.yaml

SST2 config recipe

For an example of a sentiment config (e.g., averaging the word embeddings for a sentence embedding), see

    configs/round1/sst2/roberta768/layer0.yaml

Data preparation

Ontonotes

See the scripts/ontonotes_scripts directory for notes on how we prep ontonotes. The scripts we use exactly recreate the splits of Strubell et al., 2017, a well-used split that, due to changes in preprocessing script versioning and link rot over the years of CoNLL and Ontonotes, had become (to us) difficult to re-create. As such, to the greatest extent possible, we just paste the exact scripts here instead of linking to them.

If you just want the data, it's a few steps:

Let ldc_ontonotes_path be the path to your LDC download of Ontonotes 5.0, that is, LDC2013T19. Mine looks like /scr/corpora/ldc/2013/LDC2013T19/ontonotes-release-5.0/data/files/data/. Unfortunately, we can't host this for you.

Next, due to some regrettable firewalling, our script to download the train/dev/test split information fails, so you have to navigate via a browser to:

  https://cemantix.org/conll/2012/download/

and manually download conll-2012-train.v4.tar.gz, conll-2012-development.v4.tar.gz, and then navigate to the test folder and download conll-2012-test-key.tar.gz. Place these files in the scripts/ontonotes_scripts/ directory of this repository.

Now, run

cd scripts/ontonotes_scripts
ldc_ontonotes_path=/scr/corpora/ldc/2013/LDC2013T19/ontonotes-release-5.0/data/files/data/
bash prep_ontonotes_v4.sh "$ldc_ontonotes_path"

Nice.

Statistics:

            Train       Dev       Test
Sentences   59,924      8,528     8,262
Tokens      1,088,503   147,724   152,728

Citation

If you use this repository, please cite:

  @InProceedings{hewitt2021conditional,
    author =      "Hewitt, John and Ethayarajh, Kawin and Liang, Percy and Manning, Christopher D.",
    title =       "Conditional probing: measuring usable information beyond a baseline",
    booktitle =   "Conference on Empirical Methods in Natural Language Processing",
    year =        "2021",
    publisher =   "Association for Computational Linguistics",
    location =    "Punta Cana, Dominican Republic",
  }