Text-Based Ideal Points

Related tags

Deep Learningtbip
Overview

Text-Based Ideal Points

Source code for the paper: Text-Based Ideal Points by Keyon Vafa, Suresh Naidu, and David Blei (ACL 2020).

Update (June 29, 2020): We have added interactive visualizations of topics learned by our model.

Update (May 25, 2020): We have added a PyTorch implementation of the text-based ideal point model.

Update (May 11, 2020): See our Colab notebook to run the model online. Our Github code is more complete, and it can be used to reproduce all of our experiments. However, the TBIP is fastest on GPU, so if you do not have access to a GPU you can use Colab's GPUs for free.

Installation for GPU

Configure a virtual environment using Python 3.6+ (instructions here). Inside the virtual environment, use pip to install the required packages:

(venv)$ pip install -r requirements.txt

The main dependencies are Tensorflow (1.14.0) and Tensorflow Probability (0.7.0).

Installation for CPU

To run on CPU, a version of Tensorflow that does not use GPU must be installed. In requirements.txt, comment out the line that says tensorflow-gpu==1.14.0 and uncomment the line that says tensorflow==1.14.0. Note: the script will be noticeably slower on CPU.

Data

Preprocessed Senate speech data for the 114th Congress is included in data/senate-speeches-114. The original data is from [1]. Preprocessed 2020 Democratic presidential candidate tweet data is included in data/candidate-tweets-2020.

To include a customized data set, first create a repo data/{dataset_name}/clean/. The following four files must be inside this folder:

  • counts.npz: a [num_documents, num_words] sparse CSR matrix containing the word counts for each document.
  • author_indices.npy: a [num_documents] vector where each entry is an integer in the set {0, 1, ..., num_authors - 1}, indicating the author of the corresponding document in counts.npz.
  • vocabulary.txt: a [num_words]-length file where each line denotes the corresponding word in the vocabulary.
  • author_map.txt: a [num_authors]-length file where each line denotes the name of an author in the corpus.

See data/senate-speeches-114/clean for an example of what the four files look like for Senate speeches. The script setup/senate_speeches_to_bag_of_words.py contains example code for creating the four files from unprocessed data.

Learning text-based ideal points

Run tbip.py to produce ideal points. For the Senate speech data, use the command:

(venv)$ python tbip.py  --data=senate-speeches-114  --batch_size=512  --max_steps=100000

You can view Tensorboard while training to see summaries of training (including the learned ideal points and ideological topics). To run Tensorboard, use the command:

(venv)$ tensorboard  --logdir=data/senate-speeches-114/tbip-fits/  --port=6006

The command should output a link where you can view the Tensorboard results in real time. The fitted parameters will be stored in data/senate-speeches-114/tbip-fits/params. To perform the above analyses for the 2020 Democratic candidate tweets, replace senate-speeches-114 with candidate-tweets-2020.

To run custom data, we recommend training Poisson factorization before running the TBIP script for best results. If you have custom data stored in data/{dataset_name}/clean/, you can run

(venv)$ python setup/poisson_factorization.py  --data={dataset_name}

The default number of topics is 50. To use a different number of topics, e.g. 100, use the flag --num_topics=100. After Poisson factorization finishes, use the following command to run the TBIP:

(venv)$ python tbip.py  --data={dataset_name}

You can adjust the batch size, learning rate, number of topics, and number of steps by using the flags --batch_size, --learning_rate, --num_topics, and --max_steps, respectively. To run the TBIP without initializing from Poisson factorization, use the flag --pre_initialize_parameters=False. To view the results in Tensorboard, run

(venv)$ tensorboard  --logdir=data/{dataset_name}/tbip-fits/  --port=6006

Again, the learned parameters will be stored in data/{dataset_name}/tbip-fits/params.

Reproducing Paper Results

NOTE: Since the publication of our paper, we have made small changes to the code that have sped up inference. A byproduct of these changes is that the Tensorflow graph has changed, so its random seed does not produce the same results as before the changes, even though the data, model, and inference are all the same. To reproduce the exact paper results, one must git checkout to a version of our repository from before these changes:

(venv)$ git checkout 31d161e

The commands below will reproduce all of the paper results. The following data is required before running the commands:

  • Senate votes: The original raw data can be found at [2]. The paper includes experiments for Senate sessions 111-114. For each Senate session, we need three files: one for votes, one for members, and one for rollcalls. For example, for Senate session 114, we would use the files: S114_votes.csv, S114_members.csv, S114_rollcalls.csv. Make a repo data/senate-votes and store these three files in data/senate-votes/114/raw/. Repeat for Senate sessions 111-113.
  • Senate speeches: The original raw data can be found at [1]. Specifically, we use the hein-daily data for the 114th Senate session. The files needed are speeches_114.txt, descr_114.txt, and 114_SpeakerMap.txt. Make sure the relevant files are stored in data/senate-speeches-114/raw/.
  • Senator tweets: The data was provided to us by Voxgov [3].
  • Senate speech comparisons: We use a separate data set for the Senate speech comparisons because speech debates must be labeled for Wordshoal. The raw data can be found at [4]. The paper includes experiments for Senate sessions 111-113. We need the files speaker_senator_link_file.csv, speeches_Senate_111.tab, speeches_Senate_112.tab, and speeches_Senate_113.tab. These files should all be stored in data/senate-speech-comparisons/raw/.
  • Democratic presidential candidate tweets: Download the raw tweets here and store tweets.csv in the folder data/candidate-tweets-2020/raw/.

Preprocess, run vote ideal point model, and perform analysis for Senate votes

(venv)$ python setup/preprocess_senate_votes.py  --senate_session=111
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=112
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=113
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=114
(venv)$ python setup/vote_ideal_points.py  --senate_session=111
(venv)$ python setup/vote_ideal_points.py  --senate_session=112
(venv)$ python setup/vote_ideal_points.py  --senate_session=113
(venv)$ python setup/vote_ideal_points.py  --senate_session=114
(venv)$ python analysis/analyze_vote_ideal_points.py

Preprocess, run the TBIP, and perform analysis for Senate speeches for the 114th Senate

(venv)$ python setup/senate_speeches_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=senate-speeches-114
(venv)$ python tbip.py  --data=senate-speeches-114  --counts_transformation=log  --batch_size=512  --max_steps=150000
(venv)$ python analysis/analyze_senate_speeches.py

Preprocess, run the TBIP and Wordfish, and perform analysis for tweets from senators during the 114th Senate

(venv)$ python setup/senate_tweets_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=senate-tweets-114
(venv)$ python tbip.py  --data=senate-tweets-114  --batch_size=1024  --max_steps=100000
(venv)$ python model_comparison/wordfish.py  --data=senate-tweets-114  --max_steps=50000
(venv)$ python analysis/analyze_senate_tweets.py

Preprocess and run the TBIP for Senate speech comparisons

(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=111
(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=112
(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=113
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=111
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=112
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=113
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=111  --batch_size=128
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=112  --batch_size=128
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=113  --batch_size=128

Run Wordfish for Senate speech comparisons

(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=111
(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=112 
(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=113

Run Wordshoal for Senate speech comparisons

(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=111  --batch_size=1024
(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=112  --batch_size=1024
(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=113  --batch_size=1024

Analyze results for Senate speech comparisons

(venv)$ python analysis/compare_tbip_wordfish_wordshoal.py

Preprocess, run the TBIP, and perform analysis for Democratic candidate tweets

(venv)$ python setup/candidate_tweets_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=candidate-tweets-2020
(venv)$ python tbip.py  --data=candidate-tweets-2020  --batch_size=1024  --max_steps=100000
(venv)$ python analysis/analyze_candidate_tweets.py

Make figures

(venv)$ python analysis/make_figures.py

References

[1] Matthew Gentzkow, Jesse M. Shapiro, and Matt Taddy. Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts. Palo Alto, CA: Stanford Libraries [distributor], 2018-01-16. https://data.stanford.edu/congress_text

[2] Jeffrey B. Lewis, Keith Poole, Howard Rosenthal, Adam Boche, Aaron Rudkin, and Luke Sonnet (2020). Voteview: Congressional Roll-Call Votes Database. https://voteview.com/

[3] VoxGovFEDERAL, U.S. Senators tweets from the 114th Congress. 2020. https://voxgov.com

[4] Benjamin E. Lauderdale and Alexander Herzog. Replication Data for: Measuring Political Positions from Legislative Speech. In Harvard Dataverse, 2016. https://doi.org/10.7910/DVN/RQMIV3

Owner
Keyon Vafa
Keyon Vafa
AutoPentest-DRL: Automated Penetration Testing Using Deep Reinforcement Learning

AutoPentest-DRL: Automated Penetration Testing Using Deep Reinforcement Learning AutoPentest-DRL is an automated penetration testing framework based o

Cyber Range Organization and Design Chair 217 Jan 01, 2023
Gradient Step Denoiser for convergent Plug-and-Play

Source code for the paper "Gradient Step Denoiser for convergent Plug-and-Play"

Samuel Hurault 11 Sep 17, 2022
PyTorch implementation for Graph Contrastive Learning with Augmentations

Graph Contrastive Learning with Augmentations PyTorch implementation for Graph Contrastive Learning with Augmentations [poster] [appendix] Yuning You*

Shen Lab at Texas A&M University 382 Dec 15, 2022
Efficient Training of Visual Transformers with Small Datasets

Official codes for "Efficient Training of Visual Transformers with Small Datasets", NerIPS 2021.

Yahui Liu 112 Dec 25, 2022
The source code of the paper "SHGNN: Structure-Aware Heterogeneous Graph Neural Network"

SHGNN: Structure-Aware Heterogeneous Graph Neural Network The source code and dataset of the paper: SHGNN: Structure-Aware Heterogeneous Graph Neural

Wentao Xu 7 Nov 13, 2022
PyTorch implementation of Neural Dual Contouring.

NDC PyTorch implementation of Neural Dual Contouring. Citation We are still writing the paper while adding more improvements and applications. If you

Zhiqin Chen 140 Dec 26, 2022
Constructing interpretable quadratic accuracy predictors to serve as an objective function for an IQCQP problem that represents NAS under latency constraints and solve it with efficient algorithms.

IQNAS: Interpretable Integer Quadratic programming Neural Architecture Search Realistic use of neural networks often requires adhering to multiple con

0 Oct 24, 2021
NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go

NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go This repository provides our implementation of the CVPR 2021 paper NeuroMorp

Meta Research 35 Dec 08, 2022
WPPNets: Unsupervised CNN Training with Wasserstein Patch Priors for Image Superresolution

WPPNets: Unsupervised CNN Training with Wasserstein Patch Priors for Image Superresolution This code belongs to the paper [1] available at https://arx

Fabian Altekrueger 5 Jun 02, 2022
This is the official Pytorch-version code of FlatGCN (Flattened Graph Convolutional Networks for Recommendation).

FlatGCN This is the official Pytorch-version code of FlatGCN (Flattened Graph Convolutional Networks for Recommendation, submitted to ICASSP2022). Req

Dreamer 2 Aug 09, 2022
sense-py-AnishaBaishya created by GitHub Classroom

Compute Statistics Here we compute statistics for a bunch of numbers. This project uses the unittest framework to test functionality. Pass the tests T

1 Oct 21, 2021
Joint project of the duo Hacker Ninjas

Project Smoothie Společný projekt dua Hacker Ninjas. První pokus o hříčku po třech týdnech učení se programování. Jakub Kolář e:\

Jakub Kolář 2 Jan 07, 2022
Inverse Rendering for Complex Indoor Scenes: Shape, Spatially-Varying Lighting and SVBRDF From a Single Image

Inverse Rendering for Complex Indoor Scenes: Shape, Spatially-Varying Lighting and SVBRDF From a Single Image (Project page) Zhengqin Li, Mohammad Sha

209 Jan 05, 2023
EGNN - Implementation of E(n)-Equivariant Graph Neural Networks, in Pytorch

EGNN - Pytorch Implementation of E(n)-Equivariant Graph Neural Networks, in Pytorch. May be eventually used for Alphafold2 replication. This

Phil Wang 259 Jan 04, 2023
Official repository of the AAAI'2022 paper "Contrast and Generation Make BART a Good Dialogue Emotion Recognizer"

CoG-BART Contrast and Generation Make BART a Good Dialogue Emotion Recognizer Quick Start: To run the model on test sets of four datasets, Download th

39 Dec 24, 2022
The self-supervised goal reaching benchmark introduced in Discovering and Achieving Goals via World Models

Lexa-Benchmark Codebase for the self-supervised goal reaching benchmark introduced in 'Discovering and Achieving Goals via World Models'. Setup Create

1 Oct 14, 2021
Probabilistic Programming and Statistical Inference in PyTorch

PtStat Probabilistic Programming and Statistical Inference in PyTorch. Introduction This project is being developed during my time at Cogent Labs. The

Stefano Peluchetti 109 Nov 26, 2022
A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking

PoseRBPF: A Rao-Blackwellized Particle Filter for 6D Object Pose Tracking PoseRBPF Paper Self-supervision Paper Pose Estimation Video Robot Manipulati

NVIDIA Research Projects 107 Dec 25, 2022
IJCAI2020 & IJCV 2020 :city_sunrise: Unsupervised Scene Adaptation with Memory Regularization in vivo

Seg_Uncertainty In this repo, we provide the code for the two papers, i.e., MRNet:Unsupervised Scene Adaptation with Memory Regularization in vivo, IJ

Zhedong Zheng 348 Jan 05, 2023
TransCD: Scene Change Detection via Transformer-based Architecture

TransCD: Scene Change Detection via Transformer-based Architecture

wangzhixue 29 Dec 11, 2022