Text-Based Ideal Points

Related tags

Deep Learningtbip
Overview

Text-Based Ideal Points

Source code for the paper: Text-Based Ideal Points by Keyon Vafa, Suresh Naidu, and David Blei (ACL 2020).

Update (June 29, 2020): We have added interactive visualizations of topics learned by our model.

Update (May 25, 2020): We have added a PyTorch implementation of the text-based ideal point model.

Update (May 11, 2020): See our Colab notebook to run the model online. Our Github code is more complete, and it can be used to reproduce all of our experiments. However, the TBIP is fastest on GPU, so if you do not have access to a GPU you can use Colab's GPUs for free.

Installation for GPU

Configure a virtual environment using Python 3.6+ (instructions here). Inside the virtual environment, use pip to install the required packages:

(venv)$ pip install -r requirements.txt

The main dependencies are Tensorflow (1.14.0) and Tensorflow Probability (0.7.0).

Installation for CPU

To run on CPU, a version of Tensorflow that does not use GPU must be installed. In requirements.txt, comment out the line that says tensorflow-gpu==1.14.0 and uncomment the line that says tensorflow==1.14.0. Note: the script will be noticeably slower on CPU.

Data

Preprocessed Senate speech data for the 114th Congress is included in data/senate-speeches-114. The original data is from [1]. Preprocessed 2020 Democratic presidential candidate tweet data is included in data/candidate-tweets-2020.

To include a customized data set, first create a repo data/{dataset_name}/clean/. The following four files must be inside this folder:

  • counts.npz: a [num_documents, num_words] sparse CSR matrix containing the word counts for each document.
  • author_indices.npy: a [num_documents] vector where each entry is an integer in the set {0, 1, ..., num_authors - 1}, indicating the author of the corresponding document in counts.npz.
  • vocabulary.txt: a [num_words]-length file where each line denotes the corresponding word in the vocabulary.
  • author_map.txt: a [num_authors]-length file where each line denotes the name of an author in the corpus.

See data/senate-speeches-114/clean for an example of what the four files look like for Senate speeches. The script setup/senate_speeches_to_bag_of_words.py contains example code for creating the four files from unprocessed data.

Learning text-based ideal points

Run tbip.py to produce ideal points. For the Senate speech data, use the command:

(venv)$ python tbip.py  --data=senate-speeches-114  --batch_size=512  --max_steps=100000

You can view Tensorboard while training to see summaries of training (including the learned ideal points and ideological topics). To run Tensorboard, use the command:

(venv)$ tensorboard  --logdir=data/senate-speeches-114/tbip-fits/  --port=6006

The command should output a link where you can view the Tensorboard results in real time. The fitted parameters will be stored in data/senate-speeches-114/tbip-fits/params. To perform the above analyses for the 2020 Democratic candidate tweets, replace senate-speeches-114 with candidate-tweets-2020.

To run custom data, we recommend training Poisson factorization before running the TBIP script for best results. If you have custom data stored in data/{dataset_name}/clean/, you can run

(venv)$ python setup/poisson_factorization.py  --data={dataset_name}

The default number of topics is 50. To use a different number of topics, e.g. 100, use the flag --num_topics=100. After Poisson factorization finishes, use the following command to run the TBIP:

(venv)$ python tbip.py  --data={dataset_name}

You can adjust the batch size, learning rate, number of topics, and number of steps by using the flags --batch_size, --learning_rate, --num_topics, and --max_steps, respectively. To run the TBIP without initializing from Poisson factorization, use the flag --pre_initialize_parameters=False. To view the results in Tensorboard, run

(venv)$ tensorboard  --logdir=data/{dataset_name}/tbip-fits/  --port=6006

Again, the learned parameters will be stored in data/{dataset_name}/tbip-fits/params.

Reproducing Paper Results

NOTE: Since the publication of our paper, we have made small changes to the code that have sped up inference. A byproduct of these changes is that the Tensorflow graph has changed, so its random seed does not produce the same results as before the changes, even though the data, model, and inference are all the same. To reproduce the exact paper results, one must git checkout to a version of our repository from before these changes:

(venv)$ git checkout 31d161e

The commands below will reproduce all of the paper results. The following data is required before running the commands:

  • Senate votes: The original raw data can be found at [2]. The paper includes experiments for Senate sessions 111-114. For each Senate session, we need three files: one for votes, one for members, and one for rollcalls. For example, for Senate session 114, we would use the files: S114_votes.csv, S114_members.csv, S114_rollcalls.csv. Make a repo data/senate-votes and store these three files in data/senate-votes/114/raw/. Repeat for Senate sessions 111-113.
  • Senate speeches: The original raw data can be found at [1]. Specifically, we use the hein-daily data for the 114th Senate session. The files needed are speeches_114.txt, descr_114.txt, and 114_SpeakerMap.txt. Make sure the relevant files are stored in data/senate-speeches-114/raw/.
  • Senator tweets: The data was provided to us by Voxgov [3].
  • Senate speech comparisons: We use a separate data set for the Senate speech comparisons because speech debates must be labeled for Wordshoal. The raw data can be found at [4]. The paper includes experiments for Senate sessions 111-113. We need the files speaker_senator_link_file.csv, speeches_Senate_111.tab, speeches_Senate_112.tab, and speeches_Senate_113.tab. These files should all be stored in data/senate-speech-comparisons/raw/.
  • Democratic presidential candidate tweets: Download the raw tweets here and store tweets.csv in the folder data/candidate-tweets-2020/raw/.

Preprocess, run vote ideal point model, and perform analysis for Senate votes

(venv)$ python setup/preprocess_senate_votes.py  --senate_session=111
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=112
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=113
(venv)$ python setup/preprocess_senate_votes.py  --senate_session=114
(venv)$ python setup/vote_ideal_points.py  --senate_session=111
(venv)$ python setup/vote_ideal_points.py  --senate_session=112
(venv)$ python setup/vote_ideal_points.py  --senate_session=113
(venv)$ python setup/vote_ideal_points.py  --senate_session=114
(venv)$ python analysis/analyze_vote_ideal_points.py

Preprocess, run the TBIP, and perform analysis for Senate speeches for the 114th Senate

(venv)$ python setup/senate_speeches_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=senate-speeches-114
(venv)$ python tbip.py  --data=senate-speeches-114  --counts_transformation=log  --batch_size=512  --max_steps=150000
(venv)$ python analysis/analyze_senate_speeches.py

Preprocess, run the TBIP and Wordfish, and perform analysis for tweets from senators during the 114th Senate

(venv)$ python setup/senate_tweets_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=senate-tweets-114
(venv)$ python tbip.py  --data=senate-tweets-114  --batch_size=1024  --max_steps=100000
(venv)$ python model_comparison/wordfish.py  --data=senate-tweets-114  --max_steps=50000
(venv)$ python analysis/analyze_senate_tweets.py

Preprocess and run the TBIP for Senate speech comparisons

(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=111
(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=112
(venv)$ python setup/preprocess_senate_speech_comparisons.py  --senate_session=113
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=111
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=112
(venv)$ python setup/poisson_factorization.py  --data=senate-speech-comparisons  --senate_session=113
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=111  --batch_size=128
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=112  --batch_size=128
(venv)$ python tbip.py  --data=senate-speech-comparisons  --max_steps=200000  --senate_session=113  --batch_size=128

Run Wordfish for Senate speech comparisons

(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=111
(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=112 
(venv)$ python model_comparison/wordfish.py  --data=senate-speech-comparisons  --max_steps=50000  --senate_session=113

Run Wordshoal for Senate speech comparisons

(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=111  --batch_size=1024
(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=112  --batch_size=1024
(venv)$ python model_comparison/wordshoal.py  --data=senate-speech-comparisons  --max_steps=30000  --senate_session=113  --batch_size=1024

Analyze results for Senate speech comparisons

(venv)$ python analysis/compare_tbip_wordfish_wordshoal.py

Preprocess, run the TBIP, and perform analysis for Democratic candidate tweets

(venv)$ python setup/candidate_tweets_to_bag_of_words.py
(venv)$ python setup/poisson_factorization.py  --data=candidate-tweets-2020
(venv)$ python tbip.py  --data=candidate-tweets-2020  --batch_size=1024  --max_steps=100000
(venv)$ python analysis/analyze_candidate_tweets.py

Make figures

(venv)$ python analysis/make_figures.py

References

[1] Matthew Gentzkow, Jesse M. Shapiro, and Matt Taddy. Congressional Record for the 43rd-114th Congresses: Parsed Speeches and Phrase Counts. Palo Alto, CA: Stanford Libraries [distributor], 2018-01-16. https://data.stanford.edu/congress_text

[2] Jeffrey B. Lewis, Keith Poole, Howard Rosenthal, Adam Boche, Aaron Rudkin, and Luke Sonnet (2020). Voteview: Congressional Roll-Call Votes Database. https://voteview.com/

[3] VoxGovFEDERAL, U.S. Senators tweets from the 114th Congress. 2020. https://voxgov.com

[4] Benjamin E. Lauderdale and Alexander Herzog. Replication Data for: Measuring Political Positions from Legislative Speech. In Harvard Dataverse, 2016. https://doi.org/10.7910/DVN/RQMIV3

Owner
Keyon Vafa
Keyon Vafa
It helps user to learn Pick-up lines and share if he has a better one

Pick-up-Lines-Generator(Open Source) It helps user to learn Pick-up lines Share and Add one or many to the DataBase Unique SQLite DataBase AI Undercon

knock_nott 0 May 04, 2022
TinyML Cookbook, published by Packt

TinyML Cookbook This is the code repository for TinyML Cookbook, published by Packt. Author: Gian Marco Iodice Publisher: Packt About the book This bo

Packt 93 Dec 29, 2022
Repositório da disciplina de APC, no segundo semestre de 2021

NOTAS FINAIS: https://github.com/fabiommendes/apc2018/blob/master/nota-final.pdf Algoritmos e Programação de Computadores Este é o Git da disciplina A

16 Dec 16, 2022
Learning cell communication from spatial graphs of cells

ncem Features Repository for the manuscript Fischer, D. S., Schaar, A. C. and Theis, F. Learning cell communication from spatial graphs of cells. 2021

Theis Lab 77 Dec 30, 2022
An introduction to bioimage analysis - http://bioimagebook.github.io

Introduction to Bioimage Analysis This book tries explain the main ideas of image analysis in a practical and engaging way. It's written primarily for

Bioimage Book 20 Nov 28, 2022
Create UIs for prototyping your machine learning model in 3 minutes

Note: We just launched Hosted, where anyone can upload their interface for permanent hosting. Check it out! Welcome to Gradio Quickly create customiza

Gradio 11.7k Jan 07, 2023
particle tracking model, works with the ROMS output file(qck.nc, his.nc)

particle-tracking-model-for-ROMS particle tracking model, works with the ROMS output file(qck.nc, his.nc) description this is a 2-dimensional particle

xusheng 1 Jan 11, 2022
A library to inspect itermediate layers of PyTorch models.

A library to inspect itermediate layers of PyTorch models. Why? It's often the case that we want to inspect intermediate layers of a model without mod

archinet.ai 380 Dec 28, 2022
An implementation of EWC with PyTorch

EWC.pytorch An implementation of Elastic Weight Consolidation (EWC), proposed in James Kirkpatrick et al. Overcoming catastrophic forgetting in neural

Ryuichiro Hataya 166 Dec 22, 2022
VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning

    VarCLR: Variable Representation Pre-training via Contrastive Learning New: Paper accepted by ICSE 2022. Preprint at arXiv! This repository contain

squaresLab 32 Oct 24, 2022
PyTorch implementation of Value Iteration Networks (VIN): Clean, Simple and Modular. Visualization in Visdom.

VIN: Value Iteration Networks This is an implementation of Value Iteration Networks (VIN) in PyTorch to reproduce the results.(TensorFlow version) Key

Xingdong Zuo 215 Dec 07, 2022
Tensors and Dynamic neural networks in Python with strong GPU acceleration

PyTorch is a Python package that provides two high-level features: Tensor computation (like NumPy) with strong GPU acceleration Deep neural networks b

61.4k Jan 04, 2023
CROSS-LINGUAL ABILITY OF MULTILINGUAL BERT: AN EMPIRICAL STUDY

M-BERT-Study CROSS-LINGUAL ABILITY OF MULTILINGUAL BERT: AN EMPIRICAL STUDY Motivation Multilingual BERT (M-BERT) has shown surprising cross lingual a

CogComp 1 Feb 28, 2022
Large-scale language modeling tutorials with PyTorch

Large-scale language modeling tutorials with PyTorch 안녕하세요. 저는 TUNiB에서 머신러닝 엔지니어로 근무 중인 고현웅입니다. 이 자료는 대규모 언어모델 개발에 필요한 여러가지 기술들을 소개드리기 위해 마련하였으며 기본적으로

TUNiB 172 Dec 29, 2022
Codes for AAAI22 paper "Learning to Solve Travelling Salesman Problem with Hardness-Adaptive Curriculum"

Paper For more details, please see our paper Learning to Solve Travelling Salesman Problem with Hardness-Adaptive Curriculum which has been accepted a

14 Sep 30, 2022
This repo provides the base code for pytorch-lightning and weight and biases simultaneous integration.

Write your model faster with pytorch-lightning-wadb-code-backbone This repository provides the base code for pytorch-lightning and weight and biases s

9 Mar 29, 2022
This is a project based on ConvNets used to identify whether a road is clean or dirty. We have used MobileNet as our base architecture and the weights are based on imagenet.

PROJECT TITLE: CLEAN/DIRTY ROAD DETECTION USING TRANSFER LEARNING Description: This is a project based on ConvNets used to identify whether a road is

Faizal Karim 3 Nov 06, 2022
Official code release for "GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis"

GRAF This repository contains official code for the paper GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis. You can find detailed usage i

349 Dec 29, 2022
CTF challenges from redpwnCTF 2021

redpwnCTF 2021 Challenges This repository contains challenges from redpwnCTF 2021 in the rCDS format; challenge information is in the challenge.yaml f

redpwn 27 Dec 07, 2022
A hand tracking demo made with mediapipe where you can control lights with pinching your fingers and moving your hand up/down.

HandTrackingBrightnessControl A hand tracking demo made with mediapipe where you can control lights with pinching your fingers and moving your hand up

Teemu Laurila 19 Feb 12, 2022