System-oriented IR evaluations are limited to rather abstract understandings of real user behavior

Last update: Nov 23, 2022


Validating Simulations of User Query Variants

This repository contains the scripts for the experiments and evaluations, the simulated queries, and the figures of the following paper:

Timo Breuer, Norbert Fuhr, and Philipp Schaer. 2022. Validating Simulations of User Query Variants. In Proceedings of the 44th European Conference on IR Research, ECIR 2022.

System-oriented IR evaluations are limited to rather abstract understandings of real user behavior. As a solution, simulating user interactions provides a cost-efficient way to support system-oriented experiments with more realistic directives when no interaction logs are available. While there are several user models for simulated clicks or result list interactions, very few attempts have been made towards query simulations, and it has not been investigated if these can reproduce properties of real queries. In this work, we validate simulated user query variants with the help of TREC test collections in reference to real user queries that were made for the corresponding topics. Besides, we introduce a simple yet effective method that gives better reproductions of real queries than the established methods. Our evaluation framework validates the simulations regarding the retrieval performance, reproducibility of topic score distributions, shared task utility, effort and effect, and query term similarity when compared with real user query variants. While the retrieval effectiveness and statistical properties of the topic score distributions as well as economic aspects are close to that of real queries, it is still challenging to simulate exact term matches and later query reformulations.

Directory overview

| Directory | Description |
| --- | --- |
| `config/` | Contains configuration files for the query simulations, experiments, and evaluations. |
| `data/` | Contains (intermediate) output data of the simulations and experiments as well as the figures of the paper. |
| `eval/` | Contains scripts of the experiments and evaluations. |
| `sim/` | Contains scripts of the query simulations. |

Setup

Install Anserini and index Core17 (The New York Times Annotated Corpus) according to the regression guide:

anserini/target/appassembler/bin/IndexCollection \
    -collection NewYorkTimesCollection \
    -input /path/to/core17/ \
    -index anserini/indexes/lucene-index.core17 \
    -generator DefaultLuceneDocumentGenerator \
    -threads 4 \
    -storePositions \
    -storeDocvectors \
    -storeRaw \
    -storeContents \
    > anserini/logs/log.core17 &

Install the required Python packages:

pip install -r requirements.txt

Query simulation

To prepare the language models and simulate the queries, the scripts have to be executed in the order shown in the following table. All outputs can be found in the `data/` directory. For better code readability, the names of the query reformulation strategies have been mapped (paper name → code name): S1 → S1; S2 → S2; S2' → S3; S3 → S4; S3' → S5; S4 → S6; S4' → S7; S4'' → S8. The names of the scripts and output files comply with this name mapping.
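
The name mapping can also be written as a small lookup table, which may help when relating output files back to the strategies in the paper. This is only an illustrative sketch: the names `PAPER_TO_CODE`, `CODE_TO_PAPER`, and `to_paper_name` are hypothetical and not taken from the repository.

```python
# Strategy names as used in the paper, mapped to the names used in the
# scripts and output files (taken from the mapping listed above).
PAPER_TO_CODE = {
    "S1": "S1",
    "S2": "S2",
    "S2'": "S3",
    "S3": "S4",
    "S3'": "S5",
    "S4": "S6",
    "S4'": "S7",
    "S4''": "S8",
}

# Reverse lookup (code name -> paper name); the mapping is one-to-one.
CODE_TO_PAPER = {code: paper for paper, code in PAPER_TO_CODE.items()}


def to_paper_name(code_name: str) -> str:
    """Translate a strategy name from the scripts back to the paper's name."""
    return CODE_TO_PAPER[code_name]
```

For example, `to_paper_name("S5")` gives `"S3'"`, so `data/queries/s12345.csv` covers the paper's strategies S1 to S3'.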

| Script | Description | Output files |
| --- | --- | --- |
| `sim/make_background.py` | Make the background language model from all index terms of Core17. The background model is required for Controlled Query Generation (CQG) by Jordan et al. | `data/lm/background.csv` |
| `sim/make_cqg.py` | Make the CQG language models with different values of the parameter lambda from 0.0 to 1.0. | `data/lm/cqg.json` |
| `sim/simulate_queries_s12345.py` | Simulate TTS and KIS queries with strategies S1 to S3' | `data/queries/s12345.csv` |
| `sim/simulate_queries_s678.py` | Simulate TTS and KIS queries with strategies S4 to S4'' | `data/queries/s678.csv` |
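
To illustrate how a background model and a lambda parameter can interact, the following sketch builds a unigram model from term counts and mixes it with a topic model by linear interpolation. This is an assumption for illustration only: it does not reproduce `sim/make_background.py` or `sim/make_cqg.py`, the function names and toy counts are made up, and the direction of lambda (here, larger lambda means more weight on the background model) may differ from the repository.

```python
from collections import Counter


def unigram_model(term_counts):
    """Maximum-likelihood unigram model: P(t) = count(t) / total count."""
    total = sum(term_counts.values())
    return {term: count / total for term, count in term_counts.items()}


def interpolate(topic_model, background, lam):
    """Linear mixture (1 - lam) * P_topic(t) + lam * P_background(t).

    Assumption: lam weights the background model; the repository may
    parameterize the mixture differently.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    vocabulary = set(topic_model) | set(background)
    return {
        term: (1.0 - lam) * topic_model.get(term, 0.0)
        + lam * background.get(term, 0.0)
        for term in vocabulary
    }


# Toy counts (made up, not taken from Core17).
collection_counts = Counter({"the": 6, "election": 2, "climate": 2})
topic_counts = Counter({"climate": 3, "the": 1})

background = unigram_model(collection_counts)
topic = unigram_model(topic_counts)

# One mixed model per lambda in 0.0, 0.1, ..., 1.0.
models = {step / 10: interpolate(topic, background, step / 10) for step in range(11)}
```

Under this parameterization, `models[0.0]` equals the topic model and `models[1.0]` equals the background model, with each intermediate model still summing to one.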

Experimental evaluation and results

To reproduce the experiments of the study, the scripts have to be executed in the order shown in the following table.

| Script | Description | Output files | Reproduction of ... |
| --- | --- | --- | --- |
| `eval/arp.py`, `eval/arp_first.py`, `eval/arp_max.py` | Retrieval performance: Evaluate the Average Retrieval Performance (ARP). | `data/experimental_results/arp.csv`, `data/experimental_results/arp_first.csv`, `data/experimental_results/arp_max.csv` | `Tab. A.1` |
| `eval/rmse_s12345.py`, `eval/rmse_s678.py` | Retrieval performance: Evaluate the Root-Mean-Square-Error (RMSE). | `data/experimental_results/rmse_map.csv`, `data/experimental_results/rmse_ndcg.csv`, `data/experimental_results/rmse_p1000.csv`, `data/experimental_results/rmse_uqv_vs_s12345_kis_ndcg.csv`, `data/experimental_results/rmse_uqv_vs_s12345_tts_ndcg.csv`, `data/figures/rmse_map.pdf`, `data/figures/rmse_ndcg.pdf`, `data/figures/rmse_p1000.pdf`, `data/figures/rmse_uqv_vs_s12345_kis_ndcg.pdf`, `data/figures/rmse_uqv_vs_s12345_tts_ndcg.pdf` | `Fig. A.1`, `Fig. 1` |
| `eval/t-test.py` | Retrieval performance: Evaluate the p-values of paired t-tests. | `data/experimental_results/ttest.csv`, `data/figures/ttest.pdf` | `Fig. A.2` |
| `eval/system_orderings.py` | Shared task utility: Evaluate Kendall's tau between relative system orderings. | `data/experimental_results/system_orderings.csv`, `data/figures/system_orderings.pdf` | `Fig. 2 (left)` |
| `eval/sdcg.py` | Effort and effect: Evaluate the Session Discounted Cumulative Gain (sDCG). | `data/experimental_results/sdcg_3queries.csv`, `data/experimental_results/sdcg_5queries.csv`, `data/experimental_results/sdcg_10queries.csv`, `data/figures/sdcg_3queries.pdf`, `data/figures/sdcg_5queries.pdf`, `data/figures/sdcg_10queries.pdf` | `Fig. 3 (top)` |
| `eval/economic.py` | Effort and effect: Evaluate tradeoffs between number of queries and browsing depth by isoquants. | `data/experimental_results/economic0.3.csv`, `data/experimental_results/economic0.4.csv`, `data/experimental_results/economic0.5.csv`, `data/figures/economic0.3.pdf`, `data/figures/economic0.4.pdf`, `data/figures/economic0.5.pdf` | `Fig. 3 (bottom)` |
| `eval/jaccard_similarity.py` | Query term similarity: Evaluate query term similarities. | `data/experimental_results/jacc.csv`, `data/figures/jacc.pdf` | `Fig. 2 (right)` |
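
For readers unfamiliar with the measures named in the table, the sketch below shows textbook versions of three of them: Jaccard similarity between query term sets, RMSE between per-topic scores, and sDCG over a session of queries. These are minimal illustrations and not the code of the `eval/` scripts. The function names, the whitespace tokenization, and the log bases `b=2` and `bq=4` are assumptions; the scripts may use different settings.

```python
import math


def jaccard(query_a, query_b):
    """Jaccard similarity of the term sets of two queries."""
    terms_a = set(query_a.lower().split())
    terms_b = set(query_b.lower().split())
    union = terms_a | terms_b
    if not union:
        return 0.0  # convention chosen here for two empty queries
    return len(terms_a & terms_b) / len(union)


def rmse(scores_a, scores_b):
    """Root-mean-square error between two equally long score lists."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    squared = [(a - b) ** 2 for a, b in zip(scores_a, scores_b)]
    return math.sqrt(sum(squared) / len(squared))


def dcg(gains, b=2):
    """Discounted cumulative gain; ranks below b are not discounted."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain / max(1.0, math.log(rank, b))
    return total


def sdcg(session_gains, b=2, bq=4):
    """Session DCG: the DCG of the q-th query is divided by 1 + log_bq(q)."""
    total = 0.0
    for position, gains in enumerate(session_gains, start=1):
        total += dcg(gains, b) / (1.0 + math.log(position, bq))
    return total
```

As a usage example, `jaccard("climate change effects", "effects of climate change")` compares three shared terms against a union of four, and `sdcg([[1, 0], [1]])` discounts the gain of the second query more strongly than that of the first.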


Owner

IR Group at Technische Hochschule Köln
