Research Artifact of USENIX Security 2022 Paper: Automated Side Channel Analysis of Media Software with Manifold Learning

Overview

Python 3.8 Pytorch 1.4

Automated Side Channel Analysis of Media Software with Manifold Learning

Official implementation of USENIX Security 2022 paper: Automated Side Channel Analysis of Media Software with Manifold Learning.

Paper link: https://www.usenix.org/conference/usenixsecurity22/presentation/yuan

Extended version: https://arxiv.org/pdf/2112.04947.pdf

Note

Warning: This repo is provided as-is and is only for research purposes. Please use it only on test systems with no sensitive data. You are responsible for protecting yourself, your data, and others from potential risks caused by this repo.

Updates

  • 2021 Oct 9. Released data, code and trained models.

Requirements

  • To build from source code, install following requirements.

    pip install torch==1.4.0
    pip install torchvision==0.5.0
    pip install numpy==1.18.5
    pip install pillow==7.2.0
    pip install opencv-python
    pip install scipy==1.5.0
    pip install matplotlib==3.2.2
    pip install librosa==0.7.2
    pip install progressbar
    

    Then type

    git clone https://github.com/Yuanyuan-Yuan/Manifold-SCA
    cd Manifold-SCA
    export MANIFOLD_SCA=$PWD
  • If you would like to build this repo from docker, see DOCKER and skip the following steps except for 6.1.

0. Output

We provide data and our trained models here.

Run the following scripts to produce outputs from these data samples

python output.py --dataset="CelebA"
python output_blind.py --dataset="CelebA"
python output_noise.py --dataset="CelebA"

You can choose dataset from ["CelebA", "ChestX-ray", "SC09", "Sub-URMP", "COCO", "DailyDialog"]. Results will be saved in output.

We also provide all our outputs.

1. Datasets

We provide sampels of our processed data here.

CelebA

Download the CelebA dataset from here. We use the Align&Cropped Images version.

After downloading the dataset, go to tool. Then run

python crop_celeba.py --input_dir="/path/to/unzipped_images" --output_dir="/path/to/cropped_images"

to crop all images to size of 128*128. We provide several examples in data/CelebA_crop128/image.

ChestX-ray

Download the ChestX-ray dataset from here.

After downloading the dataset, go to tool. Then run

python resize_chest.py --input_dir="/path/to/unzipped_images" --output_dir="/path/to/resized_images"

to convert all images to JPEG format and resize them to size of 128*128. We provide several examples in data/ChestX-ray_jpg128/image.

SC09 & Sub-URMP

Download the SC09 dataset from here and Sub-URMP dataset here.

We process audios in the Log-amplitude of Mel Spectrum (LMS) form, which is a 2D representation. Once the dataset is downloaded, go to tool and run

python audio2lms.py --dataset="{SC09} or {Sub-URMP}" --input_dir="/path/to/audios" --output_dir="/path/to/lms"

to covert all audios to their LMS representations. Several examples are provided in data/SC09/lms and data/Sub-URMP/lms respectively.

COCO-Caption & DailyDialog

Download COCO captions from here. We use the 2014 Train/Val annotations. After downloading you need to extract captions from captions_train2014.json and captions_val2014.json. We provide several examples in data/COCO/text/train.json and data/COCO/text/val.json.

Download DailyDialog dataset from here. After downloading you will have dialogues_train.txt and dialogues_test.txt. We suggest you store the sentences in json files. Several examples are given in data/DailyDialog/text/train.json and data/DailyDialog/text/test.json.

Once the sentences are prepared, you need to build the corresponding vocabulary. Go to tool and run

python build_vocab.py input_path="/path/to/sentences.json" --output_path="/path/to/vocabulary.json" --freq=minimal_word_frequency

to build the vocabulary. We provide our vocabularies in data/COCO/text/word_dict_freq5.json and data/DailyDialog/text/word_dict_freq5.json.

2. Target Software

Install libjpeg, hunspell and ffmpeg.

We already set up the three software in DOCKER.

3. Side Channel Attack

We analyze three common side channels, namely, cache bank, cache line and page tables.

3.1. Prepare Data

We use Intel Pin (Ver. 3.11) to collect the accessed memory addresses of the target software when processing media data. We already set up the Pin in DOCKER.

We provide our pintool in pin/pintool/mem_access.cpp. Download Pin from here and unzip it to PIN_ROOT (specify this path by yourself).

To prepare accessed memory addresses of libjpeg when processing CelebA images, first put pin/pintool/mem_access.cpp into /PIN_ROOT/source/tools/ManualExamples/ and run

make obj-intel64/mem_access.so TARGET=intel64

to compile the pintool. Before collect the memory address, remember to run

setarch $(uname -m) -R /bin/bash

in your bash to disable ASLR. In fact, the ASLR does not affect our approach, so you can also drop the randomized bits of collected memory address.

Then put pin/prep_celeba.py into /PIN_ROOT/source/tools/ManualExamples/ and set the following variables:

  • input_dir - Directory of media data.
  • npz_dir - Directory where the accessed memory addresses will be saved. Addresses of each media data will be saved in .npz format.
  • raw_dir - Directory where the raw output of our pintool will be saved. These files will be used for localize side channel vulnerabilities.
  • libjpeg_path - Path to the executable file of libjpeg.

You can speed up the progress by running multiple processes. Go to /PIN_ROOT/source/tools/ManualExamples/ and simply set variable total_num in *.py to the number of processes and run

python prep_celeba.py --ID=id_starting_from_1

to prepare data. Follow the same procedure for other datasets.

We provide our collected side channel records of all datasets here.

3.2. Map Memory Addresses to Side Channels

We map the collected memory addresses addr to side channels according to the following table.

CPU Cache Bank Index CPU Cache Line Index OS Page Table Index
addr >> 2 addr >> 6 addr >> 12

Set the following variables in tool/addr2side.py.

  • input_dir - Directory of collected .npz files recording accessed memory addresses.
  • cachebank_dir - Directory of converted cache bank indexes.
  • cacheline_dir - Directory of converted cache line indexes.
  • pagetable_dir - Directory of converted page table indexes.

Then run

python addr2side.py

to get the side channels records. You can also speed up the progress by running multiple processes.

3.3. Reconstruct Private Media Data

You need to first customize following data directories in code/data_path.json.

{ 
    "dataset_name": {
        "media": "/path/to/media_data/",
        "cachebank": "/path/to/cache_bank/",
        "cacheline": "/path/to/cache_line/",
        "pagetable": "/path/to/page_table/",
        "split": ["train", "test"]
        },
}

To approximate the manifold of face photos from cache line indexes, go to code and run

python recons_image.py --exp_name="CelebA_cacheline" --dataset="CelebA" --side="cacheline" 

The recons_image.py script approximates manifold using the train split of CelebA dataset and ends within 24 hours on one Nvidia GeForce RTX 2080 GPU. Outputs (e.g., trained models, logs) will by default be saved in output/CelebA_cacheline. You can customize the output directory by setting --output_root="/path/to/output/". The procedure is same for other media data (i.e., audio, text).

Once the desired manifold is constructed, run

python output.py --exp_name="CelebA_cacheline" --dataset="CelebA" --side="cacheline"

to reconstruct unknown face photos (i.e., the test split). The reconstructed face photos will by default be saved in output/CelebA_cacheline/recons/. This procedure is also same for audio and text data.

We use Face++ to assess the similarity of IDs between reconstructed and reference face photos. The online service is free at the time of writing, so you can register your own account. Then set the key and secret variables in code/face_similarity.py and run

python face_similarity.py --recons_dir="../output/CelebA_cacheline/recons/" --target_dir="../output/CelebA_cacheline/target/" --output_path="../output/CelebA_cacheline/simillarity.txt"

The results will by default be saved in output/CelebA_cacheline/simillarity.txt.

For ChestX-ray images, we use this tool to check the consistency between disease information of reconstructed reference images.

You can also evaluate the similarity between reconstructed and reference images by running tool/SSIM.py.

python SSIM.py --K=1 --N=100 --recons_dir="../output/CelebA_cacheline/recons/" --target_dir="../output/CelebA_cacheline/target/" --output_path="../output/CelebA_cacheline/SSIM.txt"

The evaluation methods of audio data and text data are implemented in code/recons_audio.py and code/recons_text.py respectively. Note that the reconstructed audios are in the LMS representation, to get the raw audio (i.e., .wav format), run

python lms2audio.py --input_dir="/path/to/lms" --output_dir="/path/to/wav"

If you want to use your customrized dataset, write your dataset class in code/data_loader.py.

We also provide our trained models.

4. Program Point Localization

Once you successfully perform side channel attacks on the target softwares, you can localize the side channel vulnerabilities.

First customize the following variables in code/data_path.json.

{ 
    "dataset_name": {
        "pin": "/path/to/pintool_output/",
        },
}

Then go to code and run.

python localize.py --exp_name="CelebA_cacheline" --dataset="CelebA" --side="cacheline"

The output .json file will be saved in output/CelebA_cacheline/localize. The results are organized as

{
    "function_name; assmbly instruction; instruction address": "count",
}

The results of media software (e.g., libjpeg) processing different data (e.g., CelebA and ChestX-ray) are mostly consistent. We provide our localized vulnerabilities.

5. Perception Blinding

We provid the blinded media data here.

To blind media data, go to code and run

python blind_add.py --meida="{image} or {audio} or {text}" --mask_weight=0.9 --mask="{mask_word} or {/path/to/mask_image_or_audio}" --input_dir="{/path/to/text.json} or {/foler/of/image_or_audio/}" --output_dir="{/path/to/text.json} or {/foler/of/image_or_audio/}"

To unblind media data, run

python blind_subtract.py --meida="{image} or {audio} or {text}" --mask_weight=0.9 --mask="{mask_word} or {/path/to/mask_image_or_audio}" --input_dir="{/path/to/text.json} or {/foler/of/image_or_audio/}" --output_dir="{/path/to/text.json} or {/foler/of/image_or_audio/}"

Run

python output_blind.py --dataset="CelebA"
# or
python output_blind.py --dataset="ChestX-ray"

to see reconstructed media data from side channels corresponding to blinded data.

6. Attack with Prime+Probe

We use Mastik (Ver. 0.02) to launch Prime+Probe on L1 cache of Intel Xeon CPU and AMD Ryzen CPU. We provide our scripts in prime_probe/Mastik. After downloading Mastik, you can put our scripts in the demo folder and run make in the root folder to compile our scripts. We highly recommend you to set the cache miss threshold in these scripts according to your machines.

The Prime+Probe is launched in Linux OS. You need first to install taskset and cpuset.

6.1. Prepare Data

We assume victim and spy are on the same CPU core and no other process is runing on this CPU core. To attack libjpeg, you need to first customize the following variables in code/prime_probe/coord_image.py

  • pp_exe - Path to the executable file of our prime+probe script.
  • input_dir - Directory of media data.
  • side_dir - Directory where the collected cache set accesses will be saved.
  • libjpeg_path - Path to the executable file of libjpeg.
  • TRY_NUM - Repeating times of processing one media using the target software.
  • PAD_LEN - The length that the collected trace will be padded to.

The script coord_image.py is the coordinator which runs spy and victim on the same CPU core simultaneously and saves the collected cache set access.

Then run

sudo cset shield --cpu {cpu_id}

to isolate one CPU core. Once the CPU core is isolated, you can run

sudo cset shield --exec python run_image.py -- {cpu_id} {segment_id}

The script run_image.py will run coord_image.py using taskset. Note that we seperate the media data into several segments to speed up the side channel collection. The segment_id starts from 0. The procedure is same for other media data.

We provide our logged side channels.

6.2. Reconstruct Private Media Data

First customize the following variables in code/data_path.json.

{ 
    "dataset_name": {
        "pp-intel-dcache": "/path/to/intel_l1_dcache",
        "pp-intel-icache": "/path/to/intel_l1_icache",
        "pp-amd-dcache": "/path/to/amd_l1_dcache",
        "pp-amd-icache": "/path/to/amd_l1_icache",
        },
}

Then run

python pp_image.py --exp_name="CelebA_pp" --dataset="CelebA" --cpu="intel" --cache="dcache"

to approximate the manifold. To reconstruct unknonw images from the collected cache set accesses, uncomment

engine.load_model("/path/to/model.pth")
engine.inference(test_loader, "test")

in pp_image.py. The reconstructed images will be saved in output/CelebA_pp/recons. Follow the same procedure for other media data.

We release our trained models and all SC09 audios reconstructed from side channels collected by Prime+Probe on Intel L1 D cache.

7. Noise Resilience

We have the following noise insertion schemes (see more details in our paper).

Pin logged trace Prime+Probe logged trace
Gaussian Leave out
Shifting False hit/miss
Removal Wrong order

To insert the "shifting" noise into pin logged trace, go to code and run

python output_noise.py --exp_name="CelebA_cacheline" --dataset="CelebA" --side="cacheline" --noise_op="shift" --noise_k=100

Images reconstructed from noisy cache line records will be saved in output/CelebA_cacheline/recons_noise by default.

To insert the "wrong order" noise into prime+probe logged trace, you need to modify code/pp_image.py as

# test_dataset = RealSideDataset(args, split=args.data_path[args.dataset]["split"][1])
test_dataset = NoisyRealSideDataset(args, split=args.data_path[args.dataset]["split"][1])

and uncomment

engine.load_model("/path/to/model.pth")
engine.inference(test_loader, "test")

to reconstruct unknown images from noisy side channel records.

The procedure is same for other media data. Note that in order to assess the noise resilience, you should NOT approximate manifold (i.e., train the model) using the noisy side channel.

8. Customization

All parameters are set in code/params.py. You can customize the hyper-parameters for approximating manifold.

All datasets are implemented with OOP manner in code/data_loader.py. You can modify the dataset class to support your own data.

All models are also implemented with OOP manner in code/model.py. You can build a new framework from new models.

Citation

TBA.

If you have any questions, feel free to contact with me ([email protected]).

Owner
Yuanyuan Yuan
Yuanyuan Yuan
This is the repository for Learning to Generate Piano Music With Sustain Pedals

SusPedal-Gen This is the official repository of Learning to Generate Piano Music With Sustain Pedals Demo Page Dataset The dataset used in this projec

Joann Ching 12 Sep 02, 2022
This is my codes that can visualize the psnr image in testing videos.

CVPR2018-Baseline-PSNRplot This is my codes that can visualize the psnr image in testing videos. Future Frame Prediction for Anomaly Detection – A New

Wenhao Yang 12 May 29, 2021
Expressive Body Capture: 3D Hands, Face, and Body from a Single Image

Expressive Body Capture: 3D Hands, Face, and Body from a Single Image [Project Page] [Paper] [Supp. Mat.] Table of Contents License Description Fittin

Vassilis Choutas 1.3k Jan 07, 2023
Human Detection - Pedestrian Detection using OpenCV Python

Pedestrian Detection using OpenCV Python Follow us on Instagram for Machine Lear

Hrishikesh Dutta 1 Jan 23, 2022
PECOS - Prediction for Enormous and Correlated Spaces

PECOS - Predictions for Enormous and Correlated Output Spaces PECOS is a versatile and modular machine learning (ML) framework for fast learning and i

Amazon 387 Jan 04, 2023
YOLOX-RMPOLY

本算法为适应robomaster比赛,而改动自矩形识别的yolox算法。 基于旷视科技YOLOX,实现对不规则四边形的目标检测 TODO 修改onnx推理模型 更改/添加标注: 1.yolox/models/yolox_polyhead.py: 1.1继承yolox/models/yolo_

3 Feb 25, 2022
tensorrt int8 量化yolov5 4.0 onnx模型

onnx模型转换为 int8 tensorrt引擎

123 Dec 28, 2022
Code and models for ICCV2021 paper "Robust Object Detection via Instance-Level Temporal Cycle Confusion".

Robust Object Detection via Instance-Level Temporal Cycle Confusion This repo contains the implementation of the ICCV 2021 paper, Robust Object Detect

Xin Wang 69 Oct 13, 2022
Self-Supervised Collision Handling via Generative 3D Garment Models for Virtual Try-On

Self-Supervised Collision Handling via Generative 3D Garment Models for Virtual Try-On [Project website] [Dataset] [Video] Abstract We propose a new g

71 Dec 24, 2022
A library for preparing, training, and evaluating scalable deep learning hybrid recommender systems using PyTorch.

collie_recs Collie is a library for preparing, training, and evaluating implicit deep learning hybrid recommender systems, named after the Border Coll

ShopRunner 97 Jan 03, 2023
Unofficial implementation of "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (https://arxiv.org/abs/2103.14030)

Swin-Transformer-Tensorflow A direct translation of the official PyTorch implementation of "Swin Transformer: Hierarchical Vision Transformer using Sh

52 Dec 29, 2022
Towards Ultra-Resolution Neural Style Transfer via Thumbnail Instance Normalization

Towards Ultra-Resolution Neural Style Transfer via Thumbnail Instance Normalization Official PyTorch implementation for our URST (Ultra-Resolution Sty

czczup 148 Dec 27, 2022
Deploy optimized transformer based models on Nvidia Triton server

🤗 Hugging Face Transformer submillisecond inference 🤯 and deployment on Nvidia Triton server Yes, you can perfom inference with transformer based mo

Lefebvre Sarrut Services 1.2k Jan 05, 2023
Neural network chess engine trained on Gary Kasparov's games.

Neural Chess It's not the best chess engine, but it is a chess engine. Proof of concept neural network chess engine (feed-forward multi-layer perceptr

3 Jun 22, 2022
This script scrapes and stores the availability of timeslots for Car Driving Test at all RTA Serivce NSW centres in the state.

This script scrapes and stores the availability of timeslots for Car Driving Test at all RTA Serivce NSW centres in the state. Dependencies Account wi

Balamurugan Soundararaj 21 Dec 14, 2022
DCGAN LSGAN WGAN-GP DRAGAN PyTorch

Recommendation Our GAN based work for facial attribute editing - AttGAN. News 8 April 2019: We re-implement these GANs by Tensorflow 2! The old versio

Zhenliang He 408 Nov 30, 2022
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.

Pyserini Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Retrieval using sparse re

Castorini 706 Dec 29, 2022
Official PyTorch implementation of the paper "Recycling Discriminator: Towards Opinion-Unaware Image Quality Assessment Using Wasserstein GAN", accepted to ACM MM 2021 BNI Track.

RecycleD Official PyTorch implementation of the paper "Recycling Discriminator: Towards Opinion-Unaware Image Quality Assessment Using Wasserstein GAN

Yunan Zhu 23 Nov 05, 2022
Official pytorch code for "APP: Anytime Progressive Pruning"

APP: Anytime Progressive Pruning Diganta Misra1,2,3, Bharat Runwal2,4, Tianlong Chen5, Zhangyang Wang5, Irina Rish1,3 1 Mila - Quebec AI Institute,2 L

Landskape AI 12 Nov 22, 2022
Readings for "A Unified View of Relational Deep Learning for Polypharmacy Side Effect, Combination Therapy, and Drug-Drug Interaction Prediction."

Polypharmacy - DDI - Synergy Survey The Survey Paper This repository accompanies our survey paper A Unified View of Relational Deep Learning for Polyp

AstraZeneca 79 Jan 05, 2023