A Comprehensive Study on Learning-Based PE Malware Family Classification Methods

Datasets

Because of copyright issues, the MalwareBazaar and MalwareDrift datasets contain only the SHA-256 hash of each malware sample and its related information, all of which can be found in the Datasets folder. You can download the raw malware samples from the open-source malware release websites after applying for an API key, and then use a disassembly tool to convert the malware into binary and disassembly files.

The MalwareBazaar dataset: you can download the samples from MalwareBazaar.
The MalwareDrift dataset: you can download the samples from VirusShare.
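
Since the datasets ship only SHA-256 hashes, fetching a sample means sending one request per hash together with your API key. The following is a minimal sketch for MalwareBazaar only, not part of this repository. The endpoint URL, the `query=get_file` and `sha256_hash` form fields, and the `Auth-Key` header are assumptions based on the public MalwareBazaar API and should be checked against its current documentation. The helper name `build_download_request` is hypothetical. The sketch only builds the request and does not send it.

```python
import string
import urllib.parse
import urllib.request

# Assumed MalwareBazaar API endpoint; verify against the official API documentation.
API_URL = "https://mb-api.abuse.ch/api/v1/"


def build_download_request(sha256_hash: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for one sample by SHA-256 hash."""
    # A SHA-256 digest is exactly 64 hexadecimal characters.
    if len(sha256_hash) != 64 or any(c not in string.hexdigits for c in sha256_hash):
        raise ValueError("expected a 64-character hexadecimal SHA-256 hash")
    # Form fields are assumptions about the API, not taken from this repository.
    form = urllib.parse.urlencode(
        {"query": "get_file", "sha256_hash": sha256_hash.lower()}
    ).encode("ascii")
    return urllib.request.Request(
        API_URL,
        data=form,
        headers={"Auth-Key": api_key},  # assumed header name for the API key
        method="POST",
    )
```

To actually download, pass the returned object to `urllib.request.urlopen` inside an isolated analysis environment, since the response contains live malware.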

Experimental Settings

Model	Training Strategy	Optimizer	Learning Rate	Batch Size	Input Format
ResNet-50	From Scratch	Adam	1e-3	64	224*224 color image
ResNet-50	Transfer	Adam	1e-3	All data*	224*224 color image
VGG-16	From Scratch	SGD	5e-6**	64	224*224 color image
VGG-16	Transfer	SGD	5e-6	64	224*224 color image
Inception-V3	From Scratch	Adam	1e-3	64	224*224 color image
Inception-V3	Transfer	Adam	1e-3	All data	224*224 color image
IMCFN	From Scratch	SGD	5e-6***	32	224*224 color image
IMCFN	Transfer	SGD	5e-6***	32	224*224 color image
CBOW+MLP	-	SGD	1e-3	128	CBOW: byte sequences; MLP: 256*256 matrix
MalConv	-	SGD	1e-3	32	2MB raw byte values
MAGIC	-	Adam	1e-4	10	ACFG
Word2Vec+KNN	-	-	-	-	Word2Vec: Opcode sequences; KNN distance measure: WMD
MCSC	-	SGD	5e-3	64	Opcode sequences

* The batch size is set to 128 for the MalwareBazaar dataset
** The learning rate is set to 5e-5 for the Malimg dataset and 1e-5 for the MalwareBazaar dataset
*** The learning rate is set to 1e-5 for the MalwareBazaar dataset
CBOW uses the default parameters of the Word2Vec implementation in the Gensim library for Python.
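
The footnotes above act as per-dataset overrides on the table's base values. One way to keep the two in sync is sketched below. This is an illustration, not the repository's configuration code: the names `Settings`, `BASE`, `OVERRIDES`, and `settings_for` are hypothetical, and only a few rows of the table are encoded.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Settings:
    optimizer: str
    learning_rate: float
    batch_size: int


# Base values copied from the table (subset of rows, for illustration).
BASE = {
    ("ResNet-50", "scratch"): Settings("Adam", 1e-3, 64),
    ("VGG-16", "scratch"): Settings("SGD", 5e-6, 64),
    ("IMCFN", "scratch"): Settings("SGD", 5e-6, 32),
    ("IMCFN", "transfer"): Settings("SGD", 5e-6, 32),
}

# Per-dataset overrides taken from the ** and *** footnotes.
OVERRIDES = {
    ("VGG-16", "scratch", "Malimg"): {"learning_rate": 5e-5},
    ("VGG-16", "scratch", "MalwareBazaar"): {"learning_rate": 1e-5},
    ("IMCFN", "scratch", "MalwareBazaar"): {"learning_rate": 1e-5},
    ("IMCFN", "transfer", "MalwareBazaar"): {"learning_rate": 1e-5},
}


def settings_for(model: str, strategy: str, dataset: str) -> Settings:
    """Return the base settings with any dataset-specific override applied."""
    base = BASE[(model, strategy)]
    return replace(base, **OVERRIDES.get((model, strategy, dataset), {}))
```

For example, `settings_for("VGG-16", "scratch", "Malimg")` keeps the SGD optimizer and batch size 64 from the table but uses the footnoted learning rate of 5e-5.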

Graphical Analysis of Table 4 and Table 5

The figures in this section present the results of Table 4 and Table 5 graphically, to make the raw numbers in the paper easier to digest.

Table 4

The classification performance (F1-Score) of each approach on three datasets

The figure shows the classification performance (F1-Score) of each method on the three datasets. Note that the Malimg dataset contains only malware images, so it can only be used to evaluate the 4 image-based methods.

The average classification performance (F1-Score) of each approach for three datasets

The figure shows the average classification performance (F1-Score) of each method across the three datasets. The F1-Score reported for each model is the mean of that model's F1-Scores on the three datasets, and so represents its average performance.

The training time and resource overhead of each method on three datasets

The figure shows the training time (left subgraph) and resource overhead (right subgraph) of every method on the three datasets. For each model, the bar immediately to the right of the training-time bar shows its memory overhead. As before, only the 4 image-based models appear for the Malimg dataset.
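
The averaged F1-Score used in the second figure above is a plain mean over datasets. A minimal sketch of that computation follows. The function name `average_f1`, the model names, and all score values are placeholders, not results from the paper. The sketch averages over whichever datasets a model has scores for, which is an assumption: how models without Malimg results are averaged is not stated here.

```python
from statistics import mean


def average_f1(scores: dict) -> dict:
    """Map each model to the mean of its per-dataset F1-Scores."""
    return {model: mean(per_dataset.values()) for model, per_dataset in scores.items()}


# Placeholder values for illustration only; these are NOT results from the paper.
placeholder_scores = {
    "image_model": {"BIG-15": 0.90, "Malimg": 0.80, "MalwareBazaar": 0.70},
    "opcode_model": {"BIG-15": 0.60, "MalwareBazaar": 0.80},  # no Malimg score
}

averages = average_f1(placeholder_scores)
```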

Table 5

The classification performance (F1-Score) of transfer learning for image-based approaches on three datasets

This figure shows the F1-Score obtained by each image-based model when trained from scratch and with 10%, 50%, 80%, and 100% transfer learning. The subgraphs correspond to the BIG-15, Malimg, and MalwareBazaar datasets, respectively.

The training time and resource overhead of transfer learning for image-based approaches on three datasets

The rows correspond to the BIG-15, Malimg, and MalwareBazaar datasets, respectively. Each row contains 4 models (ResNet-50, VGG-16, Inception-V3, and IMCFN). Each model has 8 bars to its right: the left 4 bars show the training time under 10%, 50%, 80%, and 100% transfer learning, and the right 4 bars show the memory overhead under the same four settings.
