Malware-Related Sentence Classification

This repo contains the code for the ICTAI 2021 paper "Enrichment of Features for Malware-Related Sentence Classification using External Knowledge".

Installation

Install from source. Using a Python virtual environment or a Conda environment is recommended.

git clone https://github.com/chaumng/malware_related_sentence_classification.git
cd malware_related_sentence_classification
pip install -r requirements.txt

This repo has been tested on Python 3.7.

Classification and Evaluation

Preprocess data

python preprocess_data.py

Parameter searching: Classify and evaluate

The GAT weak labels are already provided in a file in this repo. To run the parameter search, use the command below. By default, it performs the second grid search. Set the argument param_grid_setting to "first_grid_search" to perform the first grid search, or to "best_setting" to run only the best setting.

python svm_param_search.py --param_grid_setting second_grid_search
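
As a rough illustration of how a named setting such as param_grid_setting can select a parameter grid, here is a minimal sketch using only the standard library. It is not the code in svm_param_search.py: the grid values, the PARAM_GRIDS dictionary, and the helper names expand_grid and parse_args are all hypothetical, and only the argument name and its three accepted values come from this README.

```python
import argparse
from itertools import product

# Hypothetical grids for illustration only; the actual values used by
# svm_param_search.py are defined in the repository and may differ.
PARAM_GRIDS = {
    "first_grid_search": {"C": [0.1, 1, 10, 100], "kernel": ["linear", "rbf"]},
    "second_grid_search": {"C": [5, 10, 20], "kernel": ["rbf"]},
    "best_setting": {"C": [10], "kernel": ["rbf"]},
}


def expand_grid(grid):
    """Return every combination of the grid's values as a list of dicts."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


def parse_args(argv=None):
    """Parse the command line; the default mirrors the README's default."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--param_grid_setting",
        choices=sorted(PARAM_GRIDS),
        default="second_grid_search",
    )
    return parser.parse_args(argv)


def candidates_for(argv=None):
    """Map the chosen setting name to its list of candidate settings."""
    args = parse_args(argv)
    return expand_grid(PARAM_GRIDS[args.param_grid_setting])


# Example: list the candidates that a (hypothetical) first grid search would try.
for candidate in candidates_for(["--param_grid_setting", "first_grid_search"]):
    print(candidate)
```

In this sketch, "best_setting" is simply a grid with one value per parameter, so the same search loop can run a single configuration without a separate code path.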

Citation

If you find the paper or this code useful, please cite the paper:

@inproceedings{chaunguyen_et_al_2021,
  title={Enrichment of Features for Malware-Related Sentence Classification using External Knowledge},
  author={Nguyen, Chau and Tran, Vu and Nguyen, Le Minh},
  booktitle={Proceedings of the 33rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI)},
  year={2021},
  organization={IEEE},
}

Owner

Chau Nguyen