Web Scraping, Document Deduplication & GPT-2 Fine-tuning with a newly created scam dataset.

Last update: Nov 28, 2022

Overview

Neural Scam Artist

TL;DR
A dataset of scam emails is scraped from an anti-fraud website. The dataset is then deduplicated using MinHash and LSH. The deduplicated dataset is used for fine-tuning GPT-2.

Comic stolen from Agent-X Comics.

📖 Table of contents

➤ Project Description
➤ Shared Files
➤ Requirements
➤ Installation
➤ Usage

☁️ Project Description

Objective

The goal of this project is to create a new dataset of fraudulent emails that can advance research on intelligent email assistants.

Web Scraper

Data is scraped from the website https://antifraudintl.org/. First, a set of thread URLs is collected and stored. Then, each thread is searched for emails. For each thread, at most one email is kept, as the rest are duplicates. Metadata (Subject, Date, etc.) is removed. The resulting dataset is stored in a CSV file.
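The metadata removal and CSV export steps could look roughly like the sketch below. It is not the repository's code: the header field names in `HEADER_RE`, the helper names `strip_metadata` and `write_dataset`, and the single `text` column are all assumptions made for illustration, and it uses only the Python standard library.

```python
import csv
import io
import re

# Assumed header fields; the description above only names "Subject, Date, etc.".
HEADER_RE = re.compile(r"^(subject|date|from|to|reply-to)\s*:", re.IGNORECASE)


def strip_metadata(email_text):
    """Drop lines that look like header fields, then trim surrounding blank lines."""
    kept = [line for line in email_text.splitlines() if not HEADER_RE.match(line)]
    return "\n".join(kept).strip()


def write_dataset(bodies, handle):
    """Write one email body per row under a single 'text' column (assumed schema)."""
    writer = csv.writer(handle)
    writer.writerow(["text"])
    for body in bodies:
        writer.writerow([body])


raw = "Subject: Urgent transfer\nDate: Mon, 1 Jan 2018\n\nDear friend,\nI need your help."
body = strip_metadata(raw)

# An in-memory buffer stands in for the output file.
buffer = io.StringIO()
write_dataset([body], buffer)
```

A line-based filter like this also drops body lines that happen to start with one of the listed field names, so a real scraper would likely restrict it to the top of each message.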

Deduplication

To avoid the quadratic complexity of comparing every pair of documents, a cheap alternative is selected: MinHash and LSH using the datasketch library. For each document, this method efficiently locates its nearest neighbors. Because this leads to a large number of false negatives (i.e. duplicate documents that are classified as non-duplicates), the approach is extended by creating a duplicate graph. Nodes in this graph represent documents and are connected by an edge if their respective documents have been classified as duplicates. To deduplicate the dataset, the connected components of the graph are located and, for each component, only a single node is selected. A readability criterion is used for selection.
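The duplicate-graph step can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the duplicate pairs are hard-coded where the MinHash/LSH queries would normally supply them, the function names are hypothetical, and `readability` is a placeholder score (share of letters and whitespace), since the actual readability criterion is not specified here.

```python
from collections import defaultdict


def connected_components(num_docs, duplicate_pairs):
    """Group document indices into components of the duplicate graph (union-find)."""
    parent = list(range(num_docs))

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for doc in range(num_docs):
        groups[find(doc)].append(doc)
    return sorted(groups.values())


def readability(text):
    """Placeholder score (assumption): share of characters that are letters or whitespace."""
    if not text:
        return 0.0
    return sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)


def deduplicate(docs, duplicate_pairs, score=readability):
    """Keep the highest-scoring document from each connected component."""
    keep = [max(component, key=lambda i: score(docs[i]))
            for component in connected_components(len(docs), duplicate_pairs)]
    return [docs[i] for i in sorted(keep)]


docs = [
    "Dear friend, I need your help.",   # 0
    "Dear fr1end, I n33d your he1p.",   # 1
    "D3@r fr13nd, 1 n33d y0ur h3lp!!",  # 2
    "You have won the lottery.",        # 3
]
# Pairs as a nearest-neighbor query might report them; the pair (0, 2) is a missed duplicate.
pairs = [(0, 1), (1, 2)]
unique_docs = deduplicate(docs, pairs)
```

Documents 0 and 2 are never directly matched, yet they land in the same component through document 1, which is how the graph recovers from false negatives.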

GPT-2

A small pretrained GPT-2 model from the Huggingface library is fine-tuned on the deduplicated dataset. A collection of ~~cherry-picked~~ randomly selected generated samples can be found here.

📁 Shared Files

Resource	Size	#Samples	Link
Full dataset	128.5 MB	85,160	Link
Deduplicated dataset	74.2 MB	58,227	Link
Thread urls	6.4 MB	95,324	Link
GPT-2 Checkpoints	~1.5 GB		Link

🧰 Requirements

See requirements.txt.

⚙️ Installation

$ git clone https://github.com/davidsvy/Neural-Scam-Artist
$ cd Neural-Scam-Artist
$ pip install -r requirements.txt

🧻 Usage

To generate the dataset (~3 hours on Colab):

$ python create_dataset.py [-c configs/create_dataset.yaml]

To deduplicate the dataset (~30 minutes on Colab):

$ python deduplicate_dataset.py [-c configs/deduplicate_dataset.yaml]

To train GPT-2 (~3 hours/epoch on Colab with K80):

$ python gpt2_train.py [-c configs/gpt2_train.yaml]

To generate text with GPT-2:

$ python gpt2_sample.py [-c configs/gpt2_sample.yaml]
