Code for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned Language Models in the wild .

Overview

🌳 Fingerprinting Fine-tuned Language Models in the wild

This is the code and dataset for our ACL 2021 (Findings) Paper - Fingerprinting Fine-tuned Language Models in the wild .

Clone the repo

git clone https://github.com/LCS2-IIITD/ACL-FFLM.git
pip3 install -r requirements.txt 

Dataset

The dataset includes both organic and synthetic text.

  • Synthetic -

    Collected from posts of r/SubSimulatorGPT2. Each user on the subreddit is a GPT2 small (345 MB) bot that is fine-tuned on 500k posts and comments from a particular subreddit (e.g., r/askmen, r/askreddit,r/askwomen). The bots generate posts on r/SubSimulatorGPT2, starting off with the main post followed by comments (and replies) from other bots. The bots also interact with each other by using the synthetic text in the preceding comment/reply as their prompt. In total, the sub-reddit contains 401,214 comments posted between June 2019 and January 2020 by 108 fine-tuned GPT2 LMs (or class).

  • Organic -

    Collected from comments of 108 subreddits the GPT2 bots have been fine-tuned upon. We randomly collected about 2000 comments between the dates of June 2019 - Jan 2020.

The complete dataset is available here. Download the dataset as follows -

  1. Download the 2 folders organic and synthetic, containing the comments from individual classes.
  2. Store them in the data folder in the following format.
data
├── organic
├── synthetic
└── authors.csv

Note -
For the below TL;DR run you also need to download dataset.json and dataset.pkl files which contain pre-processed data.
Organize them in the dataset/synthetic folder as follows -

dataset
├── organic
├── synthetic
  ├── splits (Folder already present)
    ├── 6 (Folder already present)
      └── 108_800_100_200_dataset.json (File already present)
  ├── dataset.json (to be added via drive link)
  └── dataset.pkl (to be added via drive link)
└── authors.csv (File already present)

108_800_100_200_dataset.json is a custom dataset which contains the comment ID's, the labels and their separation into train, test and validation splits.
Upon running the models, the comments for each split are fetched from the dataset.json using the comment ID's in the 108_800_100_200_dataset.json file .

Running the code

TL;DR

You can skip the pre-processing and the Create Splits if you want to run the code on some custom datasets available in the dataset/synthetic....splits folder. Make sure to follow the instructions mentioned in the Note of the Dataset section for the setting up the dataset folders.

Pre-process the dataset

First, we pre-process the complete dataset using the data present in the folder create-splits. Select the type of data (organic/synthetic) you want to pre-process using the parameter synthetic in the file. By deafult the parameter is set for synthetic data i.e True. This would create a pre-processed dataset.json and dataset.pkl files in the dataset/[organic OR synthetic] folder.

Create Train, Test and Validation Splits

We create splits of train, test and validation data. The parameters such as min length of sentences (default 6), lowercase sentences, size of train (max and default 800/class), validation (max and default 100/class) and test (max and default 200/class),number of classes (max and default 108) can be set internally in the create_splits.py in the splits folder under the commented PARAMETERS Section.

cd create-splits.py
python3 create_splits.py

This creates a folder in the folder dataset/synthetic/splits/[min_len_of_sentence/min_nf_tokens = 6]/. The train, validation and test datasets are all stored in the same file with the filename [#CLASSES]_[#TRAIN_SET_SIZE]_[#VAL_SET_SIZE]_[#TEST_SET_SIZE]_dataset.json like 108_800_100_200_dataset.json.

Running the model

Now fix the same parameters in the seq_classification.py file. To train and test the best model (Fine-tuned GPT2/ RoBERTa) -

cd models/generate-embed/ft/
python3 seq_classification.py 

A results folder will be generated which will contain the results of each epoch.

Note -
For the other models - pretrained and writeprints, first generate the embeddings using the files in the folders models/generate-embed/[pre-trained or writeprints]. The generated embeddings are stored in the results/generate-embed folder. Then, use the script in the models/classifiers/[pre-trained or writeprints] to train sklearn classifiers on generated embeddings. The final results will be in the results/classifiers/[pre-trained or writeprints] folder.

👪 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. For any detailed clarifications/issues, please email to nirav17072[at]iiitd[dot]ac[dot]in .

⚖️ License

MIT

Owner
LCS2-IIITDelhi
Laboratory for Computation Social Systems (LCS2) is a research group led by Dr. Tanmoy Chakraborty and Dr. Md. Shad Akhtar at IIIT-D
LCS2-IIITDelhi
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis Jungil Kong, Jaehyeon Kim, Jaekyoung Bae In our paper, we p

Jungil Kong 1.1k Jan 02, 2023
Conversational-AI-ChatBot - Intelligent ChatBot built with Microsoft's DialoGPT transformer to make conversations with human users!

Conversational AI ChatBot Intelligent ChatBot built with Microsoft's DialoGPT transformer to make conversations with human users! In this project? Thi

Rajkumar Lakshmanamoorthy 6 Nov 30, 2022
Python library to make development of portfolio analysis faster and easier

Trafalgar Python library to make development of portfolio analysis faster and easier Installation 🔥 For the moment, Trafalgar is still in beta develo

Santosh Passoubady 641 Jan 01, 2023
Dé op-de-vlucht Pieton vertaler. Wereldwijd gebruikt door meer dan 1.000+ succesvolle bedrijven!

Dé op-de-vlucht Pieton vertaler. Wereldwijd gebruikt door meer dan 1.000+ succesvolle bedrijven!

Lau 1 Dec 17, 2021
LewusBot - Twitch ChatBot built in python with twitchio library

LewusBot Twitch ChatBot built in python with twitchio library. Uses twitch/leagu

Lewus 25 Dec 04, 2022
Collection of useful (to me) python scripts for interacting with napari

Napari scripts A collection of napari related tools in various state of disrepair/functionality. Browse_LIF_widget.py This module can be imported, for

5 Aug 15, 2022
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/

Texar is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar provides

ASYML 2.3k Jan 07, 2023
Twewy-discord-chatbot - Build a Discord AI Chatbot that Speaks like Your Favorite Character

Build a Discord AI Chatbot that Speaks like Your Favorite Character! This is a Discord AI Chatbot that uses the Microsoft DialoGPT conversational mode

Lynn Zheng 231 Dec 30, 2022
Easy to use, state-of-the-art Neural Machine Translation for 100+ languages

EasyNMT - Easy to use, state-of-the-art Neural Machine Translation This package provides easy to use, state-of-the-art machine translation for more th

Ubiquitous Knowledge Processing Lab 748 Jan 06, 2023
code for modular summarization work published in ACL2021 by Krishna et al

This repository contains the code for running modular summarization pipelines as described in the publication Krishna K, Khosla K, Bigham J, Lipton ZC

Approximately Correct Machine Intelligence (ACMI) Lab 21 Nov 24, 2022
Klexikon: A German Dataset for Joint Summarization and Simplification

Klexikon: A German Dataset for Joint Summarization and Simplification Dennis Aumiller and Michael Gertz Heidelberg University Under submission at LREC

Dennis Aumiller 8 Jan 03, 2023
Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation (SIGGRAPH Asia 2021)

Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation This repository contains the implementation of the following paper: Live Speech

OldSix 575 Dec 31, 2022
Switch spaces for knowledge graph embeddings

SwisE Switch spaces for knowledge graph embeddings. Requirements: python3 pytorch numpy tqdm Reproduce the results To reproduce the reported results,

Shuai Zhang 4 Dec 01, 2021
Based on 125GB of data leaked from Twitch, you can see their monthly revenues from 2019-2021

Twitch Revenues Bu script'i kullanarak istediğiniz yayıncıların, Twitch'den sızdırılan 125 GB'lik veriye dayanarak, 2019-2021 arası aylık gelirlerini

4 Nov 11, 2021
An end to end ASR Transformer model training repo

END TO END ASR TRANSFORMER 本项目基于transformer 6*encoder+6*decoder的基本结构构造的端到端的语音识别系统 Model Instructions 1.数据准备: 自行下载数据,遵循文件结构如下: ├── data │ ├── train │

旷视天元 MegEngine 10 Jul 19, 2022
A versatile token stream for handwritten parsers.

Writing recursive-descent parsers by hand can be quite elegant but it's often a bit more verbose than expected, especially when it comes to handling indentation and reporting proper syntax errors. Th

Valentin Berlier 8 Nov 30, 2022
Pytorch-version BERT-flow: One can apply BERT-flow to any PLM within Pytorch framework.

Pytorch-version BERT-flow: One can apply BERT-flow to any PLM within Pytorch framework.

Ubiquitous Knowledge Processing Lab 59 Dec 01, 2022
BookNLP, a natural language processing pipeline for books

BookNLP BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including: Part-of-speech taggin

654 Jan 02, 2023
基于pytorch_rnn的古诗词生成

pytorch_peot_rnn 基于pytorch_rnn的古诗词生成 说明 config.py里面含有训练、测试、预测的参数,更改后运行: python main.py 预测结果 if config.do_predict: result = trainer.generate('丽日照残春')

西西嘛呦 3 May 26, 2022
ACL'2021: Learning Dense Representations of Phrases at Scale

DensePhrases DensePhrases is an extractive phrase search tool based on your natural language inputs. From 5 million Wikipedia articles, it can search

Princeton Natural Language Processing 540 Dec 30, 2022