
Overview

🌳 Fingerprinting Fine-tuned Language Models in the wild

This is the code and dataset for our ACL 2021 (Findings) paper - Fingerprinting Fine-tuned Language Models in the wild.

Clone the repo

git clone https://github.com/LCS2-IIITD/ACL-FFLM.git
pip3 install -r requirements.txt 

Dataset

The dataset includes both organic and synthetic text.

  • Synthetic -

    Collected from posts on r/SubSimulatorGPT2. Each user on the subreddit is a GPT2 small (345 MB) bot fine-tuned on 500k posts and comments from a particular subreddit (e.g., r/askmen, r/askreddit, r/askwomen). The bots generate posts on r/SubSimulatorGPT2, starting with the main post, followed by comments (and replies) from other bots. The bots also interact with each other by using the synthetic text in the preceding comment/reply as their prompt. In total, the subreddit contains 401,214 comments posted between June 2019 and January 2020 by 108 fine-tuned GPT2 LMs (one class per bot).

  • Organic -

    Collected from comments of the 108 subreddits that the GPT2 bots were fine-tuned on. We randomly collected about 2,000 comments posted between June 2019 and January 2020.

The complete dataset is available here. Download the dataset as follows -

  1. Download the two folders organic and synthetic, which contain the comments for the individual classes.
  2. Store them in the data folder in the following format.
data
├── organic
├── synthetic
└── authors.csv

Note -
For the TL;DR run below, you also need to download the dataset.json and dataset.pkl files, which contain the pre-processed data.
Organize them in the dataset/synthetic folder as follows -

dataset
├── organic
├── synthetic
│   ├── splits (Folder already present)
│   │   └── 6 (Folder already present)
│   │       └── 108_800_100_200_dataset.json (File already present)
│   ├── dataset.json (to be added via drive link)
│   └── dataset.pkl (to be added via drive link)
└── authors.csv (File already present)
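
A quick sanity check that everything is in place (a minimal sketch; the paths simply follow the layout above):

import os

# Paths follow the dataset layout shown above.
required = [
    "dataset/authors.csv",
    "dataset/synthetic/splits/6/108_800_100_200_dataset.json",
    "dataset/synthetic/dataset.json",   # downloaded via the drive link
    "dataset/synthetic/dataset.pkl",    # downloaded via the drive link
]
for path in required:
    print(("OK      " if os.path.exists(path) else "MISSING ") + path)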

108_800_100_200_dataset.json is a custom dataset file that contains the comment IDs, the labels, and their separation into train, test, and validation splits.
When the models are run, the comments for each split are fetched from dataset.json using the comment IDs listed in the 108_800_100_200_dataset.json file.
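
For reference, this is roughly how a split file and dataset.json fit together (a minimal sketch; the exact keys inside the JSON files are assumptions, not the schema used by the scripts):

import json

# Mapping from comment ID to comment text/label (assumed schema).
with open("dataset/synthetic/dataset.json") as f:
    dataset = json.load(f)

# Split definition: which comment IDs fall into train/validation/test.
with open("dataset/synthetic/splits/6/108_800_100_200_dataset.json") as f:
    splits = json.load(f)

# Fetch the actual comments for one split using the stored IDs.
train_ids = splits["train"]                      # assumed key name
train_comments = [dataset[cid] for cid in train_ids]
print(len(train_comments), "training comments")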

Running the code

TL;DR

You can skip the Pre-processing and Create Splits steps if you want to run the code on the custom datasets already available in the dataset/synthetic/splits folder. Make sure to follow the instructions in the Note of the Dataset section for setting up the dataset folders.

Pre-process the dataset

First, we pre-process the complete dataset using the code in the create-splits folder. Select the type of data (organic/synthetic) you want to pre-process via the synthetic parameter in the file; by default, it is set to True, i.e., synthetic data. This creates pre-processed dataset.json and dataset.pkl files in the dataset/[organic OR synthetic] folder.
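
Conceptually, the pre-processing boils the raw comment folders down to a single ID-indexed dictionary. The sketch below illustrates the idea under assumed file names and fields; the script in create-splits is the source of truth:

import json
import os
import pickle

synthetic = True                                        # parameter described above
src = "data/synthetic" if synthetic else "data/organic"
dst = "dataset/synthetic" if synthetic else "dataset/organic"

dataset = {}
for class_name in os.listdir(src):                      # one sub-folder per class (bot/subreddit)
    class_dir = os.path.join(src, class_name)
    for fname in os.listdir(class_dir):
        with open(os.path.join(class_dir, fname)) as f:
            comment = json.load(f)                      # assumed: one JSON comment per file
        dataset[comment["id"]] = {"text": comment["body"], "label": class_name}

os.makedirs(dst, exist_ok=True)
with open(os.path.join(dst, "dataset.json"), "w") as f:
    json.dump(dataset, f)
with open(os.path.join(dst, "dataset.pkl"), "wb") as f:
    pickle.dump(dataset, f)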

Create Train, Test and Validation Splits

We create train, test, and validation splits of the data. Parameters such as the minimum sentence length (default 6), whether to lowercase sentences, the size of the train (max and default 800/class), validation (max and default 100/class), and test (max and default 200/class) sets, and the number of classes (max and default 108) can be set internally in create_splits.py (in the create-splits folder), under the commented PARAMETERS section.

cd create-splits
python3 create_splits.py

This creates a folder dataset/synthetic/splits/[min_len_of_sentence/min_nf_tokens = 6]/. The train, validation, and test splits are all stored in a single file named [#CLASSES]_[#TRAIN_SET_SIZE]_[#VAL_SET_SIZE]_[#TEST_SET_SIZE]_dataset.json, e.g., 108_800_100_200_dataset.json.
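
The output location follows directly from those parameters; with the defaults listed above:

import os

# Default PARAMETERS values (see create_splits.py).
n_classes, train_size, val_size, test_size, min_len = 108, 800, 100, 200, 6

out_dir = os.path.join("dataset", "synthetic", "splits", str(min_len))
out_file = f"{n_classes}_{train_size}_{val_size}_{test_size}_dataset.json"
print(os.path.join(out_dir, out_file))   # dataset/synthetic/splits/6/108_800_100_200_dataset.json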

Running the model

Now set the same parameters in the seq_classification.py file. To train and test the best-performing model (fine-tuned GPT2/RoBERTa) -

cd models/generate-embed/ft/
python3 seq_classification.py 

A results folder will be generated which will contain the results of each epoch.
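
For orientation, here is a minimal, self-contained sketch of the same fine-tuning idea with Hugging Face Transformers; the model choice, hyper-parameters, and placeholder data below are assumptions, not the exact settings used in seq_classification.py:

import torch
from transformers import (RobertaForSequenceClassification, RobertaTokenizerFast,
                          Trainer, TrainingArguments)

NUM_CLASSES = 108                       # one class per fine-tuned GPT2 bot

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=NUM_CLASSES)

class CommentDataset(torch.utils.data.Dataset):
    """Wraps (text, label) pairs; in practice these come from the split file."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Placeholder data so the sketch runs end to end.
train_ds = CommentDataset(["an example comment", "another comment"], [0, 1])
val_ds = CommentDataset(["a held-out comment"], [0])

args = TrainingArguments(output_dir="results", num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds).train()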

Note -
For the other models (pre-trained and writeprints), first generate the embeddings using the files in models/generate-embed/[pre-trained or writeprints]. The generated embeddings are stored in the results/generate-embed folder. Then, use the scripts in models/classifiers/[pre-trained or writeprints] to train sklearn classifiers on the generated embeddings. The final results will be in the results/classifiers/[pre-trained or writeprints] folder.
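
A minimal sketch of that two-stage pipeline, using a frozen pre-trained encoder followed by an sklearn classifier; the encoder, pooling, and classifier choices here are illustrative assumptions rather than the repository's exact setup:

import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base").eval()

def embed(texts):
    """Mean-pool the last hidden state of the frozen encoder."""
    with torch.no_grad():
        enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
        hidden = encoder(**enc).last_hidden_state        # (batch, seq_len, hidden)
        return hidden.mean(dim=1).numpy()                # (batch, hidden)

# Placeholders; in practice the texts/labels come from the train/test splits.
train_texts, train_labels = ["an example comment", "another comment"], [0, 1]
test_texts = ["a held-out comment"]

clf = LogisticRegression(max_iter=1000).fit(embed(train_texts), train_labels)
print(clf.predict(embed(test_texts)))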

👪 Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. For any detailed clarifications/issues, please email nirav17072[at]iiitd[dot]ac[dot]in.

โš–๏ธ License

MIT
