Supervised Contrastive Learning for Product Matching

Overview

Contrastive Product Matching

This repository contains the code and data download links to reproduce the experiments of the paper "Supervised Contrastive Learning for Product Matching" by Ralph Peeters and Christian Bizer; the paper is available on arXiv. A comparison of the results to other systems on different benchmark datasets can be found at Papers with Code - Entity Resolution.

  • Requirements

    Anaconda3

    Please keep in mind that the code is not optimized for portable or even ordinary non-workstation devices. Some of the scripts may require large amounts of RAM (64 GB+) and GPUs. We advise using a powerful workstation or server when experimenting with some of the larger files.

    The code has only been used and tested on Linux (CentOS) servers.

  • Building the conda environment

    To build the exact conda environment used for the experiments, navigate to the project root folder, where the file contrastive-product-matching.yml is located, and run conda env create -f contrastive-product-matching.yml

    Furthermore, you need to install the project as a package. To do this, activate the environment with conda activate contrastive-product-matching, navigate to the root folder of the project, and run pip install -e .
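
    Putting the environment setup together, the sequence of commands might look like this (path/to/repo stands in for the actual clone location on your system):

    cd path/to/repo                                        # project root containing the .yml file
    conda env create -f contrastive-product-matching.yml   # build the exact environment
    conda activate contrastive-product-matching
    pip install -e .                                       # install the project as an editable package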

  • Downloading the raw data files

    Navigate to the src/data/ folder and run python download_datasets.py to automatically download the files into the correct locations. You can find the downloaded data at data/raw/.

    If you are only interested in the separate datasets, you can download the WDC LSPC datasets and the deepmatcher splits for the abt-buy and amazon-google datasets on the respective websites.
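
    In short, the download step looks like this (again assuming the repository root is path/to/repo):

    cd path/to/repo/src/data
    python download_datasets.py    # places the raw files under data/raw/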

  • Processing the data

    To prepare the data for the experiments, run the following scripts in the given order (a consolidated sketch follows the list). Make sure to navigate to the respective folders first.

    1. src/processing/preprocess/preprocess_corpus.py
    2. src/processing/preprocess/preprocess_ts_gs.py
    3. src/processing/preprocess/preprocess_deepmatcher_datasets.py
    4. src/processing/contrastive/prepare_data.py
    5. src/processing/contrastive/prepare_data_deepmatcher.py
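
    A consolidated sketch of this preprocessing sequence, assuming each script is run from its containing folder as described above:

    cd path/to/repo/src/processing/preprocess
    python preprocess_corpus.py
    python preprocess_ts_gs.py
    python preprocess_deepmatcher_datasets.py
    cd ../contrastive
    python prepare_data.py
    python prepare_data_deepmatcher.py
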
  • Running the Contrastive Pre-training and Cross-entropy Fine-tuning

    Navigate to src/contrastive/

    You can find the respective scripts for running the experiments of the paper in the subfolders lspc/, abtbuy/, and amazongoogle/. Note that you need to adjust the file paths in these scripts for your system (replace your_path with the actual path to the repository).

    • Contrastive Pre-training

      To run contrastive pre-training for the abtbuy dataset, for example, use

      bash abtbuy/run_pretraining_clean_roberta.sh BATCH_SIZE LEARNING_RATE TEMPERATURE (AUG)

      You need to specify the batch size, learning rate, and temperature as arguments here. Optionally, you can also apply data augmentation by passing an augmentation method as the last argument (use all- for the augmentation used in the paper).

      For the WDC Computers data, you also need to supply the size of the training set, e.g.

      bash lspc/run_pretraining_roberta.sh BATCH_SIZE LEARNING_RATE TEMPERATURE TRAIN_SIZE (AUG)
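
      For illustration, concrete calls could look as follows; the hyperparameter values (batch size 1024, learning rate 5e-05, temperature 0.07) and the train size value (medium) are placeholders, not necessarily the settings used in the paper:

      bash abtbuy/run_pretraining_clean_roberta.sh 1024 5e-05 0.07 all-    # with the paper's augmentation
      bash lspc/run_pretraining_roberta.sh 1024 5e-05 0.07 medium all-     # WDC Computers additionally takes TRAIN_SIZE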

    • Cross-entropy Fine-tuning

      Finally, to use the pre-trained models for fine-tuning, run any of the fine-tuning scripts in the respective folders, e.g.

      bash abtbuy/run_finetune_siamese_frozen_roberta.sh BATCH_SIZE LEARNING_RATE TEMPERATURE (AUG)

      Please note that BATCH_SIZE refers to the batch size used in pre-training. The fine-tuning batch size is fixed at 64 but can be adjusted in the bash scripts if needed.

      Analogously, for fine-tuning on WDC Computers, add the training set size:

      bash lspc/run_finetune_siamese_frozen_roberta.sh BATCH_SIZE LEARNING_RATE TEMPERATURE TRAIN_SIZE (AUG)
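
      Continuing the placeholder example from the pre-training step, the matching fine-tuning calls could then look like:

      bash abtbuy/run_finetune_siamese_frozen_roberta.sh 1024 5e-05 0.07 all-       # arguments refer to the pre-training run
      bash lspc/run_finetune_siamese_frozen_roberta.sh 1024 5e-05 0.07 medium all-  # WDC Computers again takes TRAIN_SIZE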


Project based on the cookiecutter data science project template. #cookiecutterdatascience
