MAVE: : A Product Dataset for Multi-source Attribute Value Extraction

Related tags

Deep LearningMAVE
Overview

MAVE: : A Product Dataset for Multi-source Attribute Value Extraction

The dataset contains 3 million attribute-value annotations across 1257 unique categories created from 2.2 million cleaned Amazon product profiles. It is a large, multi-sourced, diverse dataset for product attribute extraction study.

More details can be found in paper: https://arxiv.org/abs/2112.08663

The dataset is in JSON Lines format, where each line is a json object with the following schema:

, "category": , "paragraphs": [ { "text": , "source": }, ... ], "attributes": [ { "key": , "evidences": [ { "value": , "pid": , "begin": , "end": }, ... ] }, ... ] }">
{
   "id": 
           
            ,
   "category": 
            
             ,
   "paragraphs": [
      {
         "text": 
             
              ,
         "source": 
              
               
      },
      ...
   ],
   "attributes": [
      {
         "key": 
               
                , "evidences": [ { "value": 
                
                 , "pid": 
                 
                  , "begin": 
                  
                   , "end": 
                   
                     }, ... ] }, ... ] } 
                   
                  
                 
                
               
              
             
            
           

The product id is exactly the ASIN number in the All_Amazon_Meta.json file in the Amazon Review Data (2018). In this repo, we don't store paragraphs, instead we only store the labels. To obtain the full version of the dataset contaning the paragraphs, we suggest to first request the Amazon Review Data (2018), then run our binary to clean its product metadata and join with the labels as described below.

A json object contains a product and multiple attributes. A concrete example is shown as follows

{
   "id":"B0002H0A3S",
   "category":"Guitar Strings",
   "paragraphs":[
      {
         "text":"D'Addario EJ26 Phosphor Bronze Acoustic Guitar Strings, Custom Light, 11-52",
         "source":"title"
      },
      {
         "text":".011-.052 Custom Light Gauge Acoustic Guitar Strings, Phosphor Bronze",
         "source":"description"
      },
      ...
   ],
   "attributes":[
      {
         "key":"Core Material",
         "evidences":[
            {
               "value":"Bronze Acoustic",
               "pid":0,
               "begin":24,
               "end":39
            },
            ...
         ]
      },
      {
         "key":"Winding Material",
         "evidences":[
            {
               "value":"Phosphor Bronze",
               "pid":0,
               "begin":15,
               "end":30
            },
            ...
         ]
      },
      {
         "key":"Gauge",
         "evidences":[
            {
               "value":"Light",
               "pid":0,
               "begin":63,
               "end":68
            },
            {
               "value":"Light Gauge",
               "pid":1,
               "begin":17,
               "end":28
            },
            ...
         ]
      }
   ]
}

In addition to positive examples, we also provide a set of negative examples, i.e. (product, attribute name) pairs without any evidence. The overall statistics of the positive and negative sets are as follows

Counts Positives Negatives
# products 2226509 1248009
# product-attribute pairs 2987151 1780428
# products with 1-2 attributes 2102927 1140561
# products with 3-5 attributes 121897 99896
# products with >=6 attributes 1685 7552
# unique categories 1257 1114
# unique attributes 705 693
# unique category-attribute pairs 2535 2305

Creating the full version of the dataset

In this repo, we only open source the labels of the MAVE dataset and the code to deterministically clean the original Amazon product metadata in the Amazon Review Data (2018), and join with the labels to generate the full version of the MAVE dataset. After this process, the attribute values, paragraph ids and begin/end span indices will be consistent with the cleaned product profiles.

Step 1

Gain access to the Amazon Review Data (2018) and download the All_Amazon_Meta.json file to the folder of this repo.

Step 2

Run script

./clean_amazon_product_metadata_main.sh

to clean the Amazon metadata and join with the positive and negative labels in the labels/ folder. The output full MAVE dataset will be stored in the reproduce/ folder.

The script runs the clean_amazon_product_metadata_main.py binary using an apache beam pipeline. The binary will run on a single CPU core, but distributed setup can be enabled by changing pipeline options. The binary contains all util functions used to clean the Amazon metadata and join with labels. The pipeline will finish within a few hours on a single Intel Xeon 3GHz CPU core.

Owner
Google Research Datasets
Datasets released by Google Research
Google Research Datasets
My implementation of DeepMind's Perceiver

DeepMind Perceiver (in PyTorch) Disclaimer: This is not official and I'm not affiliated with DeepMind. My implementation of the Perceiver: General Per

Louis Arge 55 Dec 12, 2022
You Only Look One-level Feature (YOLOF), CVPR2021, Detectron2

You Only Look One-level Feature (YOLOF), CVPR2021 A simple, fast, and efficient object detector without FPN. This repo provides a neat implementation

qiang chen 273 Jan 03, 2023
Segmentation for medical image.

EfficientSegmentation Introduction EfficientSegmentation is an open source, PyTorch-based segmentation framework for 3D medical image. Features A whol

68 Nov 28, 2022
Title: Graduate-Admissions-Predictor

The purpose of this project is create a predictive model capable of identifying the probability of a person securing an admit based on their personal profile parameters. Simplified visualisations hav

Akarsh Singh 1 Jan 26, 2022
This game was designed to encourage young people not to gamble on lotteries, as the probablity of correctly guessing the number is infinitesimal!

Lottery Simulator 2022 for Web Launch Application Developed by John Seong in Ontario. This game was designed to encourage young people not to gamble o

John Seong 2 Sep 02, 2022
Recreate CenternetV2 based on MMDET.

Introduction This project is trying to Recreate CenternetV2 based on MMDET, which is proposed in paper Probabilistic two-stage detection. This project

25 Dec 09, 2022
A Python library for Deep Graph Networks

PyDGN Wiki Description This is a Python library to easily experiment with Deep Graph Networks (DGNs). It provides automatic management of data splitti

Federico Errica 194 Dec 22, 2022
MatchGAN: A Self-supervised Semi-supervised Conditional Generative Adversarial Network

MatchGAN: A Self-supervised Semi-supervised Conditional Generative Adversarial Network This repository is the official implementation of MatchGAN: A S

Justin Sun 12 Dec 27, 2022
basic tutorial on pytorch

Quick Tutorial on PyTorch PyTorch Basics Linear Regression Logistic Regression Artificial Neural Networks Convolutional Neural Networks Recurrent Neur

7 Sep 15, 2022
GND-Nets (Graph Neural Diffusion Networks) in TensorFlow.

GNDC For submission to IEEE TKDE. Overview Here we provide the implementation of GND-Nets (Graph Neural Diffusion Networks) in TensorFlow. The reposit

Wei Ye 3 Aug 08, 2022
The audio-video synchronization of MKV Container Format is exploited to achieve data hiding

The audio-video synchronization of MKV Container Format is exploited to achieve data hiding, where the hidden data can be utilized for various management purposes, including hyper-linking, annotation

Maxim Zaika 1 Nov 17, 2021
ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation.

ENet This work has been published in arXiv: ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. Packages: train contains too

e-Lab 344 Nov 21, 2022
Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks

MGANs Training & Testing code (torch), pre-trained models and supplementary materials for "Precomputed Real-Time Texture Synthesis with Markovian Gene

290 Nov 15, 2022
PyTorch implementation of normalizing flow models

PyTorch implementation of normalizing flow models

Vincent Stimper 242 Jan 02, 2023
Unofficial implementation of Proxy Anchor Loss for Deep Metric Learning

Proxy Anchor Loss for Deep Metric Learning Unofficial pytorch, tensorflow and mxnet implementations of Proxy Anchor Loss for Deep Metric Learning. Not

Geonmo Gu 3 Jun 09, 2021
Revisiting Weakly Supervised Pre-Training of Visual Perception Models

SWAG: Supervised Weakly from hashtAGs This repository contains SWAG models from the paper Revisiting Weakly Supervised Pre-Training of Visual Percepti

Meta Research 134 Jan 05, 2023
🔥 Real-time Super Resolution enhancement (4x) with content loss and relativistic adversarial optimization 🔥

🔥 Real-time Super Resolution enhancement (4x) with content loss and relativistic adversarial optimization 🔥

Rishik Mourya 48 Dec 20, 2022
Airborne magnetic data of the Osborne Mine and Lightning Creek sill complex, Australia

Osborne Mine, Australia - Airborne total-field magnetic anomaly This is a section of a survey acquired in 1990 by the Queensland Government, Australia

Fatiando a Terra Datasets 1 Jan 21, 2022
Testing and Estimation of structural breaks in Stata

xtbreak estimating and testing for many known and unknown structural breaks in time series and panel data. For an overview of xtbreak test see xtbreak

Jan Ditzen 13 Jun 19, 2022
GPU implementation of $k$-Nearest Neighbors and Shared-Nearest Neighbors

GPU implementation of kNN and SNN GPU implementation of $k$-Nearest Neighbors and Shared-Nearest Neighbors Supported by numba cuda and faiss library E

Hyeon Jeon 7 Nov 23, 2022