MAVE: : A Product Dataset for Multi-source Attribute Value Extraction

Related tags

Deep LearningMAVE
Overview

MAVE: : A Product Dataset for Multi-source Attribute Value Extraction

The dataset contains 3 million attribute-value annotations across 1257 unique categories created from 2.2 million cleaned Amazon product profiles. It is a large, multi-sourced, diverse dataset for product attribute extraction study.

More details can be found in paper: https://arxiv.org/abs/2112.08663

The dataset is in JSON Lines format, where each line is a json object with the following schema:

, "category": , "paragraphs": [ { "text": , "source": }, ... ], "attributes": [ { "key": , "evidences": [ { "value": , "pid": , "begin": , "end": }, ... ] }, ... ] }">
{
   "id": 
           
            ,
   "category": 
            
             ,
   "paragraphs": [
      {
         "text": 
             
              ,
         "source": 
              
               
      },
      ...
   ],
   "attributes": [
      {
         "key": 
               
                , "evidences": [ { "value": 
                
                 , "pid": 
                 
                  , "begin": 
                  
                   , "end": 
                   
                     }, ... ] }, ... ] } 
                   
                  
                 
                
               
              
             
            
           

The product id is exactly the ASIN number in the All_Amazon_Meta.json file in the Amazon Review Data (2018). In this repo, we don't store paragraphs, instead we only store the labels. To obtain the full version of the dataset contaning the paragraphs, we suggest to first request the Amazon Review Data (2018), then run our binary to clean its product metadata and join with the labels as described below.

A json object contains a product and multiple attributes. A concrete example is shown as follows

{
   "id":"B0002H0A3S",
   "category":"Guitar Strings",
   "paragraphs":[
      {
         "text":"D'Addario EJ26 Phosphor Bronze Acoustic Guitar Strings, Custom Light, 11-52",
         "source":"title"
      },
      {
         "text":".011-.052 Custom Light Gauge Acoustic Guitar Strings, Phosphor Bronze",
         "source":"description"
      },
      ...
   ],
   "attributes":[
      {
         "key":"Core Material",
         "evidences":[
            {
               "value":"Bronze Acoustic",
               "pid":0,
               "begin":24,
               "end":39
            },
            ...
         ]
      },
      {
         "key":"Winding Material",
         "evidences":[
            {
               "value":"Phosphor Bronze",
               "pid":0,
               "begin":15,
               "end":30
            },
            ...
         ]
      },
      {
         "key":"Gauge",
         "evidences":[
            {
               "value":"Light",
               "pid":0,
               "begin":63,
               "end":68
            },
            {
               "value":"Light Gauge",
               "pid":1,
               "begin":17,
               "end":28
            },
            ...
         ]
      }
   ]
}

In addition to positive examples, we also provide a set of negative examples, i.e. (product, attribute name) pairs without any evidence. The overall statistics of the positive and negative sets are as follows

Counts Positives Negatives
# products 2226509 1248009
# product-attribute pairs 2987151 1780428
# products with 1-2 attributes 2102927 1140561
# products with 3-5 attributes 121897 99896
# products with >=6 attributes 1685 7552
# unique categories 1257 1114
# unique attributes 705 693
# unique category-attribute pairs 2535 2305

Creating the full version of the dataset

In this repo, we only open source the labels of the MAVE dataset and the code to deterministically clean the original Amazon product metadata in the Amazon Review Data (2018), and join with the labels to generate the full version of the MAVE dataset. After this process, the attribute values, paragraph ids and begin/end span indices will be consistent with the cleaned product profiles.

Step 1

Gain access to the Amazon Review Data (2018) and download the All_Amazon_Meta.json file to the folder of this repo.

Step 2

Run script

./clean_amazon_product_metadata_main.sh

to clean the Amazon metadata and join with the positive and negative labels in the labels/ folder. The output full MAVE dataset will be stored in the reproduce/ folder.

The script runs the clean_amazon_product_metadata_main.py binary using an apache beam pipeline. The binary will run on a single CPU core, but distributed setup can be enabled by changing pipeline options. The binary contains all util functions used to clean the Amazon metadata and join with labels. The pipeline will finish within a few hours on a single Intel Xeon 3GHz CPU core.

Owner
Google Research Datasets
Datasets released by Google Research
Google Research Datasets
Alphabetical Letter Recognition

BayeesNetworks-Image-Classification Alphabetical Letter Recognition In these demo we are using "Bayees Networks" Our database is composed by Learning

Mohammed Firass 4 Nov 30, 2021
Learn about Spice.ai with in-depth samples

Samples Learn about Spice.ai with in-depth samples ServerOps - Learn when to run server maintainance during periods of low load Gardener - Intelligent

Spice.ai 16 Mar 23, 2022
[NeurIPS 2021] Towards Better Understanding of Training Certifiably Robust Models against Adversarial Examples | ⛰️⚠️

Towards Better Understanding of Training Certifiably Robust Models against Adversarial Examples This repository is the official implementation of "Tow

Sungyoon Lee 4 Jul 12, 2022
Tiny-NewsRec: Efficient and Effective PLM-based News Recommendation

Tiny-NewsRec The source codes for our paper "Tiny-NewsRec: Efficient and Effective PLM-based News Recommendation". Requirements PyTorch == 1.6.0 Tensor

Yang Yu 3 Dec 07, 2022
Repository for scripts and notebooks from the book: Programming PyTorch for Deep Learning

Repository for scripts and notebooks from the book: Programming PyTorch for Deep Learning

Ian Pointer 368 Dec 17, 2022
Official Pytorch implementation of Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations

Scene Representation Networks This is the official implementation of the NeurIPS submission "Scene Representation Networks: Continuous 3D-Structure-Aw

Vincent Sitzmann 365 Jan 06, 2023
SeMask: Semantically Masked Transformers for Semantic Segmentation.

SeMask: Semantically Masked Transformers Jitesh Jain, Anukriti Singh, Nikita Orlov, Zilong Huang, Jiachen Li, Steven Walton, Humphrey Shi This repo co

Picsart AI Research (PAIR) 186 Dec 30, 2022
Differentiable simulation for system identification and visuomotor control

gradsim gradSim: Differentiable simulation for system identification and visuomotor control gradSim is a unified differentiable rendering and multiphy

105 Dec 18, 2022
A simple PyTorch Implementation of Generative Adversarial Networks, focusing on anime face drawing.

AnimeGAN A simple PyTorch Implementation of Generative Adversarial Networks, focusing on anime face drawing. Randomly Generated Images The images are

Jie Lei 雷杰 1.2k Jan 03, 2023
Semantic-aware Grad-GAN for Virtual-to-Real Urban Scene Adaption

SG-GAN TensorFlow implementation of SG-GAN. Prerequisites TensorFlow (implemented in v1.3) numpy scipy pillow Getting Started Train Prepare dataset. W

lplcor 61 Jun 07, 2022
Article Reranking by Memory-enhanced Key Sentence Matching for Detecting Previously Fact-checked Claims.

MTM This is the official repository of the paper: Article Reranking by Memory-enhanced Key Sentence Matching for Detecting Previously Fact-checked Cla

ICTMCG 13 Sep 17, 2022
PyContinual (An Easy and Extendible Framework for Continual Learning)

PyContinual (An Easy and Extendible Framework for Continual Learning) Easy to Use You can sumply change the baseline, backbone and task, and then read

176 Jan 05, 2023
A Python library for Deep Probabilistic Modeling

Abstract DeeProb-kit is a Python library that implements deep probabilistic models such as various kinds of Sum-Product Networks, Normalizing Flows an

DeeProb-org 46 Dec 26, 2022
PyTorch implementation of probabilistic deep forecast applied to air quality.

Probabilistic Deep Forecast PyTorch implementation of a paper, titled: Probabilistic Deep Learning to Quantify Uncertainty in Air Quality Forecasting

Abdulmajid Murad 13 Nov 16, 2022
Numerical-computing-is-fun - Learning numerical computing with notebooks for all ages.

As much as this series is to educate aspiring computer programmers and data scientists of all ages and all backgrounds, it is also a reminder to mysel

EKA foundation 758 Dec 25, 2022
This is the code for the paper "Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, Tao Mei: Gait Recognition in the Wild with Dense 3D Representations and A Benchmark. (CVPR 2022)"

Gait3D-Benchmark This is the code for the paper "Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, Tao Mei: Gait Recognition in the Wild

82 Jan 04, 2023
😊 Python module for face feature changing

PyWarping Python module for face feature changing Installation pip install pywarping If you get an error: No such file or directory: 'cmake': 'cmake',

Dopevog 10 Sep 10, 2021
An pytorch implementation of Masked Autoencoders Are Scalable Vision Learners

An pytorch implementation of Masked Autoencoders Are Scalable Vision Learners This is a coarse version for MAE, only make the pretrain model, the fine

FlyEgle 214 Dec 29, 2022
Tensorflow port of a full NetVLAD network

netvlad_tf The main intention of this repo is deployment of a full NetVLAD network, which was originally implemented in Matlab, in Python. We provide

Robotics and Perception Group 225 Nov 08, 2022
Python version of the amazing Reaction Mechanism Generator (RMG).

Reaction Mechanism Generator (RMG) Description This repository contains the Python version of Reaction Mechanism Generator (RMG), a tool for automatic

Reaction Mechanism Generator 284 Dec 27, 2022