Materials to reproduce our findings in our stories, "Amazon Puts Its Own 'Brands' First Above Better-Rated Products" and "When Amazon Takes the Buy Box, it Doesn’t Give it up"

Overview

Amazon Brands and Exclusives

This repository contains code to reproduce the findings featured in our story "Amazon Puts Its Own 'Brands' First Above Better-Rated Products" and "When Amazon Takes the Buy Box, it Doesn’t Give it up" from our series Amazon's Advantage.

Our methodology is described in "How We Analyzed Amazon’s Treatment of Its Brands in Search Results".

Data that we collected and analyzed is in the data folder.
To use the full input dataset (which is not hosted here), please refer to Download data.

Jupyter notebooks used for data preprocessing and analysis are available in the notebooks folder.
Descriptions for each notebook are outlined in the Notebooks section below.

Installation

Python

Make sure you have Python 3.6+ installed. We used Miniconda to create a Python 3.8 virtual environment.

Then install the Python packages:
pip install -r requirements.txt

Notebooks

These notebooks are intended to be run sequentially, but they are not dependent on one another. If you want a quick overview of the methodology, you only need to concern yourself with the notebooks with an asterisk(*).

0-data-preprocessing.ipynb

This notebook parses Amazon search results and Amazon product pages, and produces the intermediary datasets (data/output/datasets/) used in ranking analysis and random forest classifiers.

1-data-analysis-search-results.ipynb *

Bulk of the ranking analysis and stats in the data analysis.

2-random-forest-analysis.ipynb *

Feature engineering training set, finding optimal hyperparameters, and performing the ablation study on a random forest model. The most predictive feature is verified using three separate methods.

3-survey-results.ipynb

Visualizing the survey results from our national panel of 1,000 adults.

4-limiations-product-page-changes.ipynb

Analysis of how often the Buy Box's default shipper and seller change between Amazon and a third party.

utils.py

Contains convenient functions used in the notebooks.

parsers.py

Contains parsers for search results and product pages.

Data

This directory is where inputs, intermediaries, and outputs are saved.

data
├── output
│   ├── figures
│   ├── tables
│   └── datasets
│       ├── amazon_private_label.csv.xz
│       ├── products.csv.xz
│       ├── searches.csv.xz
│       ├── training_set.csv.gz
│       ├── pairwise_training_set.csv.gz
│       └── trademarks
└── input
    ├── combined_queries_with_source.csv
    ├── best_sellers
    ├── generic_search_terms
    ├── search-private-label
    ├── search-selenium
    ├── search-selenium-our-brands-filter_
    ├── selenium-products
    ├── seller_central
    └── spotcheck

data/output/ contains tables, figures, and datasets used in our methodology.

data/output/datasets/amazon_private_label.csv.xz is our dataset of Amazon brands, exclusives, and proprietary electronics (N=137,428 products). We use each product's unique ID (called an ASIN) to identify Amazon's own products in our methodology.

data/output/datasets/trademarks contains a dataset of trademarked brands registered by Amazon. The data was collected from USPTO.gov and Amazon. We included an additional README with the exact steps we took to build this dataset in the directory.

data/output/datasets/searches.csv.xz parsed search result pages from top and generic searches (N=187,534 product positions). You can filter this by search_term for each of these subsets from data/input/combined_queries_with_source.csv.

data/output/datasets/products.csv.xz parsed product pages from the searches above (N=157,405 product pages).

data/output/training_set.csv.gz metadata used to train and evaluate the random forest. Additionally, feature engineering is conducted in notebooks/2-random-forest-analysis.ipynb, which produces pairwise_training_set.csv.gz.

Every file in data/input except combined_queries_with_source.csv is stored in AWS s3. Those are not hosted in this repository.

Download Data

You can find the raw inputs in data/input in s3://markup-public-data/amazon-brands/.

If you trust us, you can download the HTML and JSON files in data/input using this script: sh data/download_input_data.sh

Note this is not necessary to run notebooks and see full results.

data/input/search-selenium/ (12 GB uncompressed)

First page of search results collected in January 2021. Download the HTML files search-selenium.tar.xz (238 MB compressed) here.

data/input/selenium-products/ (220 GB uncompressed)

Product pages collected in February 2021. Download the HTML files selenium-products.tar.xz (9 GB compressed) here.

data/input/search-selenium-our-brands-filter_/ (35 GB uncompressed)

Search results filtered by "our brands". Contains every page of search results. Download search-selenium-our-brands-filter_.tar.xz (403 MB compressed) here.

data/input/search-private-label/ (25 GB uncompressed)

API responses for search results filtered down to products Amazon identifies as "our brands". Contains paginated API results. Download the JSON files search-private-label.tar.xz (402 MB uncompressed) here.

data/input/seller_central/ (105 MB)

Seller central data for Q4 2020. Download the CSV file All_Q4_2020.csv.xz (105 MB compressioned) here.

data/input/best_sellers/ (4 GB)

Amazon's best sellers under the category "Amazon Devices & Accessories". Download the HTML files best_sellers.tar.xz (60MB compressed) here.

data/input/spotcheck/ (4 GB)

A sub-sample of product pages for spot-checking Buy Box changes. Download the HTML files spotcheck.tar.xz (159 MB compressed) here.

A CLI tool to transfer, sync, and backup playlists on music streaming services

unitunes A command-line interface tool to manage playlists across music streaming services. Introduction unitunes manages playlists across streaming s

Victor Tao 50 Jan 07, 2023
A Python 2.7/3.x module for Amcrest Cameras using the SDK HTTP API.

A Python 2.7/3.x module for Amcrest Cameras using the SDK HTTP API. Amcrest and Dahua devices share similar firmwares. Dahua Cameras and NVRs also work with this module.

Marcelo Moreira de Mello 176 Dec 21, 2022
A Python wrapper around the OpenWeatherMap web API

PyOWM A Python wrapper around OpenWeatherMap web APIs What is it? PyOWM is a client Python wrapper library for OpenWeatherMap (OWM) web APIs. It allow

Claudio Sparpaglione 740 Dec 18, 2022
Ethone-Selfbot - Open Source Discord Self-Bot, written in discord.py

Ethone SB Table of contents Newest open-source Discord SelfBot with useful commands and easy documentation on how to add your own and change the exist

Ethone 3 Jan 08, 2022
A python script to send sms anonymously with SMS Gateway API. Works on command line terminal.

incognito-sms-sender A python script to send sms anonymously with SMS Gateway API. Works on command line terminal. Download and run script Go to API S

ʀᴇxɪɴᴀᴢᴏʀ 1 Oct 25, 2021
A feishu bot daily push arxiv latest articles.

arxiv-feishu-bot We develop A simple feishu bot script daily pushes arxiv latest articles. His effect is as follows: Of course, you can also use other

huchi 6 Apr 06, 2022
PYAW allows you to call assembly from python

PYAW allows you to call assembly from python

2 Dec 13, 2021
Use Seaborn to visualize interpret the byte layout of Solana account types

solana-account-vis Use Seaborn to visually interpret the byte layout of Solana account types Usage from account_visualization import generate_account_

Jarry Xiao 15 Aug 25, 2022
Python Client Library to interface with the Phoenix Realtime Server

supabase-realtime-client Python Client Library to interface with the Phoenix Realtime Server This is a fork of the supabase community realtime client

Anand 2 May 24, 2022
A self-bot for discord, written in Python, which will send you notifications to your desktop if it detects an intruder on your discord server

A self-bot for discord, written in Python, which will send you notifications to your desktop if it detects an intruder on your discord server

LevPrav 1 Jan 11, 2022
LyricsGenius: a Python client for the Genius.com API

LyricsGenius: a Python client for the Genius.com API lyricsgenius provides a simple interface to the song, artist, and lyrics data stored on Genius.co

KevinChunye 2 Jun 30, 2022
Clubhouse API written in Python. Standalone client included. For reference and education purposes only.

clubhouse-py is originally developed for the sake of interoperability. Standalone client is also created with very basic features, including but not limited to the audio-chat

1.7k Jan 05, 2023
Python Client for Instagram API

This project is not actively maintained. Proceed at your own risk! python-instagram A Python 2/3 client for the Instagram REST and Search APIs Install

Facebook Archive 2.9k Dec 30, 2022
A GETTR API client written in Python.

GUTTR A GETTR client library written in Python. I rushed to get this out so it's a bit janky. Open an issue if something is broken or missing. Getting

Roger Johnston 13 Nov 23, 2022
Python Library to Extract youtube video Tags without Youtube API

YoutubeTags Python Library to Extract youtube video Tags without Youtube API Installation pip install YoutubeTags Example import YoutubeTags from Yout

Nuhman Pk 17 Nov 12, 2022
Cyber Userbot

Cyber Userbot

Irham 0 May 26, 2022
A collection of discord tools I've made.

Discord A collection of discord tools i've made. What's in here? Basically every discord related project i've worked on can be found here, i'll try an

?? ?? ?? 6 Nov 13, 2021
WallAlley.bot is an open source and free to use financial discord bot originaly build for WallAlley server's community

WallAlley.bot About WallAlley.bot is an open source and free to use financial discord bot originaly build for WallAlley server's community. All data a

Mohammad KHADDAN 1 Jan 22, 2022
A management system designed for the employees of MIRAS (Art Gallery). It is used to sell/cancel tickets, book/cancel events and keeps track of all upcoming events.

Art-Galleria-Management-System Its a management system designed for the employees of MIRAS (Art Gallery). Backend : Python Frontend : Django Database

Areesha Tahir 8 Nov 30, 2022
PRAW, an acronym for "Python Reddit API Wrapper", is a python package that allows for simple access to Reddit's API.

PRAW: The Python Reddit API Wrapper PRAW, an acronym for "Python Reddit API Wrapper", is a Python package that allows for simple access to Reddit's AP

Python Reddit API Wrapper Development 3k Dec 29, 2022