Predicting Auction Sale Price using the kaggle bulldozer auction sales data: Modeling with Ensembles vs Neural Network

Overview

Predicting Auction Sale Price using the kaggle bulldozer auction sales data: Modeling with Ensembles vs Neural Network

The performances of tree ensembles and neural networks on structured data are evaluated. In addition, the effectiveness of combining neural network and decision trees (such as random trees, histogram based gradient boosting, and xgboost) is investigated. Covariant shift, Random forest's inability to extrapolate, and data leakage are investigated.

A simple 2-layer Neural network outperformed xgboost, followed by random forests. The worst performance based on RMSE was obtained from the histogram based gradient boosting regressor.

Overall, the best rmse (0.220194)--about 4.04% improvement over the kaggle's leaderboard first place score -- was obtained by taking the average of the predictions by the neural network and xgboost regressor.

Key takeaways:

  1. Always start with a baseline

  2. Random forests are generally bad at extrapolating, hence, if there is a shift in the domain between the training input and the validation (or test) inputs, then the random forest model will perform rather poorly on the validation set(or test set).

rf_failure

The red portion of the plot above shows the extrapolation problem. The random forest was trained on the first 70% of the data and used to make predictions on thr full data including the last 30%. It fails because there is an obvious linear trend it was unable to properly capture. Moreover, the predictions by random forests are confined within the range of the training input labels, since random forests make predictions by taking the average of previously observed data. Hence, when the input for prediction is

  1. To improve the performance of random forests, you could attempt to find the columns or features on which the training and validation sets differ the most. You may drop the ones that least impacts the accuracy of the model. To achieve this, I trained a random forest that can tell if a given input is from a training set or validation set. This helped me determine if a validation set has the same or similar distribution as the training set. Lastly, I computed the feature importances. The feature importances for this model revealed the degree of dissimilarity of the features between the training and validation sets. The features with high feature importances are the most dissimilar between the sets. salesID and machineID were significantly different between the sets but impacts RMSE the least, hence they were dropped. Other common approaches taken to improve performance include: finding and removing the redundant features by making similarity plot (shown below), choosing more recent data for both the training and the validation sets.

similarity plot

  1. For forecasting tasks (time dependent targets), the validation set should not be arbitrarily chosen i.e train_test_split may not be your best option for splitting the data. Since you are looking to make predictions on future sales, your validation set should contain more recent data, so that if your model is able to do well on the validation set, then, you can be more confident about its predictions in the future.

  2. Data leakage should be investigated. Signs of data leakage include:

    • Unrealistically high level of performance on the test set
    • Apparently meaningless feature(s) scoring very high on feature importance
    • Partial dependence plots that do not make sense.

popularitypartial_dependence

Observations extracted from the notebook*

Towards the end of the productsize plot, we see an interesting trend. The auction price is at its lowest in the end. This group represent the missing values in our product size. Missing values constitute the greatest percentage in our ProductSize. However, recall that productsize is our third most important feature. So, how is it possible that a feature that is missing so often could be so important to the prediction? The answer may be tied to data leakage. We can theorize that the auctions with missing product size information were not really successful since they were sold at very low prices, as a resutlt, the size information were either removed or intentionally omitted. It is also possible that most of these data were collected after sales were made, and for the sales that were not great, the product size were simply left blank. The intention is completely debatable, it might be intended to provide clue as to the nature of the sale, however, such information can harm our model or even render it completely useless. Clearly, our model could be misled into thinking that missing product size is an indication of low price and as such will always predict a low price whenever the product size attribute is missing. A model afflicted with data leakage will not perform well in production.

  1. An histogram based gradient boosting regressor may not be the best for forecasting on time dependent data. It showed the least peroformance with an RMSE of 0.239826

  2. A simple Neural network can show superior performance on structured data. A 2-layer neural network in which the categorical variables (i.e features with cardinality < 1000) were handled using embeddings showed a 1.93% improvement in RMSE compared to the best random forest model. It also outperformed the xgboost regressor even after the hyperparameters were tuned.

  3. There is some benefit to be derived by using an ensemble of models. In this project, each time, the neural network was combined with any of the trees, a superior performance always ensues. The best performance was obtained from the combination of neural network and the xgboost model.

Owner
Mustapha Unubi Momoh
Python Developer| Data scientist
Mustapha Unubi Momoh
The code of paper 'Learning to Aggregate and Personalize 3D Face from In-the-Wild Photo Collection'

Learning to Aggregate and Personalize 3D Face from In-the-Wild Photo Collection Pytorch implemetation of paper 'Learning to Aggregate and Personalize

Tencent YouTu Research 136 Dec 29, 2022
Self-Supervised Learning

Self-Supervised Learning Features self_supervised offers features like modular framework support for multi-gpu training using PyTorch Lightning easy t

Robin 1 Dec 14, 2021
A Dying Light 2 (DL2) PAKFile Utility for Modders and Mod Makers.

Dying Light 2 PAKFile Utility A Dying Light 2 (DL2) PAKFile Utility for Modders and Mod Makers. This tool aims to make PAKFile (.pak files) modding a

RHQ Online 12 Aug 26, 2022
Depth image based mouse cursor visual haptic

Depth image based mouse cursor visual haptic How to run it. Install pyqt5. Install python modules pip install Pillow pip install numpy For illustrati

Xiong Jie 17 Dec 20, 2022
Multiview Dataset Toolkit

Multiview Dataset Toolkit Using multi-view cameras is a natural way to obtain a complete point cloud. However, there is to date only one multi-view 3D

11 Dec 22, 2022
A resource for learning about deep learning techniques from regression to LSTM and Reinforcement Learning using financial data and the fitness functions of algorithmic trading

A tour through tensorflow with financial data I present several models ranging in complexity from simple regression to LSTM and policy networks. The s

195 Dec 07, 2022
Mail classification with tensorflow and MS Exchange Server (ham or spam).

Mail classification with tensorflow and MS Exchange Server (ham or spam).

Metin Karatas 1 Sep 11, 2021
Simple-Image-Classification - Simple Image Classification Code (PyTorch)

Simple-Image-Classification Simple Image Classification Code (PyTorch) Yechan Kim This repository contains: Python3 / Pytorch code for multi-class ima

Yechan Kim 8 Oct 29, 2022
ServiceX Transformer that converts flat ROOT ntuples into columnwise data

ServiceX_Uproot_Transformer ServiceX Transformer that converts flat ROOT ntuples into columnwise data Usage You can invoke the transformer from the co

Vis 0 Jan 20, 2022
A module that used for encrypt code which includes RSA and AES

软件加密模块 requirement: Crypto,pycryptodome,pyqt5 本地加密信息为随机字符串 使用说明 命令行参数 -h 帮助 -checkWorking 检查是否能正常工作,后接1确认指令 -checkEndDate 检查截至日期,后接1确认指令 -activateCode

2 Sep 27, 2022
🐾 Semantic segmentation of paws from cute pet images (PyTorch)

🐾 paw-segmentation 🐾 Semantic segmentation of paws from cute pet images 🐾 Semantic segmentation of paws from cute pet images (PyTorch) 🐾 Paw Segme

Zabir Al Nazi Nabil 3 Feb 01, 2022
Codes for our paper The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders published to EMNLP 2021.

The Stem Cell Hypothesis Codes for our paper The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders published to EMNLP

Emory NLP 5 Jul 08, 2022
Hydra: an Extensible Fuzzing Framework for Finding Semantic Bugs in File Systems

Hydra: An Extensible Fuzzing Framework for Finding Semantic Bugs in File Systems Paper Finding Semantic Bugs in File Systems with an Extensible Fuzzin

gts3.org (<a href=[email protected])"> 129 Dec 15, 2022
[CVPR'22] Official PyTorch Implementation of Collaborative Transformers for Grounded Situation Recognition

[CVPR'22] Collaborative Transformers for Grounded Situation Recognition Paper | Model Checkpoint This is the official PyTorch implementation of Collab

Junhyeong Cho 29 Dec 10, 2022
Base pretrained models and datasets in pytorch (MNIST, SVHN, CIFAR10, CIFAR100, STL10, AlexNet, VGG16, VGG19, ResNet, Inception, SqueezeNet)

This is a playground for pytorch beginners, which contains predefined models on popular dataset. Currently we support mnist, svhn cifar10, cifar100 st

Aaron Chen 2.4k Dec 28, 2022
Generating retro pixel game characters with Generative Adversarial Networks. Dataset "TinyHero" included.

pixel_character_generator Generating retro pixel game characters with Generative Adversarial Networks. Dataset "TinyHero" included. Dataset TinyHero D

Agnieszka Mikołajczyk 88 Nov 17, 2022
TinyML Cookbook, published by Packt

TinyML Cookbook This is the code repository for TinyML Cookbook, published by Packt. Author: Gian Marco Iodice Publisher: Packt About the book This bo

Packt 93 Dec 29, 2022
Official code of the paper "ReDet: A Rotation-equivariant Detector for Aerial Object Detection" (CVPR 2021)

ReDet: A Rotation-equivariant Detector for Aerial Object Detection ReDet: A Rotation-equivariant Detector for Aerial Object Detection (CVPR2021), Jiam

csuhan 334 Dec 23, 2022
Regularizing Nighttime Weirdness: Efficient Self-supervised Monocular Depth Estimation in the Dark (ICCV 2021)

Regularizing Nighttime Weirdness: Efficient Self-supervised Monocular Depth Estimation in the Dark (ICCV 2021) Kun Wang, Zhenyu Zhang, Zhiqiang Yan, X

kunwang 66 Nov 24, 2022
Implementation of FSGNN

FSGNN Implementation of FSGNN. For more details, please refer to our paper Experiments were conducted with following setup: Pytorch: 1.6.0 Python: 3.8

19 Dec 05, 2022