A Guide for Feature Engineering and Feature Selection, with implementations and examples in Python.

Overview

Feature Engineering & Feature Selection

A comprehensive guide [pdf] [markdown] for Feature Engineering and Feature Selection, with implementations and examples in Python.

Motivation

Feature Engineering & Selection is the most essential part of building a useable machine learning project, even though hundreds of cutting-edge machine learning algorithms coming in these days like deep learning and transfer learning. Indeed, like what Prof Domingos, the author of  'The Master Algorithm' says:

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”

— Prof. Pedro Domingos

001

Data and feature has the most impact on a ML project and sets the limit of how well we can do, while models and algorithms are just approaching that limit. However, few materials could be found that systematically introduce the art of feature engineering, and even fewer could explain the rationale behind. This repo is my personal notes from learning ML and serves as a reference for Feature Engineering & Selection.

Download

Download the PDF here:

Same, but in markdown:

PDF has a much readable format, while Markdown has auto-generated anchor link to navigate from outer source. GitHub sucks at displaying markdown with complex grammar, so I would suggest read the PDF or download the repo and read markdown with Typora.

What You'll Learn

Not only a collection of hands-on functions, but also explanation on Why, How and When to adopt Which techniques of feature engineering in data mining.

  • the nature and risk of data problem we often encounter
  • explanation of the various feature engineering & selection techniques
  • rationale to use it
  • pros & cons of each method
  • code & example

Getting Started

This repo is mainly used as a reference for anyone who are doing feature engineering, and most of the modules are implemented through scikit-learn or its communities.

To run the demos or use the customized function, please download the ZIP file from the repo or just copy-paste any part of the code you find helpful. They should all be very easy to understand.

Required Dependencies:

  • Python 3.5, 3.6 or 3.7
  • numpy>=1.15
  • pandas>=0.23
  • scipy>=1.1.0
  • scikit_learn>=0.20.1
  • seaborn>=0.9.0

Table of Contents and Code Examples

Below is a list of methods currently implemented in the repo.

1. Data Exploration

2. Feature Cleaning

3. Feature Engineering

4. Feature Selection

Key Links and Resources

  • Udemy's Feature Engineering online course

https://www.udemy.com/feature-engineering-for-machine-learning/

  • Udemy's Feature Selection online course

https://www.udemy.com/feature-selection-for-machine-learning

  • JMLR Special Issue on Variable and Feature Selection

http://jmlr.org/papers/special/feature03.html

  • Data Analysis Using Regression and Multilevel/Hierarchical Models, Chapter 25: Missing data

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

  • Data mining and the impact of missing data

http://core.ecu.edu/omgt/krosj/IMDSDataMining2003.pdf

  • PyOD: A Python Toolkit for Scalable Outlier Detection

https://github.com/yzhao062/pyod

  • Weight of Evidence (WoE) Introductory Overview

http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview

  • About Feature Scaling and Normalization

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

  • Feature Generation with RF, GBDT and Xgboost

https://blog.csdn.net/anshuai_aw1/article/details/82983997

  • A review of feature selection methods with applications

https://ieeexplore.ieee.org/iel7/7153596/7160221/07160458.pdf

Owner
Yimeng.Zhang
I'm a lovely machine learning learner~
Yimeng.Zhang
→ Plantilla de registro para Python

🔧 Pasos Necesarios CMD 🖥️ SOCKETS pip install sockets 🎨 COLORAMA pip install colorama 💻 Código register-by-inputs from turtle import color # Impor

Panda.xyz 4 Mar 12, 2022
PSP (Python Starter Package) is meant for those who want to start coding in python but are new to the coding scene.

Python Starter Package PSP (Python Starter Package) is meant for those who want to start coding in python, but are new to the coding scene. We include

Giter/ 1 Nov 20, 2021
Team Hash Brown Science4Cast Submission

Team Hash Brown Science4Cast Submission This code reproduces Team Hash Brown's (@princengoc, @Xieyangxinyu) best submission (ee5a) for the competition

3 Feb 02, 2022
Find your desired product in Digikala using this app.

Digikala Search Find your desired product in Digikala using this app. با این برنامه محصول مورد نظر خود را در دیجیکالا پیدا کنید. About me Full name: M

Matin Ardestani 17 Sep 15, 2022
A simple website-based resource monitor for slurm system.

Slurm Web A simple website-based resource monitor for slurm system. Screenshot Required python packages flask, colored, humanize, humanfriendly, beart

Tengda Han 17 Nov 29, 2022
Data wrangling & common calculations for results from qMem measurement software

qMem Datawrangler This script processes output of qMem measurement software into an Origin ® compatible *.csv files and matplotlib graphs to quickly v

Julian 1 Nov 30, 2021
Palestra sobre desenvolvimento seguro de imagens e containers para a DockerCon 2021 sala Brasil

Segurança de imagens e containers direto na pipeline Palestra sobre desenvolvimento seguro de imagens e containers para a DockerCon 2021 sala Brasil.

Fernando Guisso 10 May 19, 2022
Placeholders is a single-unit storage solution for your Frontend.

Placeholder Placeholders is a single-unit file storage solution for your Frontend. Why Placeholder? Generally, when a website/service requests for fil

Tanmoy Sen Gupta 1 Nov 09, 2021
Roblox Limited Sniper For Python

Info this is version 2.1 version 3 will support more options (install python: https://www.python.org) the program will buy any limited item with a pri

1 Dec 09, 2021
Fork of pathlib aiming to support the full stdlib Python API.

pathlib2 Fork of pathlib aiming to support the full stdlib Python API. The old pathlib module on bitbucket is in bugfix-only mode. The goal of pathlib

Jazzband 73 Dec 23, 2022
Anki Addon idea by gbrl.sc to see previous ratings of a card in the reviewer

Card History At A Glance Stop having to press card browser and ctrl+i for every card and then WINCING to see it's history of reviews FEATURES Visualiz

Jerry Zhou 11 Dec 19, 2022
Completed task 1 and task 2 at LetsGrowMore as a data science intern.

LetsGrowMore-Internship Completed task 1 and task 2 at LetsGrowMore as a data science intern. Task 1- Task 2- Creating a Decision Tree classifier and

Sanjyot Panure 1 Jan 16, 2022
Batch generate asset browser previews

When dealing with hundreds of library files it becomes tedious to mark their contents as assets. Using python to automate the process is a perfect fit

54 Dec 24, 2022
A Python version of Canvacord

A copy of canvacord made in python! Table of contents Installation Examples Creating Images Links Downloads Installation Run any of these commands in

10 Mar 28, 2022
Functions to analyze Cell-ID single-cell cytometry data using python language.

PyCellID (building...) Functions to analyze Cell-ID single-cell cytometry data using python language. Dependecies for this project. attrs(=21.1.0) fo

0 Dec 22, 2021
A powerful and user-friendly binary analysis platform!

angr angr is a platform-agnostic binary analysis framework. It is brought to you by the Computer Security Lab at UC Santa Barbara, SEFCOM at Arizona S

6.3k Jan 02, 2023
Self sustained producer-consumer(prosumer) policy study using Python and Gurobi

Prosumer Policy This project aims to model the optimum dispatch behaviour of households with PV and battery systems under different policy instrument

Tom Xu 3 Aug 31, 2022
creates a batch file that uses adb to auto-install apks into the Windows Subsystem for Android and registers it as the default application to open apks.

wsa-apktool creates a batch file that uses adb to auto-install apks into the Windows Subsystem for Android and registers it as the default application

Aditya Vikram 3 Apr 05, 2022
Provides guideline on how to configure pre-commit hooks in your own python project

Pre-commit Configuration Guide The main aim of this repository is to act as a guide on how to configure the pre-commit hooks in your existing python p

Faraz Ahmed Khan 2 Mar 31, 2022
Active Transport Analytics Model: A new strategic transport modelling and data visualization framework

{ATAM} Active Transport Analytics Model Active Transport Analytics Model (“ATAM”

ATAM Analytics 2 Dec 21, 2022