A Guide for Feature Engineering and Feature Selection, with implementations and examples in Python.

Overview

Feature Engineering & Feature Selection

A comprehensive guide [pdf] [markdown] for Feature Engineering and Feature Selection, with implementations and examples in Python.

Motivation

Feature Engineering & Selection is the most essential part of building a useable machine learning project, even though hundreds of cutting-edge machine learning algorithms coming in these days like deep learning and transfer learning. Indeed, like what Prof Domingos, the author of  'The Master Algorithm' says:

“At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”

— Prof. Pedro Domingos

001

Data and feature has the most impact on a ML project and sets the limit of how well we can do, while models and algorithms are just approaching that limit. However, few materials could be found that systematically introduce the art of feature engineering, and even fewer could explain the rationale behind. This repo is my personal notes from learning ML and serves as a reference for Feature Engineering & Selection.

Download

Download the PDF here:

Same, but in markdown:

PDF has a much readable format, while Markdown has auto-generated anchor link to navigate from outer source. GitHub sucks at displaying markdown with complex grammar, so I would suggest read the PDF or download the repo and read markdown with Typora.

What You'll Learn

Not only a collection of hands-on functions, but also explanation on Why, How and When to adopt Which techniques of feature engineering in data mining.

  • the nature and risk of data problem we often encounter
  • explanation of the various feature engineering & selection techniques
  • rationale to use it
  • pros & cons of each method
  • code & example

Getting Started

This repo is mainly used as a reference for anyone who are doing feature engineering, and most of the modules are implemented through scikit-learn or its communities.

To run the demos or use the customized function, please download the ZIP file from the repo or just copy-paste any part of the code you find helpful. They should all be very easy to understand.

Required Dependencies:

  • Python 3.5, 3.6 or 3.7
  • numpy>=1.15
  • pandas>=0.23
  • scipy>=1.1.0
  • scikit_learn>=0.20.1
  • seaborn>=0.9.0

Table of Contents and Code Examples

Below is a list of methods currently implemented in the repo.

1. Data Exploration

2. Feature Cleaning

3. Feature Engineering

4. Feature Selection

Key Links and Resources

  • Udemy's Feature Engineering online course

https://www.udemy.com/feature-engineering-for-machine-learning/

  • Udemy's Feature Selection online course

https://www.udemy.com/feature-selection-for-machine-learning

  • JMLR Special Issue on Variable and Feature Selection

http://jmlr.org/papers/special/feature03.html

  • Data Analysis Using Regression and Multilevel/Hierarchical Models, Chapter 25: Missing data

http://www.stat.columbia.edu/~gelman/arm/missing.pdf

  • Data mining and the impact of missing data

http://core.ecu.edu/omgt/krosj/IMDSDataMining2003.pdf

  • PyOD: A Python Toolkit for Scalable Outlier Detection

https://github.com/yzhao062/pyod

  • Weight of Evidence (WoE) Introductory Overview

http://documentation.statsoft.com/StatisticaHelp.aspx?path=WeightofEvidence/WeightofEvidenceWoEIntroductoryOverview

  • About Feature Scaling and Normalization

http://sebastianraschka.com/Articles/2014_about_feature_scaling.html

  • Feature Generation with RF, GBDT and Xgboost

https://blog.csdn.net/anshuai_aw1/article/details/82983997

  • A review of feature selection methods with applications

https://ieeexplore.ieee.org/iel7/7153596/7160221/07160458.pdf

Owner
Yimeng.Zhang
I'm a lovely machine learning learner~
Yimeng.Zhang
A simple python script that print the Mandelbrot set for every power of the formal formula.

Python Mandelbrot A simple python script that print the Mandelbrot set for every power of the formal formula.

Paride Giunta 2 Apr 15, 2022
This is a simple analogue clock made with turtle in python...

Analogue-Clock This is a simple analogue clock made with turtle in python... Requirements None, only you need to have windows 😉 ...Enjoy! Installatio

Abhyush 3 Jan 14, 2022
Python library and cli util for https://www.zerochan.net/

Zerochan Library for Zerochan.net with pics parsing and downloader included! Features CLI utility for pics downloading from zerochan.net Library for c

kiriharu 10 Oct 11, 2022
Buffer overflow example for python

Buffer overflow example for python

Mehmet 1 Jan 04, 2022
A small program to vote for Councilors at 42 Heilbronn.

This Docker container is build to run on server an provide an easy to use interface for every student to vote for their councillors. To run docker on

Kevin Hirsig 2 Jan 17, 2022
Agora-token-helper - Some help tools for AgoraToken

Agora Token Helper Support AgoraToken version 001 - 006. But for security reason

A MCPI hack with many features.

Morpheus 2.0 A MCPI hack with many features To Use: You will need to install the keyboard, pysimplegui, and MCPI python modules and you will need to e

11 Oct 11, 2022
Repo with data from local elections in Portugal from 2009 to 2013

autarquicas - local elections in Portugal Repo with data from local elections in Portugal from 2009 to 2013 Objective To provide, to all, raw data fro

Jorge Gomes 6 Apr 06, 2022
A tool for checking if the external data used in Flatpak manifests is still up to date

Flatpak External Data Checker This is a tool for checking for outdated or broken links of external data in Flatpak manifests. Motivation Flatpak apps

Flathub 76 Dec 24, 2022
Experimental proxy for dumping the unencrypted packet data from Brawl Stars (WIP)

Brawl Stars Proxy Experimental proxy for version 39.99 of Brawl Stars. It allows you to capture the packets being sent between the Brawl Stars client

4 Oct 29, 2021
Type Persian without confusing words for yourself and others, in Adobe Connect

About In the Adobe Connect chat section, to type in Persian or Arabic, the written words will be confused and will be written and sent illegibly (This

Matin Najafi 23 Nov 26, 2021
Мой первый калькулятор!!!!!!

my_first_calculator Первый калькулятор созданный мною на питоне Версия калькулятора: 0.0.4 Как скачать? TERMUX Для скрипта нужен питон, скачиваем pkg

Lesha Russkiyov 2 Dec 29, 2021
A simple script for generating screenshots with Vapoursynth

Vapoursynth-Screenshots A simple script for generating screenshots with Vapoursynth. About I'm lazy, and hate changing variables for each batch of scr

7 Dec 31, 2022
Push a record and you will receive a email when that date

Push a record and you will receive a email when that date

5 Nov 28, 2022
This script can be used to get unlimited Gb for WARP.

Warp-Unlimited-GB This script can be used to get unlimited Gb for WARP. How to use Change the value of the 'referrer' to warp id of yours You can down

Anix Sam Saji 1 Feb 14, 2022
A powerful and user-friendly binary analysis platform!

angr angr is a platform-agnostic binary analysis framework. It is brought to you by the Computer Security Lab at UC Santa Barbara, SEFCOM at Arizona S

6.3k Jan 02, 2023
GUI for the Gammu library.

Wammu GUI for the Gammu library. Homepage https://wammu.eu/ License GNU GPL version 3 or later. First start On first start you will be asked for set

Gammu 60 Dec 14, 2022
Osu statistics right on your desktop, made with pyqt

Osu!Stat Osu statistics right on your desktop, made with Qt5 Credits Would like to thank these creators for their projects and contributions. ppy, osu

Aditya Gupta 21 Jul 13, 2022
This repository contains various tools useful for offensive operations (reversing, etc) regarding the PE (Portable Executable) format

PE-Tools This repository contains various tools useful for offensive operations (reversing, etc) regarding the PE (Portable Executable) format Install

stark0de 4 Oct 13, 2022
This is an implementation of NeuronJ work with python.

NeuronJ This is an implementation of NeuronJ work with python. NeuronJ is a plug-in for ImageJ that allows you to create and edit neurons masks. Image

Mohammad Mahdi Samei 3 Aug 28, 2022