Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python

Overview

Driver Analysis with Factors and Forests: An Automated Data Science Tool using Python 📊

Last updated on January 30, 2022 by Thomas J. Nicoletti

I would like to preface this document by stating this is my second major project using Python. From my first project to now, I certainly improved upon my understanding of and proficiency with Python, though I still have a long journey ahead of me. I aim to keep learning more and more everyday, and hope this project provides some benefit to the greater applied social science community.

The purpose of this data mining script is to use random forest classification, in conjunction with factor analysis and other analytic techniques, to automatically yield feature importance metrics and related output for a driver analysis. Driver analysis quantifies the importance of independent variables (i.e., drivers) in predicting some outcome variable. Within this repository is a basic, simulated dataset created by me, containing five independent variables and one outcome variable. I am by no means an expert in simulating datasets, so I encourage everyone to use real-world data as a stress test for this statistical tool.

This tool will communicate with users using simple inputs via the Command Prompt. Once all mandatory and optional inputs are received, the analysis will run and send relevant information to the source folder; this potentially includes text files, images, and data files useful for model comprehension and validation, as well as statistically- and conceptually-informed decision-making. The most useful outputs will include the automatically generated feature importance plot and feature quadrant chart.

💻 Installation and Preparation

Please note that excerpts of code provided below are examples based on the driver.py script. As a self-taught programmer, I suggest reading through my insights, mixing them with a quick Google search and your own experiences, and then delving into the script itself.

For this project, I used Python 3.9, the Microsoft Windows operating system, and Microsoft Excel. As such, these act as the prerequisites for utilizing this repository successfully without any additional troubleshooting. Going forward, please ensure everything you download or install for this project ends up in the correct location (e.g., the same source folder).

Use pip to install relevant packages to the proper source folder using the Command Prompt and correct PATH. For example:

pip install numpy
pip install pandas

Please be sure to install each of the following packages: easygui, matplotlib, numpy, pandas, seaborn, string, factor_analyzer, scipy, sklearn, and statsmodels. If required, use the first section of the script to determine lacking dependencies, and proceed accordingly.

📑 Script Breakdown

The script begins by calling relevant libraries in Python, as well as defining Mahalanobis distance, which is used to identify multivariate outliers in a later step of this project. Additionally, the Command Prompt will read a simple set of instructions for the user, including important information regarding categorical features, the location of the outcome variable within the dataset, and a required revision for missing data. Furthermore, the script will allow the user to specify a random seed for easy replication of this driver analysis at a later date:

import easygui
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
...
def mahalanobis(x = None, data = None, cov = None):
	mu = x - np.mean(data)
    ...
	return mah.diagonal()
...
seed = int(input('Please enter your numerical random seed for replication purposes: '))
np.random.seed(seed)
text = open('random_seed.txt', 'w')

The script has an entire section dedicated to understanding your dataset, including a quick process for uploading your data file, removing missing data, adding an outlier status variable, determining the final sample size, classifying variables, and so on:

df = pd.read_csv(easygui.fileopenbox())
df.dropna(inplace = True)
df['Mahalanobis'] = mahalanobis(x = df, data = df.iloc[:, :len(df.columns)], cov = None)
df['PValue'] = 1 - chi2.cdf(df['Mahalanobis'], len(df.columns) - 1)
...
n = df.shape[0]
text = open('sample_size.txt', 'w')
...
x = df.iloc[:, :-1]
y = np.ravel(df.iloc[:, -1])
feat = df.columns[:-1]
mean = np.array(df.describe().loc['mean'][:-1])

The script then checks for relevant statistical assumptions needed before determining if your dataset is appropriate for factor analysis. This includes Bartlett's Test of Sphericity and the Kaiser-Meyer-Olkin Test. Additionally, a scree plot is produced using principal components analysis to assist in factor analysis decision-making. Once all of this is reviewed, the user will provide relevant inputs regarding their driver analysis model:

bart = calculate_bartlett_sphericity(x)
bart = (str(round(bart[1], 2)))
text = open('sphericity.txt', 'w')
...
kmo = calculate_kmo(x)
kmo = (str(round(kmo[1], 2)))
text = open('factorability.txt', 'w')
...
pca = PCA()
pca.fit(x)
comp = np.arange(pca.n_components_)
plt.figure()

When it comes to choosing whether to run random forest classification on the original variables or transformed factors, the above information is critical. The user will be able to decide both A) whether or not to use factor analysis, and B) how many factors should be used in extraction if applicable. Additionally, if the user opts for the factor analysis route, they will also be able to determine whether all the factors or just the highest loading variable per factor should be used (please see lines 139-150 in the script). The following optional factor analysis and mandatory core analyses will run based on user specifications from the previous step:

fa = FactorAnalysis(n_components = factor, max_iter = 3000, rotation = 'varimax')
...
x = fa.transform(x)
...
load = pd.DataFrame(fa.components_.T.round(2), columns = cols, index = feat)
load.to_csv('factor_loadings.csv')
...
vif = pd.Series(variance_inflation_factor(x.values, i) for i in range(x.shape[1]))
vif = pd.DataFrame(np.array(vif.round(2)), columns = ['Variable Inflation Factor'], index = feat)
vif.T.to_csv('variable_inflation_factors.csv')
clf = RandomForestClassifier(n_estimators = 100, criterion = 'gini', max_features = 'auto', bootstrap = True, oob_score = True, class_weight = 'balanced').fit(x, y)
oob = str(round(clf.oob_score_, 2)).ljust(4, '0')
pred = clf.predict_proba(x)
loss = str(round(log_loss(y, pred), 2)).ljust(4, '0')
perf = pd.DataFrame({'Out-of-Bag Score': oob, 'Log Loss': loss}, index = ['Estimate'])
perf.to_csv('model_performance.csv')

Please note, the only current rotation method available in Python for factor analysis is varimax, as far as I know. If another rotation method is preferred, I would opt out of the factor analysis route, or try implementing your own solution from scratch. From these results, the feature importance plot and its respective feature quadrant chart can be graphed and saved automatically to the source folder. This is an especially useful and efficient data visualization tool to help express which variable(s) are most important in predicting your outcome. It also saves you quite a bit of time compared to graphing it yourself!

imp = clf.feature_importances_
sort = np.argsort(imp)
plt.figure()
plt.barh(range(len(sort)), imp[sort], color = 'mediumaquamarine', align = 'center')
plt.title('Feature Importance Plot')
plt.xlabel('Derived Importance →')
...
imps = []
score = []
for i, feat in enumerate(imp[sort]):
  imps.append(round(feat / imp[sort].mean() * 100, 0))
for i, feat in enumerate(mean[sort]):
  score.append(round(feat / mean[sort].mean() * 100, 0))
quad = pd.DataFrame({'Rescaled Observed Score →': score, 'Rescaled Derived Importance →': imps,
  'Feature': x.columns[sort]})

To run the script, I suggest using a batch file located in the source folder as follows:

python driver.py
PAUSE

Although the entire script is not reflected in the above breakdown, this information should prove helpful in getting the user accustomed to what this script aims to achieve. If any additional information and/or explanations are desired, please do not hesitate in reaching out!

📋 Next Steps

Although I feel this project is solid in its current state, I think one area of improvement would fall in the realm of optimizing the script and making it more pythonic. I am also quite interested in hearing feedback from users, including their field of practice, which variables they used for their analyses, and how satisfied they were with this statistical tool overall.

💡 Community Contribution

I am always happy to receive feedback, recommendations, and/or requests from anyone, especially new learners. Please click here for information about the license for this project.

Project Support

Please let me know if you plan to make changes to this project, or adapt the script to a project of your own interest. We can certainly collaborate to make this process as painless as possible!

📚 Additional Resources

  • My current work in market research introduced me to the idea of driver analysis and its usefulness; this statistical tool was created with that space in mind, though it is certainly applicable to all applied areas of business and social science
  • To learn more about calculating random forest classification in Python, click here to access scikit-learn
  • To learn more about calculating factor analysis in Python, click here to access scikit-learn
  • For easy-to-use text editing software, check out Sublime Text for Python and Atom for Markdown
Owner
Thomas
With a passion for research, I am eager to build upon my knowledge of statistical programming. My current areas of focus include data mining and psychometrics.
Thomas
yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data.

The yt Project yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data. yt supports structured, varia

The yt project 367 Dec 25, 2022
Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python

Sentiment analysis on streaming twitter data using Spark Structured Streaming & Python This project is a good starting point for those who have little

Himanshu Kumar singh 2 Dec 04, 2021
Streamz helps you build pipelines to manage continuous streams of data

Streamz helps you build pipelines to manage continuous streams of data. It is simple to use in simple cases, but also supports complex pipelines that involve branching, joining, flow control, feedbac

Python Streamz 1.1k Dec 28, 2022
HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets

HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets that can be described as multidimensional arrays o

HyperSpy 411 Dec 27, 2022
A data structure that extends pyspark.sql.DataFrame with metadata information.

MetaFrame A data structure that extends pyspark.sql.DataFrame with metadata info

Invent Analytics 8 Feb 15, 2022
Stock Analysis dashboard Using Streamlit and Python

StDashApp Stock Analysis Dashboard Using Streamlit and Python If you found the content useful and want to support my work, you can buy me a coffee! Th

StreamAlpha 27 Dec 09, 2022
Flood modeling by 2D shallow water equation

hydraulicmodel Flood modeling by 2D shallow water equation. Refer to Hunter et al (2005), Bates et al. (2010). Diffusive wave approximation Local iner

6 Nov 30, 2022
DaCe is a parallel programming framework that takes code in Python/NumPy and other programming languages

aCe - Data-Centric Parallel Programming Decoupling domain science from performance optimization. DaCe is a parallel programming framework that takes c

SPCL 330 Dec 30, 2022
pyhsmm MITpyhsmm - Bayesian inference in HSMMs and HMMs. MIT

Bayesian inference in HSMMs and HMMs This is a Python library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and expli

Matthew Johnson 527 Dec 04, 2022
Handle, manipulate, and convert data with units in Python

unyt A package for handling numpy arrays with units. Often writing code that deals with data that has units can be confusing. A function might return

The yt project 304 Jan 02, 2023
Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Karate Club is an unsupervised machine learning extension library for NetworkX. Please look at the Documentation, relevant Paper, Promo Video, and Ext

Benedek Rozemberczki 1.8k Jan 09, 2023
A Numba-based two-point correlation function calculator using a grid decomposition

A Numba-based two-point correlation function (2PCF) calculator using a grid decomposition. Like Corrfunc, but written in Numba, with simplicity and hackability in mind.

Lehman Garrison 3 Aug 24, 2022
A Python package for modular causal inference analysis and model evaluations

Causal Inference 360 A Python package for inferring causal effects from observational data. Description Causal inference analysis enables estimating t

International Business Machines 506 Dec 19, 2022
This python script allows you to manipulate the audience data from Sl.ido surveys

Slido-Automated-VoteBot This python script allows you to manipulate the audience data from Sl.ido surveys Since Slido blocks interference from automat

Pranav Menon 1 Jan 24, 2022
Shot notebooks resuming the main functions of GeoPandas

Shot notebooks resuming the main functions of GeoPandas, 2 notebooks written as Exercises to apply these functions.

1 Jan 12, 2022
ICLR 2022 Paper submission trend analysis

Visualize ICLR 2022 OpenReview Data

Jintang Li 75 Dec 06, 2022
Python reader for Linked Data in HDF5 files

Linked Data are becoming more popular for user-created metadata in HDF5 files.

The HDF Group 8 May 17, 2022
Sample code for Harry's Airflow online trainng course

Sample code for Harry's Airflow online trainng course You can find the videos on youtube or bilibili. I am working on adding below things: the slide p

102 Dec 30, 2022
Collections of pydantic models

pydantic-collections The pydantic-collections package provides BaseCollectionModel class that allows you to manipulate collections of pydantic models

Roman Snegirev 20 Dec 26, 2022
Detailed analysis on fraud claims in insurance companies, gives you information as to why huge loss take place in insurance companies

Insurance-Fraud-Claims Detailed analysis on fraud claims in insurance companies, gives you information as to why huge loss take place in insurance com

1 Jan 27, 2022