Mining the Stack Overflow Developer Survey

Overview

Mining the Stack Overflow Developer Survey

A prototype data mining application to compare the accuracy of decision tree and random forest regression models to predict annual compensation of tech workers in the US and Europe.

Objectives

Usage

To run, download the repository and execute the file main.py in the src directory with your python path variable. For example, python3 main.py.

Dependencies

  • python 3.8.1 and up
  • pandas 1.3.4 and up
  • matplotlib 3.4.3 and up
  • numpy 1.21.0 and up
  • sklearn 1.0.1 and up

Methodology

Preprocessing

The original data set provided by Stack Overflow contained 48 attribute columns and 83439 data records. Due to the large size of the data set, we wanted to narrow our focus to a certain subset of the data. In the preprocessing of the original data file, we decided to discard any records that were not employed full-time in the technology industry. Any record that did not contain country, converted annual salary, or yeared coded was also discarded, as this data is vital to our model. We also discarded some of the columns from the original data set that were open-ended. Out of the records that fit our requirements, we exported them to two output csv files. Records of United States data were put together in one output file, and records of European countries were put in the other. Data from any other countries were discarded. Once we have the two cleaned files, we applied additional preprocessing techniques. Any missing attributes that remained were replaced with 'NA' if the attributes were nominal. Two special cases existed in the columns for years coded and years coded professionally. Most contained a numerical value for the years, but some had a string for 'Less than one year' and 'More than 50 years'. These strings were replaced with 0 and 50, respectively, to keep these columns numerical. With these preprocessing steps complete, the data files are now ready to be processed to generate the models.

Models

We evaluated a variety of data mining models and algorithms to find the ones that would make the most sense for our data set and objectives. With our goal of predicting a numerical value for annual salary, we knew we needed to use a compatible regression model. We found regression models for decision trees and random forests and wanted to compare their accuracy. We wanted to see how the accuracy of a single decision tree compares to the accuracy of a random forest model, which is a number of trees together. The results are detailed in the results and analysis section. Below are the implementation details of each model.

Decision tree model

We selected the DecisionTreeRegressor model from the Scikit Learn machine learning package. In order to get the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy to validate. The parameter we changed was the maximum depth level of each tree. Additional factors that affect the model are the testing split percentage and the cross validation folds. For our models, we used 20% of the data as testing and 80% as training and a cross validation value of 10. Out of every combination we tried, we found that a maximum depth of ADD RES HERE resulted in the most accurate decision tree model. The accuracy of the model was ADD RES HERE. This model will output the tree itself, several statistics of the model such as R-squared, mean absolute error, and mean squared error, and the ten attributes that have the largest weight in determining the result. With the best model selected, we then validated it against the testing data set. These steps of model generation were done for both the US data and the European data.

Random forest model

We selected the RandomForestRegressor model from the Scikit Learn machine learning package. In order to get the most accurate model, we trained several models with different parameters and selected the one with the highest accuracy to validate. The parameters we changed were the number of trees to estimate with and the maximum depth level of each tree. Additional factors that affect the model are the testing split percentage and the cross validation folds. For our models, we used 20% of the data as testing and 80% as training and a cross validation value of 10. Out of every combination we tried, we found that ADD RES HERE trees in the forest with a maximum depth of ADD RES HERE resulted in the most accurate random forest model. The accuracy of the model was ADD RES HERE. This model will output the tree itself, several statistics of the model such as R-squared, mean absolute error, and mean squared error, and the ten attributes that have the largest weight in determining the result. With the best model selected, we then validated it against the testing data set. These steps of model generation were done for both the US data and the European data.

Results and Analysis

Authors

PyPSA: Python for Power System Analysis

1 Python for Power System Analysis Contents 1 Python for Power System Analysis 1.1 About 1.2 Documentation 1.3 Functionality 1.4 Example scripts as Ju

758 Dec 30, 2022
Convert tables stored as images to an usable .csv file

Convert an image of numbers to a .csv file This Python program aims to convert images of array numbers to corresponding .csv files. It uses OpenCV for

711 Dec 26, 2022
A lightweight, hub-and-spoke dashboard for multi-account Data Science projects

A lightweight, hub-and-spoke dashboard for cross-account Data Science Projects Introduction Modern Data Science environments often involve many indepe

AWS Samples 3 Oct 30, 2021
Modular analysis tools for neurophysiology data

Neuroanalysis Modular and interactive tools for analysis of neurophysiology data, with emphasis on patch-clamp electrophysiology. Functions for runnin

Allen Institute 5 Dec 22, 2021
Demonstrate a Dataflow pipeline that saves data from an API into BigQuery table

Overview dataflow-mvp provides a basic example pipeline that pulls data from an API and writes it to a BigQuery table using GCP's Dataflow (i.e., Apac

Chris Carbonell 1 Dec 03, 2021
Integrate bus data from a variety of sources (batch processing and real time processing).

Purpose: This is integrate bus data from a variety of sources such as: csv, json api, sensor data ... into Relational Database (batch processing and r

1 Nov 25, 2021
A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

TennisBusinessIntelligenceProject - A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

carlo paladino 1 Jan 02, 2022
Udacity-api-reporting-pipeline - Udacity api reporting pipeline

udacity-api-reporting-pipeline In this exercise, you'll use portions of each of

Fabio Barbazza 1 Feb 15, 2022
Using Python to derive insights on particular Pokemon, Types, Generations, and Stats

Pokémon Analysis Andreas Nikolaidis February 2022 Introduction Exploratory Analysis Correlations & Descriptive Statistics Principal Component Analysis

Andreas 1 Feb 18, 2022
Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which he recommends to buy. We will use this data to build a portfolio

Backtesting the "Cramer Effect" & Recommendations from Cramer Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which

Gábor Vecsei 12 Aug 30, 2022
BIGDATA SIMULATION ONE PIECE WORLD CENSUS

ONE PIECE is a Japanese manga of great international success. The story turns inhabited in a fictional world, tells the adventures of a young man whose body gained rubber properties after accidentall

Maycon Cypriano 3 Jun 30, 2022
Python Practicum - prepare for your Data Science interview or get a refresher.

Python-Practicum Python Practicum - prepare for your Data Science interview or get a refresher. Data Data visualization using data on births from the

Jovan Trajceski 1 Jul 27, 2021
CaterApp is a cross platform, remotely data sharing tool created for sharing files in a quick and secured manner.

CaterApp is a cross platform, remotely data sharing tool created for sharing files in a quick and secured manner. It is aimed to integrate this tool with several more features including providing a U

Ravi Prakash 3 Jun 27, 2021
Tools for the analysis, simulation, and presentation of Lorentz TEM data.

ltempy ltempy is a set of tools for Lorentz TEM data analysis, simulation, and presentation. Features Single Image Transport of Intensity Equation (SI

McMorran Lab 1 Dec 26, 2022
Sensitivity Analysis Library in Python (Numpy). Contains Sobol, Morris, Fractional Factorial and FAST methods.

Sensitivity Analysis Library (SALib) Python implementations of commonly used sensitivity analysis methods. Useful in systems modeling to calculate the

SALib 663 Jan 05, 2023
Incubator for useful bioinformatics code, primarily in Python and R

Collection of useful code related to biological analysis. Much of this is discussed with examples at Blue collar bioinformatics. All code, images and

Brad Chapman 560 Jan 03, 2023
💬 Python scripts to parse Messenger, Hangouts, WhatsApp and Telegram chat logs into DataFrames.

Chatistics Python 3 scripts to convert chat logs from various messaging platforms into Pandas DataFrames. Can also generate histograms and word clouds

Florian 893 Jan 02, 2023
Codes for the collection and predictive processing of bitcoin from the API of coinmarketcap

Codes for the collection and predictive processing of bitcoin from the API of coinmarketcap

Teo Calvo 5 Apr 26, 2022
Stitch together Nanopore tiled amplicon data without polishing a reference

Stitch together Nanopore tiled amplicon data using a reference guided approach Tiled amplicon data, like those produced from primers designed with pri

Amanda Warr 14 Aug 30, 2022