Using Python to derive insights on particular Pokemon, Types, Generations, and Stats

Overview

Pokémon Analysis

Andreas Nikolaidis

February 2022

Introduction

In this project, I use Python to analyze stats for all Pokémon in Generations 1-8 and calculate some interesting statistics based on a number of factors.

We can use this data to answer questions such as:

  • Does a Pokémon's type determine its stats, such as HP, Attack, and Defense?
  • Which stat is most important for predicting other stats, i.e. which stats are highly correlated?

In the following sections, I will walk through my process of extracting and analyzing the information using pandas DataFrames, creating some visualizations, and performing modeling with scikit-learn.

Exploratory Analysis

Start by importing all the necessary packages into Python:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt 
import seaborn as sns 
import plotly.express as px
import plotly.graph_objects as go

sns.set_style('whitegrid')
%matplotlib inline

# Import for Linear Regression
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

Read Data File:

df = pd.read_excel("pokemon.xlsx")

Create a separate dataframe including just the necessary stats:

df_stats = df[["Name","HP","Attack","Defense","SP_Attack","SP_Defense","Speed"]]

Although each stat is important in its own right, the total value of all stats is what determines the category of a Pokémon, so let's append a column to the DataFrame that sums up the total values:

df['total'] = df.HP + df.Attack + df.Defense + df.SP_Attack + df.SP_Defense + df.Speed

Now let's view the range of total stats by each generation:

#palette: https://seaborn.pydata.org/tutorial/color_palettes.html?highlight=color
plt.figure(figsize=(13,10), dpi=80)
sns.violinplot(x='Gen', y='total', data=df, scale='width', inner='quartile', palette='Set2') 
plt.title('Violin Plot of Total Stats by Generation', fontsize=22)
plt.show()

total_stats_by_generation

In the above violin plot we can see that each generation has quite a different range of total stats, with Gens IV, VII, & VIII having the widest range, while Gen V had a relatively tight range. All generations from IV onwards had higher medians than the first three generations.
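As a quick numerical check on that reading (a small addition using the same df and its 'Gen'/'total' columns):

# Median total stats per generation, highest first
print(df.groupby('Gen')['total'].median().sort_values(ascending=False))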

Looking at individual stats, Speed is one of the most important stats in competitive play (if not the most important), so let's examine which generations had the best overall Speed stats.

plt.figure(figsize=(13,10), dpi=80)
sns.violinplot(x='Gen', y='Speed', data=df, scale='width', inner='quartile', palette='Set2')

plt.title('Violin Plot of Speed Stats by Generation', fontsize=22)
plt.show()

speed

Here we can clearly see Generation VIII has some of the fastest pokemon ever seen in games. Let's create a function to return the top 10 fastest pokemon in Gen VIII and their respective speed stat values:

def top_n(df, category, n):
    # Return the top n Gen VIII Pokemon ranked by the given stat
    return (df.loc[df['Gen'] == 'VIII']
              .sort_values(category, ascending=False)[['Name', 'Gen', category]]
              .head(n))

print('Top 10 Pokemon Speed')
top_n(df, 'Speed', 10)

speed_gen8

Those are definitely some fast pokemon!

Let's now see if we can get any indication of whether a particular pokemon's type has an advantage over others in total stats.

types_color_dict = {
    'grass':'#8ED752', 'fire':'#F95643', 'water':'#53AFFE', 'bug':"#C3D221", 'normal':"#BBBDAF", \
    'poison': "#AD5CA2", 'electric':"#F8E64E", 'ground':"#F0CA42", 'fairy':"#F9AEFE", \
    'fighting':"#A35449", 'psychic':"#FB61B4", 'rock':"#CDBD72", 'ghost':"#7673DA", \
    'ice':"#66EBFF", 'dragon':"#8B76FF", 'dark':"#1A1A1A", 'steel':"#C3C1D7", 'flying':"#75A4F9" }

plt.figure(figsize=(15,12), dpi=80)
sns.violinplot(x='Primary', y='total', data=df, scale='width', inner='quartile', palette=types_color_dict)

plt.title('Violin Plot of Total Stats by Type', fontsize=20)
plt.show()

total_type_stats

The Dragon type definitely has quite a high upper quartile compared to other types, while the Water & Fairy types seem to have quite a large variance in total stats.

Let's see what the most common type of pokemon is:

Type1 = df['Primary'].value_counts()
sns.set()
dims = (11.7, 8.27) #A4 dimensions
fig, ax = plt.subplots(figsize=dims)
BarT = sns.barplot(x=Type1.index, y=Type1.values, palette=types_color_dict, ax=ax)
BarT.set_xticklabels(BarT.get_xticklabels(), rotation= 90, fontsize=12)
BarT.set(ylabel = 'Freq')
BarT.set_title('Distribution of Primary Pokemon Types')
FigBar = BarT.get_figure()

type_distribution

We can see that Water and Normal are the most frequently appearing primary types in the game.

Let's see how many Pokémon are mono-type vs. dual-type, so we can get a better sense of whether looking at the primary type alone is sufficient.

# Count mono- vs dual-type Pokemon (dual types have a non-null Secondary type, as used later on)
monotype = df['Secondary'].isna().sum()
dualtype = df['Secondary'].notna().sum()

labels = ['Mono type pokemon', 'Dual type pokemon']
sizes = [monotype, dualtype]
colors = ['lightskyblue', 'lightcoral']

patches, texts, _ = plt.pie(sizes, colors=colors, autopct='%1.1f%%', startangle=90, explode=(0,0.1))
plt.legend(patches, labels, loc="best")
plt.axis('equal')
plt.title('Dual-Type Ratio', fontsize=12)
plt.tight_layout()
plt.show()

mono_dual

Looks like there are actually more dual-types than mono-types!

Aside from types, there are also five categories of Pokémon: Regular, Pseudo-Legendary, Sub-Legendary, Legendary, and Mythical. (There are of course also pre-evolutions, final evolutions, mega evolutions, etc., but for the purposes of this analysis we will bundle those under 'Regular', along with Pseudo-Legendaries, which are regular Pokémon with a generally higher base stat total of 600.) Sub-Legendary, Legendary, and Mythical Pokémon typically exhibit two traits:

  1. Rarity: There is usually only 1 of those pokemon available in every game (some may not even be obtainable in certain games)
  2. Stats: These pokemon generally have much higher stats than the average 'regular' pokemon.

Let's create a diverging bar chart to show the rate at which these Pokémon appear in each generation:

#Sub-Legendary, Legendary or Mythical:
df.loc[df["is_sllm"]==False, "sllmid"] = 0
df.loc[df["is_sllm"]==True, "sllmid"] = 1

# calculate the proportion of Sub-Legendary, Legendary & Mythical Pokemon per generation
sllm_ratio = df.groupby("Gen")["sllmid"].mean()
print(sllm_ratio.round(4)*100)

sns.set_style('darkgrid')
df_plot = pd.DataFrame(columns=["Gen", "Rate", "colors"])
x = sllm_ratio.values
df_plot["Gen"] = sllm_ratio.index
df_plot['Rate'] = (x - x.mean())/x.std()  # standardize the rates so they diverge around 0
df_plot['colors'] = ['red' if r < 0 else 'green' for r in df_plot['Rate']]
df_plot.sort_values('Rate', inplace=True)
df_plot.reset_index(inplace=True)

plt.figure(figsize=(14, 10))
plt.hlines(
    y=df_plot.index, xmin=0, xmax=df_plot.Rate,
    color=df_plot.colors,
    alpha=.4,
    linewidth=5)

plt.gca().set(xlabel='Rate', ylabel='Gen')
plt.yticks(df_plot.index, df_plot.Gen, fontsize=12)
plt.title('Diverging Bars of SubL, Legendary & Mythical Rate', fontdict={'size':20})
plt.show()

sub_legendary_mythical_rate

It seems Gen VII's Alola region has a huge share of these legendary & mythical Pokémon. After digging further, this makes perfect sense given the introduction of the 'Ultra Beasts', a plethora of legendaries that only appear in that generation.

Correlations & Descriptive Statistics

Let's move on to exploring some correlations between stats.

#Correlation
Base_stats = ['Primary','Secondary','Classification','%Male','%Female',
              'Height','Weight','Capture_Rate','Base_Steps','HP','Attack','Defense',
              'SP_Attack','SP_Defense','Speed','is_sllm']

df_BS = df[Base_stats]
df_BS.head()
plt.figure(figsize=(14,12))

heatmap = sns.heatmap(df_BS.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap='Blues')  # numeric_only skips the string columns (Primary, Secondary, Classification)

heatmap.set_title('Correlation Base Stats Heatmap', fontdict={'fontsize':15}, pad=12)
plt.show()

correlation_plot

p1 = sns.jointplot(x="SP_Attack",y="SP_Defense",data=df,kind="hex",color="lightgreen")
p1.fig.suptitle("Hex Plot of Special Attack and Special Defense - Some Correlation")
p1.fig.subplots_adjust(top=0.95)
p2 = sns.jointplot(x="Defense",y="SP_Defense",data=df,kind="hex",color="lightblue")
p2.fig.suptitle("Hex Plot of Defense and Special Defense - Some Correlation")
p2.fig.subplots_adjust(top=0.95)
p3 = sns.jointplot(x="SP_Attack",y="Speed",data=df,kind="hex",color="pink")
p3.fig.suptitle("Hex Plot of Special Attack and Speed - Some Correlation")
p3.fig.subplots_adjust(top=0.95)
p4 = sns.jointplot(x="Attack",y="SP_Attack",data=df,kind="hex",color="orange")
p4.fig.suptitle("Hex Plot of Attack and Special Attack - Some Correlation")
p4.fig.subplots_adjust(top=0.95)
p5 = sns.jointplot(x="Attack",y="Defense",data=df,kind="hex",color="purple")
p5.fig.suptitle("Hex Plot of Attack and Defense - Some Correlation")
p5.fig.subplots_adjust(top=0.95)

hex_green hex_blue hex_red hex_orange hex_purple

from pandas import plotting
type1 = list(set(list(df['Primary'])))
cmap = plt.get_cmap('viridis')
colors = [cmap((type1.index(c) + 1) / (len(type1) + 2)) for c in df['Primary'].tolist()]
plotting.scatter_matrix(df.iloc[:, 13:18], figsize=(15, 15), color=colors, alpha=0.7) 
plt.show()

corrplot

import numpy as np
pd.DataFrame(np.corrcoef(df.iloc[:, 13:18].T.values.tolist()), 
             columns=df.iloc[:, 13:18].columns, index=df.iloc[:, 13:18].columns)

corrplot values

labels = ["Defense", "Attack"]
dims = (11.7, 8.27) #a4
fig, ax = plt.subplots(figsize=dims)
# histplot with kde=True replaces the deprecated distplot
Defhist = sns.histplot(df['Defense'], color='g', kde=True, stat='density', ax=ax)
Atthist = sns.histplot(df['Attack'], color='r', kde=True, stat='density', ax=ax)
Atthist.set(title='Distribution of Defense & Attack')
plt.legend(labels, loc="best")
FigHist = Atthist.get_figure()

attack_defense

fig, ax = plt.subplots(2, 3, figsize=(14, 8), sharey=True)

spines = ["top","right","left"]
for i, col in enumerate(["HP", "Attack", "Defense", "SP_Attack", "SP_Defense", "Speed"]):
    sns.kdeplot(x=col, data=df, label=col, ax=ax[i//3][i%3],
                fill=True, color='lightblue', linewidth=2
               )
    
    ax[i//3][i%3].set_xlim(-5, 250)
    
    for s in spines:
        ax[i//3][i%3].spines[s].set_visible(False)
        

plt.tight_layout()
plt.show()

density_plots

df.describe()

std_dev_att_def

Looking at the summary statistics, we can see that the earlier assumption about the variance and skewness of both plots was correct. The 'std' of Attack is smaller than that of Defense, meaning the Defense values are more spread out; similarly, the Sp.Atk 'std' is larger than that of Sp.Def. Skewness can be judged from the positions of the median (50%) and the mean: since in all cases (Attack, Defense, Sp.Attack and Sp.Defense) the mean is greater than the median, the distributions are right-skewed (positively skewed).
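That reading can also be checked directly; a minimal sketch, assuming the same df and stat columns used above:

# Mean vs. median, spread, and skewness of the physical and special stats
print(df[['Attack', 'Defense', 'SP_Attack', 'SP_Defense']].agg(['mean', 'median', 'std', 'skew']))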

Principal Component Analysis (PCA)

Let's analyze the 800+ Pokémon with principal components and plot them in a two-dimensional plane using the first and second principal components. Principal component analysis (PCA) is a multivariate analysis method that is often used for dimensionality reduction.

In this data, the characteristics of the 800+ Pokémon are represented by six 'observed variables' (x1, x2, x3, x4, x5, x6), which we use as explanatory variables. The synthetic variables produced by PCA are called 'principal component scores' and are given by a linear combination, as in the following equation:

yPC1 = a1,1 x1 + a1,2 x2 + a1,3 x3 + a1,4 x4 + a1,5 x5 + a1,6 x6

In principal component analysis, the larger the eigenvalue (i.e. the variance of the principal component scores), the more important that principal component is. PCA is also sometimes regarded as a form of 'unsupervised machine learning', since it reveals the structure of the data itself. Let's start by importing PCA from scikit-learn:

from sklearn.decomposition import PCA
pca = PCA()
pca.fit(df.iloc[:, 13:18])
feature = pca.transform(df.iloc[:, 13:18])
plt.figure(figsize=(15, 15))
plt.scatter(feature[:, 0], feature[:, 1], alpha=0.8)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.grid()
plt.show()

PCA

import matplotlib.ticker as ticker
import numpy as np
plt.gca().get_xaxis().set_major_locator(ticker.MaxNLocator(integer=True))
plt.plot([0] + list( np.cumsum(pca.explained_variance_ratio_)), "-o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative contribution ratio")
plt.grid()
plt.show()

components
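To tie the cumulative plot back to the eigenvalue discussion above, the component variances and their ratios can also be printed directly (a small addition using the already-fitted pca object):

# Eigenvalues (variances of the principal component scores) and their share of total variance
print("Eigenvalues:", pca.explained_variance_)
print("Explained variance ratio:", pca.explained_variance_ratio_)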

Let's see if we can determine what makes a 'legendary' pokemon

pca = PCA()
pca.fit(df.iloc[:, 13:18])
feature = pca.transform(df.iloc[:, 13:18])
plt.figure(figsize=(15, 15))
for binary in [True, False]:
    plt.scatter(feature[df['is_sllm'] == binary, 0], feature[df['is_sllm'] == binary, 1], alpha=0.8, label=binary)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.legend(loc = 'best')
plt.grid()
plt.show()

pca_color

Nice! Although it's not exact, we can clearly see that once the first principal component (PC1) reaches about 50, we get a significantly higher concentration of legendary Pokémon! Now let's illustrate how much each explanatory variable (stat) actually contributes to PC1 with a loading plot.
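The sketch below is one way to draw such a loading plot, mirroring the factor-analysis version later in this section: it scatters the PCA weights (pca.components_) and labels each point with its stat name, reusing the pca object and column slice defined above.

# Loading plot: contribution of each stat to PC1 and PC2
plt.figure(figsize=(8, 8))
for x, y, name in zip(pca.components_[0], pca.components_[1], df.columns[13:18]):
    plt.text(x, y, name)
plt.scatter(pca.components_[0], pca.components_[1])
plt.xlabel("PC1 loading")
plt.ylabel("PC2 loading")
plt.grid()
plt.show()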

components_stats

Assuming that the first principal component (PC1) really is a strong indicator of whether a Pokémon is classified as Legendary, Sub-Legendary, or Mythical, Special Attack appears to be the best indicator of all the stats (followed by physical Attack).

In the second principal component (PC2), Defense and Speed contribute in opposite directions, one positive and one negative.

"Factor Analysis" is a method that is similar to principal component analysis.

In PCA, we synthesized the "principal component" yPC1 which is a linear combination of the weight matrix (eigenvector) a for the explanatory variables. Here, define as many principal components as there are explanatory variables.

yPC1 = a1,1 x1 + a1,2 x2 + a1,3 x3 + a1,4 x4 + a1,5 x5 + a1,6 x6

In factor analysis, by contrast, the explanatory (observed) variables x are assumed to be synthesized from latent variables called "factors": we specify factor scores f, a weight matrix (the factor loadings) w, and unique factors e. (There is no notion of a unique factor in principal component analysis.)

x1 = w1,1 f1 + w1,2 f2 + e1

x2 = w2,1 f1 + w2,2 f2 + e2

x3 = w3,1 f1 + w3,2 f2 + e3

x4 = w4,1 f1 + w4,2 f2 + e4

x5 = w5,1 f1 + w5,2 f2 + e5

x6 = w6,1 f1 + w6,2 f2 + e6

The factor score f is a latent variable specific to each individual (sample). The linear combination of factor scores and factor loadings (w1,1 f1 + w1,2 f2, etc.) is called the "common factor", and adding the "unique factor" e, which is specific to each observed variable, yields the observed variable itself. The number of factors is usually smaller than the number of explanatory variables and must be decided in advance.

(That said, terms such as "common factor" and "factor" can be confusing, since different authors seem to use slightly different definitions.)

from sklearn.decomposition import FactorAnalysis
fa = FactorAnalysis(n_components=2, max_iter=500)
factors = fa.fit_transform(df.iloc[:, 13:18])
plt.figure(figsize=(12, 12))
for binary in [True, False]:
    plt.scatter(factors[df['is_sllm'] == binary, 0], factors[df['is_sllm'] == binary, 1], alpha=0.8, label=binary)
plt.xlabel("Factor 1")
plt.ylabel("Factor 2")
plt.legend(loc = 'best')
plt.grid()
plt.show()

pca_color2

In this case, what determines a 'legendary' is whether the sum of Factor 1 and Factor 2 exceeds a certain level, though the boundary seems to be weighted slightly toward Factor 2. So which stats do Factor 1 and Factor 2 correspond to?

plt.figure(figsize=(8, 8))
for x, y, name in zip(fa.components_[0], fa.components_[1], df.columns[13:18]):
    plt.text(x, y, name)
plt.scatter(fa.components_[0], fa.components_[1])
plt.grid()
plt.xlabel("Factor 1")
plt.ylabel("Factor 2")
plt.show()

component_stats(factor2)

The highest loading on Factor 1 is Defense, and the highest loading on Factor 2 is Special Attack.
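Those maxima can also be read off numerically by tabulating the loadings from the fitted fa object (a small check, not part of the original plots):

# Factor loadings as a table: rows = factors, columns = stats
print(pd.DataFrame(fa.components_, columns=df.columns[13:18], index=['Factor 1', 'Factor 2']))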

Let's create some charts!

First, I created a dendrogram ('dendro' is the Greek word for tree :)) for all Pokémon (the image file is far too large to display clearly).

dfs = df.iloc[:, 13:18].apply(lambda x: (x-x.mean())/x.std(), axis=0)
from scipy.cluster.hierarchy import linkage, dendrogram
result1 = linkage(dfs, 
                  metric = 'euclidean', 
                  method = 'average')
plt.figure(figsize=(15, 150))
dendrogram(result1, orientation='right', labels=list(df['Name']), color_threshold=2)
plt.title("Dedrogram of Pokemon")
plt.xlabel("Threshold")
plt.grid()
plt.show()
def get_cluster_by_number(result, number):
    # Cut the linkage tree into `number` clusters and return a cluster ID for each sample
    output_clusters = []
    x_result, y_result = result.shape
    n_clusters = x_result + 1
    cluster_id = x_result + 1
    father_of = {}
    x1 = []
    y1 = []
    x2 = []
    y2 = []
    for i in range(len(result) - 1):
        n1 = int(result[i][0])
        n2 = int(result[i][1])
        val = result[i][2]
        n_clusters -= 1
        if n_clusters >= number:
            father_of[n1] = cluster_id
            father_of[n2] = cluster_id

        cluster_id += 1

    cluster_dict = {}
    for n in range(x_result + 1):
        if n not in father_of:
            output_clusters.append([n])
            continue

        n2 = n
        m = False
        while n2 in father_of:
            m = father_of[n2]
            #print [n2, m]
            n2 = m

        if m not in cluster_dict:
            cluster_dict.update({m:[]})
        cluster_dict[m].append(n)

    output_clusters += cluster_dict.values()

    output_cluster_id = 0
    output_cluster_ids = [0] * (x_result + 1)
    for cluster in sorted(output_clusters):
        for i in cluster:
            output_cluster_ids[i] = output_cluster_id
        output_cluster_id += 1

    return output_cluster_ids
clusterIDs = get_cluster_by_number(result1, 50)
print(clusterIDs)

cluster_ids

plt.hist(clusterIDs, bins=50)
plt.show()

histo

Here we've created a histogram of the clusters of Pokémon that exhibit similar traits. Since we used 50 bins, there are 50 different clusters of Pokémon, which is quite a lot of charts to display, so I'll just show a few so you get the idea.

cluster4 cluster5 cluster6 cluster8 cluster10 cluster50

Some pokemon exhibit lots of traits similar to each other while others (like Regieleki) stand out.
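For reference, the members of any one cluster can be listed by pairing the cluster IDs with the Pokémon names; a minimal sketch using the clusterIDs computed above (cluster 4 here is just an example):

# Attach the cluster IDs to the dataframe and inspect one cluster's members
df['cluster'] = clusterIDs
print(df.loc[df['cluster'] == 4, 'Name'].tolist())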

Cross Validation & Regression Analysis

Since we saw earlier that Special Attack is a huge contributing factor in determining whether a Pokémon is classified as 'legendary', let's use the rest of the stats to see if we can predict Special Attack!

# First fit: regress the total stat on the individual base stats
X = df.iloc[:, 13:18]
y = df['total']
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, y)

print("Regression Coefficient= ", regr.coef_)
print("Intercept= ", regr.intercept_)
print("Coefficient of Determination= ", regr.score(X, y))
# Explanatory variables: all base stats except SP_Attack (the target)
print(df.columns[[12, 13, 14, 16, 17]])
X = df.iloc[:, [12, 13, 14, 16, 17]]
y = df['SP_Attack']

Cross Validation

In machine learning, performance is evaluated by splitting the known data into training data and test data. A prediction model is built (trained) on the training data, and performance is judged by how accurately it predicts the test data that was not used to build the model. This evaluation method is called 'cross-validation'.

We split the data as follows:

  • Training data (60% of all data): X_train (explanatory variables), y_train (objective variable)
  • Test data (40% of all data): X_test (explanatory variables), y_test (objective variable)

We aim to learn the relationship between X_train and y_train and then predict y_test from X_test. If the model performs well on the training data but poorly on the test data that was not used for training, it is said to be 'overfitted'.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
print("Regression Coefficient= ", regr.coef_)
print("Intercept= ", regr.intercept_)
print("Coefficient of Determination(train)= ", regr.score(X_train, y_train))
print("Coefficient of Determination(test)= ", regr.score(X_test, y_test))
Regression Coefficient=  [ 0.15598049  0.09796333 -0.11115187  0.47986509  0.32513351]
Intercept=  5.4684249031776915
Coefficient of Determination(train)=  0.39594357153305826
Coefficient of Determination(test)=  0.38127048972638855

The above values change with each run because the split into training and test data is random. If you just want a regression equation, the above is sufficient, but by standardizing the explanatory and objective variables before regressing, you obtain 'standardized regression coefficients', which serve as an index of variable importance.

Xs = X.apply(lambda x: (x-x.mean())/x.std(), axis=0)
ys = (y - y.mean()) / y.std()  # standardize the target the same way
from sklearn import linear_model
regr = linear_model.LinearRegression()

regr.fit(Xs, ys)

print("Regression Coefficient= ", regr.coef_)
print("Intercept= ", regr.intercept_)
print("Coefficient of Determination= ", regr.score(Xs, ys))
Regression Coefficient=  [ 0.152545    0.11255532 -0.09718819  0.40725508  0.28208903]
Intercept=  1.1730486200365748e-16
Coefficient of Determination=  0.3958130072204933
pd.DataFrame(regr.coef_, index=list(df.columns[[12, 13, 14, 16, 17]])).sort_values(0, ascending=False).style.bar(subset=[0])

sp attack prediction

It seems that Special Defense & Speed are very important in predicting Special Attack.

Conclusion

Regression analysis, such as multiple regression, uses numerical data as explanatory variables to predict a numerical objective variable. Quantification type I, on the other hand, uses non-numeric categorical data as explanatory variables to predict a numerical objective variable. When the explanatory variables are a mixture of numerical and categorical data, the method is called extended quantification type I.
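As a rough illustration of that idea (a hedged sketch, not part of the analysis above), the categorical Primary type could be one-hot encoded with pd.get_dummies and used to predict the numeric stat total:

# Quantification type I sketch: categorical type -> numeric target via dummy variables
X_cat = pd.get_dummies(df['Primary'])
y_total = df['total']
regr_cat = linear_model.LinearRegression()
regr_cat.fit(X_cat, y_total)
print("Coefficient of Determination= ", regr_cat.score(X_cat, y_total))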

We saw that Special Attack is a strong predictor of whether a Pokémon is legendary or not, and also that Special Defense & Speed are important indicators of Special Attack value.

Overall this was a way of exploring different Pokémon traits while taking multiple factors into account. There's plenty more we could look into, such as strengths, weaknesses, etc. I hope you all enjoyed this, and thanks for reading all the way through!
