Overview

Kaggler

Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis. It is distributed under the MIT License.

Its online learning algorithms are inspired by Kaggle user tinrtgu's code. They use a sparse input format that handles large sparse data efficiently, and the core code is optimized for speed with Cython.

Installation

Dependencies

The required Python packages are listed in requirements.txt:

  • cython
  • h5py
  • hyperopt
  • lightgbm
  • ml_metrics
  • numpy/scipy
  • pandas
  • scikit-learn

Using pip

The package is available on PyPI and can be installed with pip:

pip install -U Kaggler

If installation fails because it cannot find MurmurHash3.h, please add . to LD_LIBRARY_PATH.
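
For example, in a POSIX shell, run the following from the directory containing MurmurHash3.h before retrying pip (a sketch of the workaround above, not an official instruction):

export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH
pip install -U Kaggler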

From source code

If you want to install it from source code:

python setup.py build_ext --inplace
python setup.py install

Feature Engineering

One-Hot, Label, Target, Frequency, and Embedding Encoders for Categorical Features

import pandas as pd
from kaggler.preprocessing import OneHotEncoder, LabelEncoder, TargetEncoder, FrequencyEncoder, EmbeddingEncoder

trn = pd.read_csv('train.csv')
target_col = trn.columns[-1]
cat_cols = [col for col in trn.columns if trn[col].dtype == 'object']

ohe = OneHotEncoder(min_obs=100)  # group all categories with fewer than 100 occurrences
lbe = LabelEncoder(min_obs=100)   # group all categories with fewer than 100 occurrences
te = TargetEncoder()              # replace each category with the average target value of the category
fe = FrequencyEncoder()           # replace each category with the frequency of the category
ee = EmbeddingEncoder()           # map each category to a vector of real numbers

X_ohe = ohe.fit_transform(trn[cat_cols])                  # X_ohe is a scipy sparse matrix
trn[cat_cols] = lbe.fit_transform(trn[cat_cols])
trn[cat_cols] = te.fit_transform(trn[cat_cols], trn[target_col])
trn[cat_cols] = fe.fit_transform(trn[cat_cols])
X_ee = ee.fit_transform(trn[cat_cols], trn[target_col])   # X_ee is a numpy matrix

tst = pd.read_csv('test.csv')
X_ohe = ohe.transform(tst[cat_cols])
tst[cat_cols] = lbe.transform(tst[cat_cols])
tst[cat_cols] = te.transform(tst[cat_cols])
tst[cat_cols] = fe.transform(tst[cat_cols])
X_ee = ee.transform(tst[cat_cols])

Denoising AutoEncoder (DAE)

For a reference on DAE, see Vincent et al. (2010), "Stacked Denoising Autoencoders".

import pandas as pd
from kaggler.preprocessing import DAE, SDAE

trn = pd.read_csv('train.csv')
tst = pd.read_csv('test.csv')
target_col = trn.columns[-1]
cat_cols = [col for col in trn.columns if trn[col].dtype == 'object']
num_cols = [col for col in trn.columns if col not in cat_cols + [target_col]]

# Default DAE with only the swapping noise and a single encoder/decoder pair.
dae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_encoding=128)
X = dae.fit_transform(pd.concat([trn, tst], axis=0))    # encode the input features into encoding vectors of size 128

# Stacked DAE with Gaussian noise, swapping noise, and zero masking in 3 encoder/decoder pairs.
sdae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_encoding=128, n_layer=3,
           noise_std=.05, swap_prob=.2, mask_prob=.1)
X = sdae.fit_transform(pd.concat([trn, tst], axis=0))

# Supervised DAE with Gaussian noise, swapping noise, and zero masking, using 3 encoders in the encoder/decoder pair.
sdae = SDAE(cat_cols=cat_cols, num_cols=num_cols, n_encoding=128, n_encoder=3,
            noise_std=.05, swap_prob=.2, mask_prob=.1)
X = sdae.fit_transform(trn, trn[target_col])

AutoML

Feature Selection & Hyperparameter Tuning

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from kaggler.metrics import auc
from kaggler.model import AutoLGB


RANDOM_SEED = 42
N_OBS = 10000
N_FEATURE = 100
N_IMP_FEATURE = 20

X, y = make_classification(n_samples=N_OBS,
                           n_features=N_FEATURE,
                           n_informative=N_IMP_FEATURE,
                           random_state=RANDOM_SEED)
X = pd.DataFrame(X, columns=['x{}'.format(i) for i in range(X.shape[1])])
y = pd.Series(y)

X_trn, X_tst, y_trn, y_tst = train_test_split(X, y,
                                              test_size=.2,
                                              random_state=RANDOM_SEED)

model = AutoLGB(objective='binary', metric='auc')
model.tune(X_trn, y_trn)
model.fit(X_trn, y_trn)
p = model.predict(X_tst)
print('AUC: {:.4f}'.format(auc(y_tst, p)))
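
After tune(), the selected features and tuned hyperparameters can be inspected. The attribute names below are an assumption based on kaggler/model/automl.py rather than a documented API, so verify them against the source:

print(model.features)   # features selected during tuning
print(model.params)     # tuned LightGBM hyperparameters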

Ensemble

Netflix Blending

import numpy as np
from kaggler.ensemble import netflix
from kaggler.metrics import rmse

# Load the predictions of input models for ensemble
p1 = np.loadtxt('model1_prediction.txt')
p2 = np.loadtxt('model2_prediction.txt')
p3 = np.loadtxt('model3_prediction.txt')

# Calculate RMSEs of model predictions and the all-zero prediction.
# In a competition, RMSEs (or RMSLEs) of submissions can be used.
y = np.loadtxt('target.txt')
e0 = rmse(y, np.zeros_like(y))
e1 = rmse(y, p1)
e2 = rmse(y, p2)
e3 = rmse(y, p3)

p, w = netflix([e1, e2, e3], [p1, p2, p3], e0, l=0.0001) # l is an optional regularization parameter.
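
Here p is the blended prediction and w holds the estimated blending weights of the input models (inferred from the unpacking above; check kaggler.ensemble for details).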

Algorithms

The algorithms currently available are as follows:

Online learning algorithms

  • Stochastic Gradient Descent (SGD)
  • Follow-the-Regularized-Leader (FTRL)
  • Factorization Machine (FM)
  • Neural Networks (NN) - with a single (NN) or two (NN_H2) ReLU hidden layers
  • Decision Tree

Batch learning algorithm

  • Neural Networks (NN) - with a single hidden layer and L-BFGS optimization

Examples

from kaggler.online_model import SGD, FTRL, FM, NN

# SGD
clf = SGD(a=.01,                # learning rate
          l1=1e-6,              # L1 regularization parameter
          l2=1e-6,              # L2 regularization parameter
          n=2**20,              # number of hashed features
          epoch=10,             # number of epochs
          interaction=True)     # use feature interaction or not

# FTRL
clf = FTRL(a=.1,                # alpha in the per-coordinate learning rate
           b=1,                 # beta in the per-coordinate learning rate
           l1=1.,               # L1 regularization parameter
           l2=1.,               # L2 regularization parameter
           n=2**20,             # number of hashed features
           epoch=1,             # number of epochs
           interaction=True)    # use feature interaction or not
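
# Note: assuming the standard FTRL-Proximal formulation, the per-coordinate
# learning rate of feature i is a / (b + sqrt(sum of squared gradients of i)),
# so a scales the rate and b keeps it stable early in training.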

# FM
clf = FM(n=1e5,                 # number of features
         epoch=100,             # number of epochs
         dim=4,                 # size of factors for interactions
         a=.01)                 # learning rate

# NN
clf = NN(n=1e5,                 # number of features
         epoch=10,              # number of epochs
         h=16,                  # number of hidden units
         a=.1,                  # learning rate
         l2=1e-6)               # L2 regularization parameter

# online training and prediction directly with a libsvm file
for x, y in clf.read_sparse('train.sparse'):
    p = clf.predict_one(x)      # predict for an input
    clf.update_one(x, p - y)    # update the model with the prediction error

for x, _ in clf.read_sparse('test.sparse'):
    p = clf.predict_one(x)

# online training and prediction with a scipy sparse matrix
from kaggler import load_data

X, y = load_data('train.sps')

clf.fit(X, y)
p = clf.predict(X)

Data I/O

Kaggler supports CSV (.csv), LibSVM (.sps), and HDF5 (.h5) file formats:

# CSV format: target,feature1,feature2,...
1,1,0,0,1,0.5
0,0,1,0,0,5

# LibSVM format: target feature-index1:feature-value1 feature-index2:feature-value2
1 1:1 4:1 5:0.5
0 2:1 5:1

# HDF5
- issparse: binary flag indicating whether it stores sparse data or not.
- target: stores a target variable as a numpy.array
- shape: available only if issparse == 1. shape of scipy.sparse.csr_matrix
- indices: available only if issparse == 1. indices of scipy.sparse.csr_matrix
- indptr: available only if issparse == 1. indptr of scipy.sparse.csr_matrix
- data: dense feature matrix if issparse == 0 else data of scipy.sparse.csr_matrix

from kaggler.data_io import load_data, save_data

X, y = load_data('train.csv')	# use the first column as a target variable
X, y = load_data('train.h5')	# load the feature matrix and target vector from a HDF5 file.
X, y = load_data('train.sps')	# load the feature matrix and target vector from LibSVM file.

save_data(X, y, 'train.csv')
save_data(X, y, 'train.h5')
save_data(X, y, 'train.sps')
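
If you need to read the HDF5 file outside of Kaggler, below is a minimal sketch based on the layout described above (dataset names are assumed from that list; verify against kaggler/data_io.py):

import h5py
from scipy import sparse

with h5py.File('train.h5', 'r') as f:
    y = f['target'][:]                      # target variable as a numpy array
    if f['issparse'][()]:
        # rebuild the scipy.sparse.csr_matrix from its stored components
        X = sparse.csr_matrix((f['data'][:], f['indices'][:], f['indptr'][:]),
                              shape=tuple(f['shape'][:]))
    else:
        X = f['data'][:]                    # dense feature matrix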

Documentation

Package documentation is available here.

Comments
  • make FTRL more c-style and faster

    It's about 10% ~ 30% faster when interaction=False.

    You may use the following script to profile the performance. Before compiling, add # cython: linetrace=True at the top of ftrl.pyx.

    import cProfile
    
    import numpy as np
    np.random.seed(1234)
    import scipy.sparse as sps
    
    from kaggler.online_model import FTRL
    
    
    DATA_NUM = int(5e7)
    
    
    class customCSR(object):
        def __init__(self, csr_matrix):
            self.data = []
            self.shape = csr_matrix.shape
            for row in range(self.shape[0]):
                self.data.append(csr_matrix[row])
        def __getitem__(self, idx):
            return self.data[idx]
    
    
    def main():
        print('create y...')
        y = np.random.randint(0, 1, DATA_NUM)
        print('create x...')
        row = np.random.randint(0, 100000, DATA_NUM)
        col = np.random.randint(0, 10, DATA_NUM)
        data = np.ones(DATA_NUM)
        x = sps.csr_matrix((data, (row, col)), dtype=np.int8)
        x = customCSR(x)
        
        print('train...')
        profiler = cProfile.Profile(subcalls=True, builtins=True, timeunit=0.001,)
        clf = FTRL(interaction=False)
        profiler.enable()
        clf.fit(x, y)
        profiler.disable()
        profiler.print_stats()
        print(clf.predict(x))
    
    
    if __name__ == '__main__':
        main()
    
    opened by stegben 8
  • Set embedding layer to n_uniq + 1

    Hi @jeongyoonlee, I am getting the following error when using EmbeddingEncoder():

    InvalidArgumentError: indices[389,0] = 3 is not in [0, 3) [[{{node prior_rider_segment_emb/embedding_lookup}}]]

    I might be wrong, but I think this happens because the index starts from 0. Could we set the embedding layer size to n_uniq + 1 to handle the out-of-bound index error?

    opened by ppstacy 7
  • Use MurmurHash3 for interaction features

    #23 MurmurHash3 is used by sklearn's FeatureHasher, which is fast and robust enough.

    Before

             3 function calls in 9.690 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    0.000    0.000 base.py:99(get_shape)
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
            1    9.690    9.690    9.690    9.690 {method 'fit' of 'kaggler.online_model.ftrl.FTRL' objects}
    
    

    After

             3 function calls in 2.265 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    0.000    0.000 base.py:99(get_shape)
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
            1    2.265    2.265    2.265    2.265 {method 'fit' of 'kaggler.online_model.ftrl.FTRL' objects}
    
    
    opened by stegben 6
  • macos pip install failure

    Within a Python 3.6 environment (managed by conda), when I run pip install Kaggler, pip install -U Kaggler, or pip install --no-cache-dir Kaggler, I get the error message below. I also ran pip install -U cython beforehand to update Cython, but the same error occurs.

    (python3.6) mike-yung$ pip install -no-cache-dir Kaggler
    
    Usage:
      pip install [options] <requirement specifier> [package-index-options] ...
      pip install [options] -r <requirements file> [package-index-options] ...
      pip install [options] [-e] <vcs project url> ...
      pip install [options] [-e] <local project path> ...
      pip install [options] <archive url/path> ...
    
    no such option: -n
    (python3.6) mike-yung-C02WC0F4HTDG:ltvent mike.yung$ pip install --no-cache-dir Kaggler
    Looking in indexes: https://yoober7:****@pypi.uberinternal.com/index, https://pypi.python.org/simple
    Collecting Kaggler
      Downloading https://pypi.uberinternal.com/packages/af/98/25d2c773369ba56b2e70e584f5ab4ab1ed1708df6ec8dcc153d77f03607e/Kaggler-0.6.9.tar.gz (812kB)
        100% |████████████████████████████████| 819kB 14.3MB/s
    Requirement already satisfied: setuptools>=41.0.0 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from Kaggler) (41.0.1)
    Requirement already satisfied: cython in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from Kaggler) (0.29.7)
    Requirement already satisfied: h5py in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from Kaggler) (2.9.0)
    Requirement already satisfied: ml_metrics in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from Kaggler) (0.1.4)
    Requirement already satisfied: numpy in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from Kaggler) (1.16.2)
    Requirement already satisfied: pandas in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from Kaggler) (0.24.2)
    Requirement already satisfied: matplotlib in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from Kaggler) (2.2.4)
    Requirement already satisfied: scipy>=0.14.0 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from Kaggler) (1.2.1)
    Requirement already satisfied: scikit-learn>=0.15.0 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from Kaggler) (0.20.3)
    Requirement already satisfied: statsmodels>=0.5.0 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from Kaggler) (0.9.0)
    Requirement already satisfied: kaggle in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from Kaggler) (1.5.3)
    Requirement already satisfied: tensorflow in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from Kaggler) (1.13.1)
    Requirement already satisfied: keras in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from Kaggler) (2.2.4)
    Requirement already satisfied: six in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from h5py->Kaggler) (1.12.0)
    Requirement already satisfied: pytz>=2011k in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from pandas->Kaggler) (2018.9)
    Requirement already satisfied: python-dateutil>=2.5.0 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from pandas->Kaggler) (2.8.0)
    Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from matplotlib->Kaggler) (2.3.1)
    Requirement already satisfied: kiwisolver>=1.0.1 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from matplotlib->Kaggler) (1.0.1)
    Requirement already satisfied: cycler>=0.10 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from matplotlib->Kaggler) (0.10.0)
    Requirement already satisfied: requests in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from kaggle->Kaggler) (2.21.0)
    Requirement already satisfied: urllib3<1.25,>=1.21.1 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from kaggle->Kaggler) (1.24.1)
    Requirement already satisfied: python-slugify in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from kaggle->Kaggler) (3.0.2)
    Requirement already satisfied: certifi in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from kaggle->Kaggler) (2019.3.9)
    Requirement already satisfied: tqdm in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from kaggle->Kaggler) (4.32.1)
    Requirement already satisfied: wheel>=0.26 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorflow->Kaggler) (0.33.1)
    Requirement already satisfied: tensorboard<1.14.0,>=1.13.0 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorflow->Kaggler) (1.13.1)
    Requirement already satisfied: gast>=0.2.0 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorflow->Kaggler) (0.2.2)
    Requirement already satisfied: tensorflow-estimator<1.14.0rc0,>=1.13.0 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorflow->Kaggler) (1.13.0)
    Requirement already satisfied: astor>=0.6.0 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorflow->Kaggler) (0.7.1)
    Requirement already satisfied: termcolor>=1.1.0 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorflow->Kaggler) (1.1.0)
    Requirement already satisfied: keras-preprocessing>=1.0.5 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorflow->Kaggler) (1.0.9)
    Requirement already satisfied: protobuf>=3.6.1 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorflow->Kaggler) (3.7.1)
    Requirement already satisfied: grpcio>=1.8.6 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorflow->Kaggler) (1.20.1)
    Requirement already satisfied: absl-py>=0.1.6 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorflow->Kaggler) (0.7.1)
    Requirement already satisfied: keras-applications>=1.0.6 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorflow->Kaggler) (1.0.7)
    Requirement already satisfied: pyyaml in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from keras->Kaggler) (5.1)
    Requirement already satisfied: idna<2.9,>=2.5 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from requests->kaggle->Kaggler) (2.8)
    Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from requests->kaggle->Kaggler) (3.0.4)
    Requirement already satisfied: text-unidecode==1.2 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from python-slugify->kaggle->Kaggler) (1.2)
    Requirement already satisfied: werkzeug>=0.11.15 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorboard<1.14.0,>=1.13.0->tensorflow->Kaggler) (0.14.1)
    Requirement already satisfied: markdown>=2.6.8 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorboard<1.14.0,>=1.13.0->tensorflow->Kaggler) (3.1)
    Requirement already satisfied: mock>=2.0.0 in /anaconda2/envs/python3.6/lib/python3.6/site-packages (from tensorflow-estimator<1.14.0rc0,>=1.13.0->tensorflow->Kaggler) (3.0.5)
    Installing collected packages: Kaggler
      Running setup.py install for Kaggler ... error
        Complete output from command /anaconda2/envs/python3.6/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/1y/btgkmt992l94_1d37rkvhc380000gn/T/pip-install-ttr3it94/Kaggler/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/1y/btgkmt992l94_1d37rkvhc380000gn/T/pip-record-g7a_hyv1/install-record.txt --single-version-externally-managed --compile:
        running install
        running build
        running build_py
        creating build
        creating build/lib.macosx-10.7-x86_64-3.6
        creating build/lib.macosx-10.7-x86_64-3.6/kaggler
        copying kaggler/data_io.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler
        copying kaggler/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler
        copying kaggler/const.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler
        creating build/lib.macosx-10.7-x86_64-3.6/kaggler/feature_selection
        copying kaggler/feature_selection/feature_selection.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/feature_selection
        copying kaggler/feature_selection/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/feature_selection
        creating build/lib.macosx-10.7-x86_64-3.6/kaggler/ensemble
        copying kaggler/ensemble/linear.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/ensemble
        copying kaggler/ensemble/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/ensemble
        creating build/lib.macosx-10.7-x86_64-3.6/kaggler/model
        copying kaggler/model/nn.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/model
        copying kaggler/model/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/model
        creating build/lib.macosx-10.7-x86_64-3.6/kaggler/metrics
        copying kaggler/metrics/regression.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/metrics
        copying kaggler/metrics/classification.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/metrics
        copying kaggler/metrics/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/metrics
        creating build/lib.macosx-10.7-x86_64-3.6/kaggler/online_model
        copying kaggler/online_model/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/online_model
        creating build/lib.macosx-10.7-x86_64-3.6/kaggler/preprocessing
        copying kaggler/preprocessing/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/preprocessing
        copying kaggler/preprocessing/data.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/preprocessing
        creating build/lib.macosx-10.7-x86_64-3.6/kaggler/test
        copying kaggler/test/test_sgd.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/test
        copying kaggler/test/test_ftrl.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/test
        copying kaggler/test/test_lbe.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/test
        copying kaggler/test/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/test
        copying kaggler/test/test_ohe.py -> build/lib.macosx-10.7-x86_64-3.6/kaggler/test
        running build_ext
        skipping 'kaggler/online_model/ftrl.c' Cython extension (up-to-date)
        building 'kaggler.online_model.ftrl' extension
        creating build/temp.macosx-10.7-x86_64-3.6
        creating build/temp.macosx-10.7-x86_64-3.6/kaggler
        creating build/temp.macosx-10.7-x86_64-3.6/kaggler/online_model
        creating build/temp.macosx-10.7-x86_64-3.6/kaggler/online_model/murmurhash
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda2/envs/python3.6/include -arch x86_64 -I/anaconda2/envs/python3.6/include -arch x86_64 -I. -I/anaconda2/envs/python3.6/include/python3.6m -I/anaconda2/envs/python3.6/lib/python3.6/site-packages/numpy/core/include -c kaggler/online_model/ftrl.c -o build/temp.macosx-10.7-x86_64-3.6/kaggler/online_model/ftrl.o -O3
        In file included from kaggler/online_model/ftrl.c:594:
        In file included from /anaconda2/envs/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/arrayobject.h:4:
        In file included from /anaconda2/envs/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/ndarrayobject.h:12:
        In file included from /anaconda2/envs/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/ndarraytypes.h:1824:
        /anaconda2/envs/python3.6/lib/python3.6/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: "Using deprecated NumPy API, disable it with "          "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
        #warning "Using deprecated NumPy API, disable it with " \
         ^
        1 warning generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/anaconda2/envs/python3.6/include -arch x86_64 -I/anaconda2/envs/python3.6/include -arch x86_64 -I. -I/anaconda2/envs/python3.6/include/python3.6m -I/anaconda2/envs/python3.6/lib/python3.6/site-packages/numpy/core/include -c kaggler/online_model/murmurhash/MurmurHash3.cpp -o build/temp.macosx-10.7-x86_64-3.6/kaggler/online_model/murmurhash/MurmurHash3.o -O3
        warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
        1 warning generated.
        g++ -bundle -undefined dynamic_lookup -L/anaconda2/envs/python3.6/lib -arch x86_64 -L/anaconda2/envs/python3.6/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.6/kaggler/online_model/ftrl.o build/temp.macosx-10.7-x86_64-3.6/kaggler/online_model/murmurhash/MurmurHash3.o -o build/lib.macosx-10.7-x86_64-3.6/kaggler/online_model/ftrl.cpython-36m-darwin.so
        clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
        ld: library not found for -lstdc++
        clang: error: linker command failed with exit code 1 (use -v to see invocation)
        error: command 'g++' failed with exit status 1
    
        ----------------------------------------
    Command "/anaconda2/envs/python3.6/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/1y/btgkmt992l94_1d37rkvhc380000gn/T/pip-install-ttr3it94/Kaggler/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/1y/btgkmt992l94_1d37rkvhc380000gn/T/pip-record-g7a_hyv1/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/1y/btgkmt992l94_1d37rkvhc380000gn/T/pip-install-ttr3it94/Kaggler/
    
    opened by yungmsh 5
  • pip install on ubuntu 16.04

    Running pip install kaggler on Ubuntu 16.04 produces:

    creating build/temp.linux-x86_64-2.7/kaggler/online_model/murmurhash
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/local/lib/python2.7/dist-packages/numpy/core/include -I. -I/usr/include/python2.7 -c kaggler/online_model/ftrl.c -o build/temp.linux-x86_64-2.7/kaggler/online_model/ftrl.o -O3
    In file included from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarraytypes.h:1788:0,
                     from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/ndarrayobject.h:18,
                     from /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/arrayobject.h:4,
                     from kaggler/online_model/ftrl.c:275:
    /usr/local/lib/python2.7/dist-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
     #warning "Using deprecated NumPy API, disable it by " \
      ^
    kaggler/online_model/ftrl.c:277:36: fatal error: murmurhash/MurmurHash3.h: No such file or directory
    compilation terminated.
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    

    And I have found no way to install MurmurHash3.h on Ubuntu 16.04.

    opened by zilion22 4
  • ValueError: For early stopping, at least one dataset and eval metric is required for evaluation

    When I run AutoLGB with objective="regression" and metric="neg_mean_absolute_error", I get a ValueError: For early stopping, at least one dataset and eval metric is required for evaluation. Here is the complete stacktrace:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-13-4052be1dbbca> in <module>
          3 model = AutoLGB(metric="neg_mean_absolute_error", 
          4                 objective="regression")
    ----> 5 model.tune(X_train, y_train)
          6 model.fit(X_train, y_train)
    
    /opt/conda/lib/python3.6/site-packages/kaggler/model/automl.py in tune(self, X, y)
        114             self.features = self.select_features(X_s,
        115                                                  y_s,
    --> 116                                                  n_eval=self.n_fs)
        117             logger.info('selecting {} out of {} features'.format(
        118                 len(self.features), X.shape[1])
    
    /opt/conda/lib/python3.6/site-packages/kaggler/model/automl.py in select_features(self, X, y, n_eval)
        164             random_cols.append(random_col)
        165 
    --> 166         _, trials = self.optimize_hyperparam(X.values, y.values, n_eval=n_eval)
        167 
        168         feature_importances = self._get_feature_importance(
    
    /opt/conda/lib/python3.6/site-packages/kaggler/model/automl.py in optimize_hyperparam(self, X, y, test_size, n_eval)
        258         best = hyperopt.fmin(fn=objective, space=self.space, trials=trials,
        259                              algo=tpe.suggest, max_evals=n_eval, verbose=1,
    --> 260                              rstate=self.random_state)
        261 
        262         hyperparams = space_eval(self.space, best)
    
    /opt/conda/lib/python3.6/site-packages/hyperopt/fmin.py in fmin(fn, space, algo, max_evals, trials, rstate, allow_trials_fmin, pass_expr_memo_ctrl, catch_eval_exceptions, verbose, return_argmin, points_to_evaluate, max_queue_len, show_progressbar)
        387             catch_eval_exceptions=catch_eval_exceptions,
        388             return_argmin=return_argmin,
    --> 389             show_progressbar=show_progressbar,
        390         )
        391 
    
    /opt/conda/lib/python3.6/site-packages/hyperopt/base.py in fmin(self, fn, space, algo, max_evals, max_queue_len, rstate, verbose, pass_expr_memo_ctrl, catch_eval_exceptions, return_argmin, show_progressbar)
        641             catch_eval_exceptions=catch_eval_exceptions,
        642             return_argmin=return_argmin,
    --> 643             show_progressbar=show_progressbar)
        644 
        645 
    
    /opt/conda/lib/python3.6/site-packages/hyperopt/fmin.py in fmin(fn, space, algo, max_evals, trials, rstate, allow_trials_fmin, pass_expr_memo_ctrl, catch_eval_exceptions, verbose, return_argmin, points_to_evaluate, max_queue_len, show_progressbar)
        406                     show_progressbar=show_progressbar)
        407     rval.catch_eval_exceptions = catch_eval_exceptions
    --> 408     rval.exhaust()
        409     if return_argmin:
        410         return trials.argmin
    
    /opt/conda/lib/python3.6/site-packages/hyperopt/fmin.py in exhaust(self)
        260     def exhaust(self):
        261         n_done = len(self.trials)
    --> 262         self.run(self.max_evals - n_done, block_until_done=self.asynchronous)
        263         self.trials.refresh()
        264         return self
    
    /opt/conda/lib/python3.6/site-packages/hyperopt/fmin.py in run(self, N, block_until_done)
        225                     else:
        226                         # -- loop over trials and do the jobs directly
    --> 227                         self.serial_evaluate()
        228 
        229                     try:
    
    /opt/conda/lib/python3.6/site-packages/hyperopt/fmin.py in serial_evaluate(self, N)
        139                 ctrl = base.Ctrl(self.trials, current_trial=trial)
        140                 try:
    --> 141                     result = self.domain.evaluate(spec, ctrl)
        142                 except Exception as e:
        143                     logger.info('job exception: %s' % str(e))
    
    /opt/conda/lib/python3.6/site-packages/hyperopt/base.py in evaluate(self, config, ctrl, attach_attachments)
        846                 memo=memo,
        847                 print_node_on_error=self.rec_eval_print_node_on_error)
    --> 848             rval = self.fn(pyll_rval)
        849 
        850         if isinstance(rval, (float, int, np.number)):
    
    /opt/conda/lib/python3.6/site-packages/kaggler/model/automl.py in objective(hyperparams)
        248                               valid_data,
        249                               early_stopping_rounds=self.n_stop,
    --> 250                               verbose_eval=0)
        251 
        252             score = (model.best_score["valid_0"][self.params["metric"]] *
    
    /opt/conda/lib/python3.6/site-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
        231                                         begin_iteration=init_iteration,
        232                                         end_iteration=init_iteration + num_boost_round,
    --> 233                                         evaluation_result_list=evaluation_result_list))
        234         except callback.EarlyStopException as earlyStopException:
        235             booster.best_iteration = earlyStopException.best_iteration + 1
    
    /opt/conda/lib/python3.6/site-packages/lightgbm/callback.py in _callback(env)
        209     def _callback(env):
        210         if not cmp_op:
    --> 211             _init(env)
        212         if not enabled[0]:
        213             return
    
    /opt/conda/lib/python3.6/site-packages/lightgbm/callback.py in _init(env)
        190             return
        191         if not env.evaluation_result_list:
    --> 192             raise ValueError('For early stopping, '
        193                              'at least one dataset and eval metric is required for evaluation')
        194 
    
    ValueError: For early stopping, at least one dataset and eval metric is required for evaluation
    

    The pandas version is 0.23.4 and the lightgbm version is 2.2.3. Might the error be due to the lightgbm version?

    opened by yassineAlouini 3
  • DAE References and Performance

    Hi @jeongyoonlee, I saw you added the DAE in the recent release! I didn't find many references for DAE, so I'm wondering if you could share a bit more. Additionally, how do we decide the probability of adding swap noise to features? Is there any rule of thumb to follow?

    I assume DAE will perform better on certain datasets with noise in the features, so do you by any chance have some examples to share, potentially comparing its performance with the other feature engineering methods in the package?

    Thanks a lot!!

    question 
    opened by ppstacy 2
  • LabelEncoder Usage

    Hi, The following piece of code throws an error. Why?

    import pandas as pd
    from kaggler.preprocessing import LabelEncoder

    le = LabelEncoder()
    le.fit_transform(pd.Series([1,1,1,2,2,2,3,3,3]))
    

    Error:

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    c:\Users\semic\Desktop\dsi19-oct\main.py in <module>
          1 le = LabelEncoder()
    ----> 2 le.fit_transform(pd.Series([1,1,1,2,2,2,3,3,3]))
    
    ~\Anaconda3\lib\site-packages\kaggler\preprocessing\categorical.py in fit_transform(self, X, y)
        121         """
        122 
    --> 123         self.label_encoders = [None] * X.shape[1]
        124         self.label_maxes = [None] * X.shape[1]
        125 
    
    IndexError: tuple index out of range
    
    opened by r0f1 2
  • use faster csr indexing

    It's a simple change, but I find it gives a huge performance improvement (tens of times faster). I use the following code to profile:

    import cProfile
    
    import numpy as np
    np.random.seed(1234)
    import scipy.sparse as sps
    
    from kaggler.online_model import FTRL
    
    
    DATA_NUM = int(1e6)
    
    
    def main():
        print('create y...')
        y = np.random.randint(0, 1, DATA_NUM)
        print('create x...')
        row = np.random.randint(0, 300000, DATA_NUM)
        col = np.random.randint(0, 10, DATA_NUM)
        data = np.ones(DATA_NUM)
        x = sps.csr_matrix((data, (row, col)), dtype=np.int8)
    
        print('train...')
        profiler = cProfile.Profile(subcalls=True, builtins=True, timeunit=0.001,)
        clf = FTRL(interaction=True)
        profiler.enable()
        clf.fit(x, y)
        profiler.disable()
        profiler.print_stats()
    
    
    if __name__ == '__main__':
        main()
    

    And the profile result before:

             32400004 function calls (31800004 primitive calls) in 28.852 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       600000    0.207    0.000    0.681    0.000 <frozen importlib._bootstrap>:996(_handle_fromlist)
       900000    0.265    0.000    0.372    0.000 base.py:1081(isspmatrix)
      1200000    0.422    0.000    0.904    0.000 base.py:181(nnz)
       300000    0.202    0.000    0.202    0.000 base.py:70(__init__)
       300000    0.496    0.000    0.526    0.000 base.py:77(set_shape)
      1500001    0.253    0.000    0.253    0.000 base.py:99(get_shape)
       300000    1.071    0.000    2.149    0.000 compressed.py:1021(prune)
       300000    3.486    0.000    8.797    0.000 compressed.py:127(check_format)
       300000    1.626    0.000   15.347    0.000 compressed.py:24(__init__)
      1200000    0.482    0.000    0.482    0.000 compressed.py:99(getnnz)
       900000    0.166    0.000    0.166    0.000 csr.py:231(_swap)
       300000    0.697    0.000   25.280    0.000 csr.py:236(__getitem__)
       300000    0.577    0.000   20.043    0.000 csr.py:368(_get_row_slice)
       300000    1.270    0.000   19.240    0.000 csr.py:411(_get_submatrix)
       600000    0.509    0.000    1.006    0.000 csr.py:416(process_slice)
       600000    0.236    0.000    0.236    0.000 csr.py:439(check_bounds)
       300000    0.145    0.000    0.347    0.000 data.py:22(__init__)
          2/1    0.000    0.000    0.000    0.000 ftrl.pyx:125(fit)
    599999/300000    1.800    0.000    0.268    0.000 ftrl.pyx:156(update_one)
    599999/300000    1.772    0.000    0.457    0.000 ftrl.pyx:176(predict_one)
       600000    1.058    0.000    1.058    0.000 getlimits.py:245(__init__)
       600000    0.248    0.000    0.248    0.000 getlimits.py:270(max)
      2100000    0.929    0.000    1.622    0.000 numeric.py:414(asarray)
       600000    1.730    0.000    3.873    0.000 sputils.py:119(get_index_dtype)
       900000    1.181    0.000    1.883    0.000 sputils.py:188(isintlike)
       300000    0.936    0.000    0.936    0.000 sputils.py:200(isshape)
       900000    0.499    0.000    0.703    0.000 sputils.py:215(issequence)
       300000    1.193    0.000    3.067    0.000 sputils.py:265(_unpack_index)
       300000    0.118    0.000    0.149    0.000 sputils.py:293(_check_ellipsis)
       300000    0.642    0.000    1.205    0.000 sputils.py:331(_check_boolean)
       300000    0.274    0.000    0.741    0.000 sputils.py:91(to_native)
       600000    0.474    0.000    0.474    0.000 {built-in method builtins.hasattr}
      6000000    0.698    0.000    0.698    0.000 {built-in method builtins.isinstance}
      3000000    0.252    0.000    0.252    0.000 {built-in method builtins.len}
       300000    0.138    0.000    0.138    0.000 {built-in method builtins.max}
      3000000    1.155    0.000    1.155    0.000 {built-in method numpy.core.multiarray.array}
       300000    1.340    0.000    1.340    0.000 {built-in method scipy.sparse._sparsetools.get_csr_submatrix}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
          2/1    0.000    0.000    0.000    0.000 {method 'fit' of 'kaggler.online_model.ftrl.FTRL' objects}
       300000    0.121    0.000    0.121    0.000 {method 'indices' of 'slice' objects}
       300000    0.185    0.000    0.185    0.000 {method 'newbyteorder' of 'numpy.dtype' objects}
    

    and after:

             1200004 function calls (600004 primitive calls) in 2.284 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    0.000    0.000 base.py:99(get_shape)
          2/1    0.000    0.000    0.000    0.000 ftrl.pyx:125(fit)
    599999/300000    1.081    0.000    0.392    0.000 ftrl.pyx:156(update_one)
    599999/300000    1.203    0.000    0.473    0.000 ftrl.pyx:176(predict_one)
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
          2/1    0.000    0.000    0.000    0.000 {method 'fit' of 'kaggler.online_model.ftrl.FTRL' objects}
    

    The conclusion is the same even when interaction=True. Before:

             32400004 function calls (31800004 primitive calls) in 32.136 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       600000    0.219    0.000    0.676    0.000 <frozen importlib._bootstrap>:996(_handle_fromlist)
       900000    0.264    0.000    0.366    0.000 base.py:1081(isspmatrix)
      1200000    0.421    0.000    0.918    0.000 base.py:181(nnz)
       300000    0.204    0.000    0.204    0.000 base.py:70(__init__)
       300000    0.504    0.000    0.533    0.000 base.py:77(set_shape)
      1500001    0.266    0.000    0.266    0.000 base.py:99(get_shape)
       300000    1.076    0.000    2.179    0.000 compressed.py:1021(prune)
       300000    3.610    0.000    8.970    0.000 compressed.py:127(check_format)
       300000    1.630    0.000   15.610    0.000 compressed.py:24(__init__)
      1200000    0.498    0.000    0.498    0.000 compressed.py:99(getnnz)
       900000    0.190    0.000    0.190    0.000 csr.py:231(_swap)
       300000    0.723    0.000   25.669    0.000 csr.py:236(__getitem__)
       300000    0.588    0.000   20.415    0.000 csr.py:368(_get_row_slice)
       300000    1.306    0.000   19.593    0.000 csr.py:411(_get_submatrix)
       600000    0.518    0.000    1.008    0.000 csr.py:416(process_slice)
       600000    0.243    0.000    0.243    0.000 csr.py:439(check_bounds)
       300000    0.145    0.000    0.349    0.000 data.py:22(__init__)
          2/1    0.000    0.000    0.000    0.000 ftrl.pyx:125(fit)
    599999/300000    3.054    0.000    1.472    0.000 ftrl.pyx:156(update_one)
    599999/300000    3.413    0.000    1.981    0.000 ftrl.pyx:176(predict_one)
       600000    1.069    0.000    1.069    0.000 getlimits.py:245(__init__)
       600000    0.268    0.000    0.268    0.000 getlimits.py:270(max)
      2100000    0.943    0.000    1.649    0.000 numeric.py:414(asarray)
       600000    1.702    0.000    3.899    0.000 sputils.py:119(get_index_dtype)
       900000    1.202    0.000    1.898    0.000 sputils.py:188(isintlike)
       300000    0.954    0.000    0.954    0.000 sputils.py:200(isshape)
       900000    0.493    0.000    0.696    0.000 sputils.py:215(issequence)
       300000    1.177    0.000    3.034    0.000 sputils.py:265(_unpack_index)
       300000    0.128    0.000    0.159    0.000 sputils.py:293(_check_ellipsis)
       300000    0.624    0.000    1.175    0.000 sputils.py:331(_check_boolean)
       300000    0.265    0.000    0.743    0.000 sputils.py:91(to_native)
       600000    0.456    0.000    0.456    0.000 {built-in method builtins.hasattr}
      6000000    0.702    0.000    0.702    0.000 {built-in method builtins.isinstance}
      3000000    0.249    0.000    0.249    0.000 {built-in method builtins.len}
       300000    0.141    0.000    0.141    0.000 {built-in method builtins.max}
      3000000    1.190    0.000    1.190    0.000 {built-in method numpy.core.multiarray.array}
       300000    1.381    0.000    1.381    0.000 {built-in method scipy.sparse._sparsetools.get_csr_submatrix}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
          2/1    0.000    0.000    0.000    0.000 {method 'fit' of 'kaggler.online_model.ftrl.FTRL' objects}
       300000    0.129    0.000    0.129    0.000 {method 'indices' of 'slice' objects}
       300000    0.193    0.000    0.193    0.000 {method 'newbyteorder' of 'numpy.dtype' objects}
    

    after:

             1200004 function calls (600004 primitive calls) in 4.753 seconds
    
       Ordered by: standard name
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    0.000    0.000 base.py:99(get_shape)
          2/1    0.000    0.000    0.000    0.000 ftrl.pyx:125(fit)
    599999/300000    2.293    0.000    1.544    0.000 ftrl.pyx:156(update_one)
    599999/300000    2.460    0.000    1.613    0.000 ftrl.pyx:176(predict_one)
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
          2/1    0.000    0.000    0.000    0.000 {method 'fit' of 'kaggler.online_model.ftrl.FTRL' objects}
    

    Note that when profiling, I set # cython: linetrace=True. The current version may run even faster since the profiling overhead is gone.

    opened by stegben 2
  • Have combined the 4 pull requests into 1

    Hi,

    I have put them into one request

    I just saw you in first place in the Otto Group Challenge. I'm also in the competition, but can't improve anymore after 0.43 lol...

    opened by yejiming 2
  • enhance DAE/SDAE

    • add transfer learning with the pretrained_model input argument
    • allow setting the learning_rate in __init__()
    • add a test for transfer learning between DAE/SDAE
    enhancement 
    opened by jeongyoonlee 1
  • Development

    Is this package still in development? I found many errors while implementing DAE and SDAE.

    Errors may have originated from an input operation.
    Input Source operations connected to node decoder_model/CabinType_emb/embedding_lookup:
     decoder_model/CabinType_emb/embedding_lookup/1202 (defined at /opt/conda/lib/python3.7/contextlib.py:112)
    
    Function call stack:
    train_function
    
    opened by naiborhujosua 0
  • from kaggler.model import AutoLGB does not work

    After pip install -U Kaggler in a Kaggle notebook, I was not able to run from kaggler.model import AutoLGB and got this error: TypeError: __init_subclass__() takes no keyword arguments

    opened by hokmingkwan 0
Releases (v0.9.15)
  • v0.9.15 (Mar 6, 2022)

    What's Changed

    • Fix AutoLGB. Reformat with black by @jeongyoonlee in https://github.com/jeongyoonlee/Kaggler/pull/71
    • up the version to 0.9.15 by @jeongyoonlee in https://github.com/jeongyoonlee/Kaggler/pull/73
    • Update test.yml by @jeongyoonlee in https://github.com/jeongyoonlee/Kaggler/pull/69
    • Update python-publish.yml by @jeongyoonlee in https://github.com/jeongyoonlee/Kaggler/pull/70

    Full Changelog: https://github.com/jeongyoonlee/Kaggler/compare/v0.9.14...v0.9.15

  • v0.9.14 (Mar 5, 2022)

    What's Changed

    • Update python-publish.yml by @jeongyoonlee in https://github.com/jeongyoonlee/Kaggler/pull/64
    • add plot_curve() for plotting ROC and PR curves by @jeongyoonlee in https://github.com/jeongyoonlee/Kaggler/pull/66
    • fix build error by replacing ml_metrics's kappa with scikit-learn's by @jeongyoonlee in https://github.com/jeongyoonlee/Kaggler/pull/67
    • up the version to 0.9.14 by @jeongyoonlee in https://github.com/jeongyoonlee/Kaggler/pull/68

    Full Changelog: https://github.com/jeongyoonlee/Kaggler/compare/v0.9.13...v0.9.14

  • v0.9.13 (Jun 12, 2021)

    • add transfer learning with the pretrained_model input argument 
    • allow setting the learning_rate in __init__()
    • add a test for transfer learning between DAE/SDAE
  • v0.9.12 (Jun 12, 2021)

  • v0.9.11 (Jun 10, 2021)

    • fix an error raised when printing out the DAE/SDAE objects
    • update random_state/seed arguments in DAE/SDAE/DAELayer to follow scikit-learn/tensorflow conventions
    • up the version to v0.9.11
  • v0.9.10 (Jun 8, 2021)

    • add options to add more than 1 encoder in DAELayer
    • add options to add validation_data in DAE/SDAE
    • make label-encoding optional in DAE/SDAE
  • v0.9.9 (Jun 4, 2021)

  • v0.9.8 (Jun 2, 2021)

  • v0.9.7 (Jun 1, 2021)

  • v0.9.5 (May 18, 2021)

    • copy dataframe before transforming it in encoders to prevent overwriting
    • update the default threshold for feature selection in automl
    • fix DAE with all numeric features
  • 0.9.0 (Apr 29, 2021)

Owner

Jeong-Yoon Lee (Kaggler. CausalML. Father of Five.)