Large-scale linear classification, regression and ranking in Python

Overview

lightning

lightning is a library for large-scale linear classification, regression and ranking in Python.

Highlights:

  • follows the scikit-learn API conventions
  • natively supports both dense and sparse data representations
  • computationally demanding parts implemented in Cython

Solvers supported:

  • primal coordinate descent
  • dual coordinate descent (SDCA, Prox-SDCA)
  • SGD, AdaGrad, SAG, SAGA, SVRG
  • FISTA

Example

Example that shows how to learn a multiclass classifier with a group lasso penalty on the News20 dataset (cf. Blondel et al. 2013):

from sklearn.datasets import fetch_20newsgroups_vectorized
from lightning.classification import CDClassifier

# Load News20 dataset from scikit-learn.
bunch = fetch_20newsgroups_vectorized(subset="all")
X = bunch.data
y = bunch.target

# Set classifier options.
clf = CDClassifier(penalty="l1/l2",
                   loss="squared_hinge",
                   multiclass=True,
                   max_iter=20,
                   alpha=1e-4,
                   C=1.0 / X.shape[0],
                   tol=1e-3)

# Train the model.
clf.fit(X, y)

# Accuracy
print(clf.score(X, y))

# Percentage of selected features
print(clf.n_nonzero(percentage=True))
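
The regression estimators follow the same conventions. Below is a minimal sketch with CDRegressor on synthetic data; the data and parameter values are made up for illustration:

import numpy as np
from lightning.regression import CDRegressor

# Synthetic regression problem (illustration only).
rng = np.random.RandomState(0)
X = rng.randn(1000, 200)
y = X[:, :10].sum(axis=1) + 0.1 * rng.randn(1000)

# L1-penalized coordinate descent regression.
reg = CDRegressor(penalty="l1", alpha=1e-2, max_iter=50, tol=1e-3)
reg.fit(X, y)

# Number of non-zero coefficients selected by the L1 penalty.
print(np.count_nonzero(reg.coef_))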

Dependencies

lightning requires Python >= 2.7, setuptools, NumPy >= 1.3, SciPy >= 0.7 and scikit-learn >= 0.15. Building from source also requires Cython and a working C/C++ compiler. To run the tests you will also need nose >= 0.10.

Installation

Precompiled binaries for the stable version of lightning are available for the main platforms and can be installed using pip:

pip install sklearn-contrib-lightning

or conda:

conda install -c conda-forge sklearn-contrib-lightning

The development version of lightning can be installed from its git repository. In this case it is assumed that you have the git version control system, a working C++ compiler, Cython and the numpy development libraries. In order to install the development version, type:

git clone https://github.com/scikit-learn-contrib/lightning.git
cd lightning
python setup.py build
sudo python setup.py install
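
To check that the build is importable, a quick sanity check (this assumes the installed package exposes a __version__ attribute):

python -c "import lightning; print(lightning.__version__)"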

Documentation

http://contrib.scikit-learn.org/lightning/

On Github

https://github.com/scikit-learn-contrib/lightning

Citing

If you use this software, please cite it. Here is a BibTeX snippet that you can use:

@misc{lightning_2016,
  author       = {Blondel, Mathieu and
                  Pedregosa, Fabian},
  title        = {{Lightning: large-scale linear classification,
                 regression and ranking in Python}},
  year         = 2016,
  doi          = {10.5281/zenodo.200504},
  url          = {https://doi.org/10.5281/zenodo.200504}
}

Other citation formats are available in its Zenodo entry.

Authors

  • Mathieu Blondel, 2012-present
  • Manoj Kumar, 2015-present
  • Arnaud Rachez, 2016-present
  • Fabian Pedregosa, 2016-present
Comments
  • [MRG] Parallelize OvR method in primal_cd

    [MRG] Parallelize OvR method in primal_cd

    @mblondel I was trying to get some speed gains by parallelizing the OvR method. However, when I set n_jobs>1, it keeps failing with this error: TypeError: __cinit__() takes exactly 1 positional argument (0 given). Note that it works as expected for n_jobs=1.
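
    A possible workaround, sketched outside this PR (not the approach taken here): wrap a binary lightning classifier in scikit-learn's OneVsRestClassifier, which parallelizes over classes with joblib:

    from sklearn.multiclass import OneVsRestClassifier
    from lightning.classification import CDClassifier

    # One binary CDClassifier per class, fitted in parallel by joblib.
    ovr = OneVsRestClassifier(CDClassifier(penalty="l1", max_iter=20), n_jobs=2)
    # ovr.fit(X, y)  # with X, y as in the README example above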

    opened by MechCoder 37
  • [WIP] Adding prox capability to SAGA.

    [WIP] Adding prox capability to SAGA.

    Continuing #37 after discussing with @fabianp. Added prox capability in _sag_fit of file lightning/impl/sag_fast.pyx where @fabianp left room for it.

    The proximity operator is currently specified when a classifier/regressor is built, via the prox keyword (a ProxFunction type mimicking LossFunction in lightning/impl/sgd_fast.pyx); a small sketch of the L1 prox is given below the checklist. Not sure this is the best way to specify it by default...

    Notes: the prox implementation breaks sparse updates, and the code is excruciatingly slow on sklearn.datasets.fetch_20newsgroups_vectorized (cf. this gist)

    • [x] Draft of proximity operators.
    • [x] Need to add tests.
    • [x] Add sparsity in L1
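
    For reference, a minimal NumPy sketch of the L1 proximity operator (soft-thresholding) that such a ProxFunction-style class would wrap; hypothetical code, not the implementation in this PR:

    import numpy as np

    def prox_l1(w, threshold):
        # Soft-thresholding: proximal operator of threshold * ||w||_1.
        return np.sign(w) * np.maximum(np.abs(w) - threshold, 0.0)
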
    opened by zermelozf 31
  • [MRG] Just in time SAGA.

    [MRG] Just in time SAGA.

    A squashed version of #38 containing:

    • SAGA algorithm in cython.
    • Basic python version of SAG and SAGA for testing.
    • Support for proximity operators through the Penalty base class.
    • L1 proximity operator with just-in-time updates for sparse data (a simplified sketch follows).
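
    A rough sketch of the just-in-time idea for the L1 case (hypothetical and simplified, not the Cython code in this PR): updates to a coordinate are deferred while its feature is zero in the sampled rows, then replayed all at once when the feature next appears:

    import numpy as np

    def catch_up(w, j, avg_grad_j, step, alpha, last_seen, t):
        # Replay the (t - last_seen[j]) deferred steps for coordinate j: each one
        # subtracts the stale average-gradient term and applies the L1 prox.
        for _ in range(t - last_seen[j]):
            w[j] -= step * avg_grad_j
            w[j] = np.sign(w[j]) * max(abs(w[j]) - step * alpha, 0.0)
        last_seen[j] = t

    In actual implementations the loop is typically collapsed into a cheaper closed-form or memoized update; the loop above only shows what is being deferred.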
    opened by zermelozf 24
  • Documentation update

    Documentation update

    Hi @mblondel. Some of the recent additions (such as SAGA) don't show up on the webpage. Would you mind pushing a new version of the docs? (I wouldn't mind doing it myself if it were on GitHub Pages.)

    opened by fabianp 18
  • FIX for SAG with sparse samples.

    FIX for SAG with sparse samples.

    The problem was that, when the solution was updated just in time, the accumulated scaling factors were not taken into account; they were treated as if they had been constant over the last iterations.

    This should fix issue #33, although, because of a Python 3 incompatibility, I've not yet run the full test suite.

    opened by fabianp 14
  • raise AttributeError if predict_proba is not available

    raise AttributeError if predict_proba is not available

    In scikit-learn, when the predict_proba method is not available, an AttributeError is raised instead of NotImplementedError. In this PR (a generic sketch of the pattern follows the list):

    • classifiers are changed to follow the same convention;
    • removed predict_log_proba mentions because lightning doesn't provide this method;
    • added more tests for predict_proba results.
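
    A generic sketch of the convention (hypothetical code, not lightning's actual implementation): expose predict_proba as a property that raises AttributeError when the loss does not produce probabilities, so that hasattr(clf, "predict_proba") behaves as scikit-learn expects:

    class ExampleClassifier:
        def __init__(self, loss="hinge"):
            self.loss = loss

        @property
        def predict_proba(self):
            # Accessing the attribute raises AttributeError for unsupported losses,
            # which makes hasattr(clf, "predict_proba") return False.
            if self.loss != "log":
                raise AttributeError("predict_proba is only available for loss='log'")
            return self._predict_proba

        def _predict_proba(self, X):
            ...  # probability computation would go here
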
    opened by kmike 12
  • 0.1 release

    0.1 release

    I'd like to do a 0.1 release and upload binary packages to pypi and conda. TODO:

    • [x] Make binary conda packages for (at least) windows (appveyor).
    • [x] Update README with build instructions for binary packages.
    • [x] Update the website with the latest stable version.
    • [x] Create maintenance branch 0.1.X
    • [x] After release, upgrade version number to 0.2.dev0.

    What do you think @mblondel ?

    opened by fabianp 12
  • Release `0.6.2`

    Release `0.6.2`

    I believe the Python 3.10 support that was added recently (3afcb4a9967a0d9e3961acd967705e42a593e448) deserves a new release of the package. In the new release we'll upload wheels for Python 3.10, making users' lives easier.

    opened by StrikerRUS 11
  • Build artifacts at GitHub Actions

    Build artifacts at GitHub Actions

    Wheels for all platforms and a source archive will be automatically uploaded to the Releases tab with each tagged commit.

    For an example, please refer to https://github.com/StrikerRUS/lightning/releases/tag/untagged-a19e7c8d925f0295f2b6.

    Unfortunately, neither the manylinux2010 nor the manylinux1 containers can be used, due to the following restriction of Node.js: https://github.com/actions/runner/issues/337. But I think manylinux2014 is better than nothing. Moreover, CentOS 6 and CentOS 5, on which those containers are based, have already reached their EOL. https://github.com/pypa/manylinux

    opened by StrikerRUS 11
  • Should the .pxd files be included with the distribution?

    Should the .pxd files be included with the distribution?

    I'm working on a package that uses lightning cython code as a dependency via:

    from lightning.impl.dataset_fast cimport ColumnDataset

    When lightning is installed via conda or pip, compiling the Cython file fails, but if I distribute the generated .cpp files the code runs fine.

    Should the .pxd files be distributed with lightning to allow this use case?
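
    For reference, a minimal setuptools sketch of how the .pxd files could be shipped alongside the compiled modules (a hypothetical packaging snippet, not the project's actual setup.py):

    from setuptools import setup, find_packages

    setup(
        name="sklearn-contrib-lightning",
        packages=find_packages(),
        # Ship the Cython declaration files so downstream packages can cimport them.
        package_data={"lightning.impl": ["*.pxd"]},
    )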

    opened by vene 10
  • [HOTFIX] fix compatibility with new scikit-learn version

    [HOTFIX] fix compatibility with new scikit-learn version

    This PR will allow using lightning with the latest version (0.23.0) of scikit-learn. Right now, if you try to upgrade scikit-learn, lightning fails with an error saying that it cannot import joblib or six, because they no longer exist in sklearn.externals:

        from lightning.classification import KernelSVC
    ../../../virtualenv/python3.6.7/lib/python3.6/site-packages/lightning/classification.py:1: in <module>
        from .impl.adagrad import AdaGradClassifier
    ../../../virtualenv/python3.6.7/lib/python3.6/site-packages/lightning/impl/adagrad.py:8: in <module>
        from sklearn.externals.six.moves import xrange
    

    six was dropped along with Python 2 support. https://scikit-learn.org/stable/whats_new/v0.21.html#sklearn-externals

    joblib is now a dependency: https://scikit-learn.org/stable/whats_new/v0.21.html#miscellaneous

    This PR should be treated as a hotfix; ideally, lightning should drop support for Python 2 along with the six dependency.
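
    The kind of change involved is small (a hedged sketch of the direction, not the exact diff in this PR):

    # Before (fails on scikit-learn >= 0.23):
    #   from sklearn.externals.six.moves import xrange
    #   from sklearn.externals import joblib
    # After:
    import joblib  # joblib is a standalone dependency now
    # ...and the built-in range replaces xrange under Python 3.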

    opened by StrikerRUS 9
  • Why not initialize SAG/SAGA memory with 0 and divide by seen indices so far as in sklearn?

    Why not initialize SAG/SAGA memory with 0 and divide by seen indices so far as in sklearn?

    Why don't you initialize the gradient memory with 0 and divide by the number of indices seen so far in the SAG algorithm, as suggested in the paper:

    In the update of x in Algorithm 1, we normalize the direction d by the total number of data points n. When initializing with y_i = 0 we believe this leads to steps that are too small on early iterations of the algorithm where we have only seen a fraction of the data points, because many y_i variables contributing to d are set to the uninformative zero-vector. Following Blatt et al. [2007], the more logical normalization is to divide d by m, the number of data points that we have seen at least once

    The SAGA paper suggests a similar procedure:

    Our algorithm assumes that initial gradients are known for each f_i at the starting point x0. Instead, a heuristic may be used where during the first pass, data-points are introduced one-by-one, in a non-randomized order, with averages computed in terms of those data-points processed so far. This procedure has been successfully used with SAG [1].
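
    A minimal NumPy sketch of a SAG-style pass with the normalization described above (divide the aggregated gradient by m, the number of samples seen at least once, rather than by n); a hypothetical illustration for squared loss, not lightning's implementation:

    import numpy as np

    def sag_epoch(X, y, w, grad_memory, grad_sum, seen, step):
        # grad_memory[i] holds the last gradient computed for sample i;
        # grad_sum is the running sum of those stored gradients.
        for i in np.random.permutation(X.shape[0]):
            g_new = (X[i] @ w - y[i]) * X[i]        # squared-loss gradient for sample i
            grad_sum += g_new - grad_memory[i]
            grad_memory[i] = g_new
            seen.add(i)
            w -= step * grad_sum / len(seen)        # normalize by m = len(seen), not n
        return w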

    opened by NikZak 0
  • DOC: sometimes the Lasso solution is the same as sklearn, sometimes not

    DOC: sometimes the Lasso solution is the same as sklearn, sometimes not

    Hi @mblondel @fabianp, I think this will be quick to answer: why is the solution sometimes equal to that of sklearn, and sometimes not?

    This should be quick to reproduce; look at the 1st and 3rd results over the 5 seeds:

    import numpy as np
    from numpy.linalg import norm
    from lightning.regression import CDRegressor
    from sklearn.linear_model import Lasso
    
    np.random.seed(0)
    X = np.random.randn(200, 500)
    beta = np.ones(X.shape[1])
    beta[20:] = 0
    y = X @ beta + 0.3 * np.random.randn(X.shape[0])
    alpha = norm(X.T @ y, ord=np.inf) / 10
    
    
    def p_obj(X, y, alpha, w):
        return norm(y - X @ w) ** 2 / 2 + alpha * norm(w, ord=1)
    
    
    for seed in range(5):
        print('-' * 80)
        clf = CDRegressor(C=0.5, alpha=alpha, penalty='l1',
                          tol=1e-30, random_state=seed)
        clf.fit(X, y)
    
        las = Lasso(fit_intercept=False, alpha=alpha/len(y), tol=1e-10).fit(X, y)
        print(norm(clf.coef_[0] - las.coef_))
    
        light_o = p_obj(X, y, alpha, clf.coef_[0])
        sklea_o = p_obj(X, y, alpha, las.coef_)
    
        print(light_o - sklea_o)
    

    ping @qb3 @agramfort

    opened by mathurinm 5
  • Do you have regression for sparse categorical big data after one-hot transformation?

    Do you have regression for sparse categorical big data after one-hot transformation?

    The data is then sparse, containing only zeros and ones, with many zeros and few ones.
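
    lightning's estimators accept scipy.sparse input directly (see the Highlights above), so one-hot encoded data can be passed as-is. A minimal sketch with made-up data:

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder
    from lightning.regression import CDRegressor

    rng = np.random.RandomState(0)
    cats = rng.randint(0, 1000, size=(5000, 10))   # categorical codes (illustration only)
    y = rng.randn(5000)

    X = OneHotEncoder().fit_transform(cats)        # sparse matrix of zeros and ones
    reg = CDRegressor(penalty="l1", alpha=1e-3)
    reg.fit(X, y)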

    opened by Sandy4321 0