A python fast implementation of the famous SVD algorithm popularized by Simon Funk during Netflix Prize

Overview

funk-svd Build Status License

funk-svd is a Python 3 library implementing a fast version of the famous SVD algorithm popularized by Simon Funk during the Neflix Prize contest.

Numba is used to speed up our algorithm, enabling us to run over 10 times faster than Surprise's Cython implementation (cf. benchmark notebook).

Movielens 20M RMSE MAE Time
Surprise 0.88 0.68 10 min 40 sec
Funk-svd 0.88 0.68 42 sec

Installation

Run pip install git+https://github.com/gbolmier/funk-svd in your terminal.

Contributing

All contributions, bug reports, bug fixes, enhancements, and ideas are welcome.

A detailed overview on how to contribute can be found in the contributor guide.

Quick example

run_experiment.py:

>>> from funk_svd.dataset import fetch_ml_ratings
>>> from funk_svd import SVD

>>> from sklearn.metrics import mean_absolute_error


>>> df = fetch_ml_ratings(variant='100k')

>>> train = df.sample(frac=0.8, random_state=7)
>>> val = df.drop(train.index.tolist()).sample(frac=0.5, random_state=8)
>>> test = df.drop(train.index.tolist()).drop(val.index.tolist())

>>> svd = SVD(lr=0.001, reg=0.005, n_epochs=100, n_factors=15,
...           early_stopping=True, shuffle=False, min_rating=1, max_rating=5)

>>> svd.fit(X=train, X_val=val)
Preprocessing data...

Epoch 1/...

>>> pred = svd.predict(test)
>>> mae = mean_absolute_error(test['rating'], pred)

>>> print(f'Test MAE: {mae:.2f}')
Test MAE: 0.75

Funk SVD for recommendation in a nutshell

We have a huge sparse matrix:

storing known ratings for a set of users and items:

The idea is to estimate unknown ratings by factorizing the rating matrix into two smaller matrices representing user and item characteristics:

We call these two matrices users and items latent factors. Then, by applying the dot product between both matrices we can reconstruct our rating matrix. The trick is that the empty values will now contain estimated ratings.

In order to get more accurate results, the global average rating as well as the user and item biases are used in addition:

where K stands for known ratings.

Then, we can estimate any rating by applying:

The learning step consists in performing the SGD algorithm where for each known rating the biases and latent factors are updated as follows:

where alpha is the learning rate and lambda is the regularization term.

References

License

MIT license, see here.

Owner
Geoffrey Bolmier
Geoffrey Bolmier
A repository to index and organize the latest machine learning courses found on YouTube.

📺 ML YouTube Courses At DAIR.AI we ❤️ open education. We are excited to share some of the best and most recent machine learning courses available on

DAIR.AI 9.6k Jan 01, 2023
This project impelemented for midterm of the Machine Learning #Zoomcamp #Alexey Grigorev

MLProject_01 This project impelemented for midterm of the Machine Learning #Zoomcamp #Alexey Grigorev Context Dataset English question data set file F

Hadi Nakhi 1 Dec 18, 2021
This is a curated list of medical data for machine learning

Medical Data for Machine Learning This is a curated list of medical data for machine learning. This list is provided for informational purposes only,

Andrew L. Beam 5.4k Dec 26, 2022
Uplift modeling and causal inference with machine learning algorithms

Disclaimer This project is stable and being incubated for long-term support. It may contain new experimental code, for which APIs are subject to chang

Uber Open Source 3.7k Jan 07, 2023
MasTrade is a trading bot in baselines3,pytorch,gym

mastrade MasTrade is a trading bot in baselines3,pytorch,gym idea we have for example 1 btc and we buy a crypto with it with market option to trade in

Masoud Azizi 18 May 24, 2022
Greykite: A flexible, intuitive and fast forecasting library

The Greykite library provides flexible, intuitive and fast forecasts through its flagship algorithm, Silverkite.

LinkedIn 1.4k Jan 15, 2022
Gaussian Process Optimization using GPy

End of maintenance for GPyOpt Dear GPyOpt community! We would like to acknowledge the obvious. The core team of GPyOpt has moved on, and over the past

Sheffield Machine Learning Software 847 Dec 19, 2022
A Powerful Serverless Analysis Toolkit That Takes Trial And Error Out of Machine Learning Projects

KXY: A Seemless API to 10x The Productivity of Machine Learning Engineers Documentation https://www.kxy.ai/reference/ Installation From PyPi: pip inst

KXY Technologies, Inc. 35 Jan 02, 2023
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

Epistasis Lab at UPenn 8.9k Jan 09, 2023
TensorFlow Decision Forests (TF-DF) is a collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models.

TensorFlow Decision Forests (TF-DF) is a collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models. The library is a collection of Keras models

538 Jan 01, 2023
Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark environment.

pyspark-anonymizer Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark envir

6 Jun 30, 2022
A Python library for choreographing your machine learning research.

A Python library for choreographing your machine learning research.

AI2 270 Jan 06, 2023
SynapseML - an open source library to simplify the creation of scalable machine learning pipelines

Synapse Machine Learning SynapseML (previously MMLSpark) is an open source library to simplify the creation of scalable machine learning pipelines. Sy

Microsoft 3.9k Dec 30, 2022
Python 3.6+ toolbox for submitting jobs to Slurm

Submit it! What is submitit? Submitit is a lightweight tool for submitting Python functions for computation within a Slurm cluster. It basically wraps

Facebook Incubator 768 Jan 03, 2023
Pandas Machine Learning and Quant Finance Library Collection

Pandas Machine Learning and Quant Finance Library Collection

148 Dec 07, 2022
Anomaly Detection and Correlation library

luminol Overview Luminol is a light weight python library for time series data analysis. The two major functionalities it supports are anomaly detecti

LinkedIn 1.1k Jan 01, 2023
A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.

AI Fairness 360 (AIF360) The AI Fairness 360 toolkit is an extensible open-source library containg techniques developed by the research community to h

1.9k Jan 06, 2023
Kalman filter library

The kalman filter framework described here is an incredibly powerful tool for any optimization problem, but particularly for visual odometry, sensor fusion localization or SLAM.

comma.ai 276 Jan 01, 2023
Both social media sentiment and stock market data are crucial for stock price prediction

Relating-Social-Media-to-Stock-Movement-Public - We explore the application of Machine Learning for predicting the return of the stock by using the information of stock returns. A trading strategy ba

Vishal Singh Parmar 15 Oct 29, 2022
Open-Source CI/CD platform for ML teams. Deliver ML products, better & faster. ⚡️🧑‍🔧

Deliver ML products, better & faster Giskard is an Open-Source CI/CD platform for ML teams. Inspect ML models visually from your Python notebook 📗 Re

Giskard 335 Jan 04, 2023