A benchmark of data-centric tasks from across the machine learning lifecycle.

Last update: Dec 28, 2022

⚡️ Quickstart

pip install dcbench

Optional: some parts of dcbench rely on optional dependencies. If you know which ones you need, install them with an extras specifier, for example pip install "dcbench[dev]". See setup.py for the full list of optional dependencies.

Installing from dev: pip install "dcbench[dev] @ git+https://github.com/data-centric-ai/dcbench"

Using a Jupyter notebook or some other interactive environment, you can import the library and explore the data-centric problems in the benchmark:

import dcbench
dcbench.tasks
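In a shared or hosted notebook it can help to guard the import so that a missing or broken installation produces a readable message rather than a long traceback. The following is a minimal sketch using only the standard library; the helper name try_import is hypothetical and not part of dcbench, and the only dcbench attribute it touches is dcbench.tasks from the snippet above.

```python
import importlib


def try_import(name):
    """Import `name` and return (module, None), or (None, error) on failure.

    Hypothetical helper for illustration; it is not part of dcbench.
    """
    try:
        return importlib.import_module(name), None
    except ImportError as exc:  # also covers ModuleNotFoundError
        return None, exc


dcbench, error = try_import("dcbench")
if dcbench is None:
    print(f"dcbench could not be imported: {error}")
else:
    # Same exploration step as in the quickstart above.
    print(dcbench.tasks)
```

If the message names a submodule rather than dcbench itself, the package is installed but incomplete, and reinstalling from the repository may be worth trying.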

To learn more, follow the walkthrough in the docs.

💡 What is dcbench?

This benchmark evaluates the steps in your machine learning workflow beyond model training and tuning. These steps include feature cleaning, slice discovery, and coreset selection. We call these “data-centric” tasks because they're focused on exploring and manipulating data – not training models. dcbench supports a growing list of them.

dcbench includes tasks that look very different from one another: the inputs and outputs of the slice discovery task are not the same as those of the minimal data cleaning task. However, we think it important that researchers and practitioners be able to run evaluations on data-centric tasks across the ML lifecycle without having to learn a bunch of different APIs or rewrite evaluation scripts.

So, dcbench is designed to be a common home for these diverse, but related, tasks. In dcbench, all of these tasks are structured in a similar manner, and they are supported by a common Python API that makes it easy to download data, run evaluations, and compare methods.
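To make the idea of a shared structure concrete, here is a toy sketch of how differently shaped tasks could sit behind one evaluation interface. Everything in it is hypothetical: the Problem and Task classes, the compare function, and the accuracy-style score are illustrative stand-ins and are not dcbench's actual classes or metrics.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Problem:
    """Hypothetical stand-in for one benchmark problem (not dcbench's class)."""

    id: str
    inputs: List[int]
    target: List[int]

    def evaluate(self, predicted: List[int]) -> float:
        """Return the fraction of positions where `predicted` matches `target`."""
        hits = sum(p == t for p, t in zip(predicted, self.target))
        return hits / len(self.target)


@dataclass
class Task:
    """Hypothetical container grouping problems that share one interface."""

    id: str
    problems: List[Problem] = field(default_factory=list)


def compare(task: Task, methods: Dict[str, Callable[[Problem], List[int]]]) -> Dict[str, float]:
    """Run every method on every problem and return each method's mean score."""
    return {
        name: sum(p.evaluate(method(p)) for p in task.problems) / len(task.problems)
        for name, method in methods.items()
    }


toy_task = Task(
    id="toy",
    problems=[
        Problem("p1", inputs=[1, 0, 1, 1], target=[1, 0, 0, 1]),
        Problem("p2", inputs=[0, 0, 1], target=[0, 0, 1]),
    ],
)
scores = compare(
    toy_task,
    {
        "identity": lambda p: list(p.inputs),    # predict the inputs unchanged
        "zeros": lambda p: [0] * len(p.inputs),  # predict all zeros
    },
)
print(scores)
```

Because every method has the same signature and every problem exposes the same evaluate call, adding a new method or a new problem does not require a new evaluation script, which is the property the paragraph above describes.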

✉️ About

dcbench is being developed alongside the data-centric-ai benchmark. Reach out to Bojan Karlaš (karlasb [at] inf [dot] ethz [dot] ch) and Sabri Eyuboglu (eyuboglu [at] stanford [dot] edu) if you would like to get involved or contribute!

Comments

No module named 'dcbench.tasks.budgetclean.cpclean'

After installing dcbench in a Google Colab environment, import dcbench raised the error above. Full traceback:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-8-a1030f6d7ef9> in <module>()
      1 
----> 2 import dcbench
      3 dcbench.tasks

2 frames
/usr/local/lib/python3.7/dist-packages/dcbench/__init__.py in <module>()
     13 )
     14 from .config import config
---> 15 from .tasks.budgetclean import BudgetcleanProblem
     16 from .tasks.minidata import MiniDataProblem
     17 from .tasks.slice_discovery import SliceDiscoveryProblem

/usr/local/lib/python3.7/dist-packages/dcbench/tasks/budgetclean/__init__.py in <module>()
      3 from ...common import Task
      4 from ...common.table import Table
----> 5 from .baselines import cp_clean, random_clean
      6 from .common import Preprocessor
      7 from .problem import BudgetcleanProblem, BudgetcleanSolution

/usr/local/lib/python3.7/dist-packages/dcbench/tasks/budgetclean/baselines.py in <module>()
      6 from ...common.baseline import baseline
      7 from .common import Preprocessor
----> 8 from .cpclean.algorithm.select import entropy_expected
      9 from .cpclean.algorithm.sort_count import sort_count_after_clean_multi
     10 from .cpclean.clean import CPClean, Querier

ModuleNotFoundError: No module named 'dcbench.tasks.budgetclean.cpclean'

!pip install dcbench produced the following log:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. 
flask 1.1.4 requires click<8.0,>=5.1, but you have click 8.0.3 which is incompatible.
datascience 0.10.6 requires coverage==3.7.1, but you have coverage 6.2 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
coveralls 0.5 requires coverage<3.999,>=3.6, but you have coverage 6.2 which is incompatible.
Successfully installed SecretStorage-3.3.1 aiohttp-3.8.1 aiosignal-1.2.0 antlr4-python3-runtime-4.8 async-timeout-4.0.2 asynctest-0.13.0 black-21.12b0 cfgv-3.3.1 click-8.0.3 colorama-0.4.4 commonmark-0.9.1 coverage-6.2 cryptography-36.0.1 cytoolz-0.11.2 dataclasses-0.6 datasets-1.17.0 dcbench-0.0.4 distlib-0.3.4 docformatter-1.4 flake8-4.0.1 frozenlist-1.2.0 fsspec-2021.11.1 future-0.18.2 fuzzywuzzy-0.18.0 fvcore-0.1.5.post20211023 huggingface-hub-0.2.1 identify-2.4.1 importlib-metadata-4.2.0 iopath-0.1.9 isort-5.10.1 jeepney-0.7.1 jsonlines-3.0.0 keyring-23.4.0 livereload-2.6.3 markdown-3.3.4 mccabe-0.6.1 meerkat-ml-0.2.3 multidict-5.2.0 mypy-extensions-0.4.3 nbsphinx-0.8.8 nodeenv-1.6.0 omegaconf-2.1.1 parameterized-0.8.1 pathspec-0.9.0 pkginfo-1.8.2 platformdirs-2.4.1 pluggy-1.0.0 portalocker-2.3.2 pre-commit-2.16.0 progressbar-2.5 pyDeprecate-0.3.1 pycodestyle-2.8.0 pyflakes-2.4.0 pytest-6.2.5 pytest-cov-3.0.0 pytorch-lightning-1.5.7 pyyaml-6.0 readme-renderer-32.0 recommonmark-0.7.1 requests-toolbelt-0.9.1 rfc3986-1.5.0 sphinx-autobuild-2021.3.14 sphinx-rtd-theme-1.0.0 torchmetrics-0.6.2 twine-3.7.1 typed-ast-1.5.1 ujson-5.1.0 untokenize-0.1.1 virtualenv-20.12.1 xxhash-2.0.2 yacs-0.1.8 yarl-1.7.2
WARNING: The following packages were previously imported in this runtime:
  [pydevd_plugins]
You must restart the runtime in order to use newly installed versions.

python version : 3.7.12 platform: Linux-5.4.144+-x86_64-with-Ubuntu-18.04-bionic
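One way to check whether the cpclean subpackage named in the traceback was actually shipped with the installed distribution is to look for its directory on disk without importing dcbench, since the import itself is what fails. This is a minimal sketch using only the standard library; the helper name subpackage_dir is hypothetical, and it assumes the package is installed as ordinary files rather than inside a zip archive.

```python
import importlib.util
from pathlib import Path


def subpackage_dir(top_level, relative):
    """Return the on-disk directory of a subpackage, or None if it is absent.

    Looking up a top-level name with find_spec does not execute the package's
    __init__.py, so this works even when importing the package raises.
    """
    spec = importlib.util.find_spec(top_level)
    if spec is None or not spec.submodule_search_locations:
        return None
    for root in spec.submodule_search_locations:
        candidate = Path(root).joinpath(*relative.split("."))
        if candidate.is_dir():
            return candidate
    return None


# Dotted path taken from the ModuleNotFoundError above.
print(subpackage_dir("dcbench", "tasks.budgetclean.cpclean"))
```

A result of None while dcbench itself is installed would suggest that the published package omitted the subpackage, in which case installing from the source repository may be worth trying.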

opened by mathav95raj 2

Slice discovery problem p_72411 misses files
Hi,

Thanks for this great tool!

I'm loading slice discovery problems; however, problem p_72411 is missing files. Can you fix this SD problem?

FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.dcbench/slice_discovery/problem/artifacts/p_72411/test_predictions.mk/meta.yaml'
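To narrow down which artifact files are absent before re-downloading, a small check over the local cache can help. This is a minimal sketch; the helper missing_files is hypothetical, the cache location is assumed to be under the current user's home directory as in the error message, and the only expected file listed is the one the error names.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(root: Path, expected: Iterable[str]) -> List[str]:
    """Return the relative paths from `expected` that are absent under `root`."""
    return [rel for rel in expected if not (root / rel).is_file()]


# Directory layout copied from the FileNotFoundError above; the home
# directory prefix is an assumption and may differ on your machine.
artifact_dir = Path.home() / ".dcbench/slice_discovery/problem/artifacts/p_72411"
print(missing_files(artifact_dir, ["test_predictions.mk/meta.yaml"]))
```

An empty list means every listed file is present locally; any path that is printed is one the loader would fail on.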
opened by duguyue100 0

Releases(v-0.0.1-beta)

v-0.0.1-beta(Nov 5, 2021)

Owner

GitHub Repository https://www.datacentricai.cc/
