Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

Getting started

pip install meerkat-ml

Note: some parts of Meerkat rely on optional dependencies. If you know which optional dependencies you'd like to install, you can do so using something like pip install meerkat-ml[dev,text] instead. See setup.py for a full list of optional dependencies.

Load your dataset into a DataPanel and get going!

import meerkat as mk
dp = mk.DataPanel.from_csv("...")

What is Meerkat?

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by efficient and robust IO under the hood.

Meerkat's core contribution is the DataPanel, a simple columnar data abstraction. The Meerkat DataPanel can house columns of arbitrary type – from integers and strings to complex, high-dimensional objects like videos, images, medical volumes and graphs.

DataPanel loads high-dimensional data lazily. A full high-dimensional dataset won't typically fit in memory. Behind the scenes, DataPanel handles this by only materializing these objects when they are needed.

import meerkat as mk

# Images are NOT read from disk at DataPanel creation...
dp = mk.DataPanel({
    'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
    'image': mk.ImageColumn.from_filepaths(['fox.png', 'jump.png', 'dog.png']),
    'label': [0, 1, 0]
}) 

# ...only at this point is "fox.png" read from disk
dp["image"][0]
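
To make the idea of lazy materialization concrete, here is a minimal pure-Python sketch. It is not Meerkat's implementation: `LazyCell`, `LazyColumn`, and `fake_loader` are hypothetical names used only for illustration, and the loader just records which paths were requested instead of decoding images.

```python
from typing import Any, Callable, List


class LazyCell:
    """Stores what is needed to load an object, not the object itself."""

    def __init__(self, filepath: str, loader: Callable[[str], Any]):
        self.filepath = filepath
        self.loader = loader

    def get(self) -> Any:
        # The (potentially expensive) read happens only here.
        return self.loader(self.filepath)


class LazyColumn:
    """A column of LazyCell objects that materializes on integer indexing."""

    def __init__(self, filepaths: List[str], loader: Callable[[str], Any]):
        self.cells = [LazyCell(path, loader) for path in filepaths]

    def __len__(self) -> int:
        return len(self.cells)

    def __getitem__(self, index: int) -> Any:
        return self.cells[index].get()


# Hypothetical loader that records which files were "read".
reads: List[str] = []


def fake_loader(path: str) -> str:
    reads.append(path)
    return f"<contents of {path}>"


column = LazyColumn(["fox.png", "jump.png", "dog.png"], fake_loader)
assert reads == []  # nothing has been read at creation time
first = column[0]   # only "fox.png" is read here
```

Creating the column costs only the bookkeeping for three small cell objects; the loader runs once per access.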

DataPanel supports advanced indexing. Using indexing patterns similar to those of Pandas and NumPy, we can access a subset of a DataPanel's rows and columns.

import meerkat as mk
import numpy as np

dp = ... # create DataPanel

# Pull a column out of the DataPanel
new_col: mk.ImageColumn = dp["image"]

# Create a new DataPanel from a subset of the columns in an existing one
new_dp: mk.DataPanel = dp[["image", "label"]] 

# Create a new DataPanel from a subset of the rows in an existing one
new_dp: mk.DataPanel = dp[10:20] 
new_dp: mk.DataPanel = dp[np.array([0,2,4,8])]

# Pull a column out of the DataPanel and get a subset of its rows 
new_col: mk.ImageColumn = dp["image"][10:20]

DataPanel supports map, update and filter operations. When training and evaluating our models, we often perform operations on each example in our dataset (e.g. compute a model's prediction on each example, tokenize each sentence, compute a model's embedding for each example) and store the results. The DataPanel makes it easy to perform these operations and produce new columns (via DataPanel.map), store the columns alongside the original data (via DataPanel.update), and extract an important subset of the dataset (via DataPanel.filter). Under the hood, dataloading is multiprocessed so that costly I/O doesn't bottleneck our computation. Consider the example below, where we update a DataPanel with two new columns holding model predictions and probabilities.

# A simple evaluation loop using Meerkat
import meerkat as mk
import torch
import torch.nn as nn

dp: mk.DataPanel = ... # get DataPanel
model: nn.Module = ... # get the model
model.to(0).eval() # prepare the model for evaluation

@torch.no_grad()
def predict(batch: dict):
    probs = torch.softmax(model(batch["input"].to(0)), dim=-1)
    return {"probs": probs.cpu(), "pred": probs.cpu().argmax(dim=-1)}

# updated_dp has two new `TensorColumn`s: one for probabilities and one
# for predictions
updated_dp: mk.DataPanel = dp.update(function=predict, batch_size=128, is_batched_fn=True)
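
The batched update pattern can be sketched without any dependencies. The following is an illustration under simplifying assumptions, not Meerkat's code: the table is a plain dict of lists, `batched_update` is a hypothetical helper, and the "model" is a stand-in function. It shows how a batched function's outputs are collected into new columns stored alongside the originals.

```python
from typing import Callable, Dict, List


def batched_update(
    data: Dict[str, List],
    function: Callable[[Dict[str, List]], Dict[str, List]],
    batch_size: int,
) -> Dict[str, List]:
    """Apply `function` to consecutive batches and add its outputs as new columns."""
    num_rows = len(next(iter(data.values())))
    new_columns: Dict[str, List] = {}
    for start in range(0, num_rows, batch_size):
        # Slice every column to build one batch.
        batch = {name: col[start:start + batch_size] for name, col in data.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Original columns are kept; new columns are stored alongside them.
    return {**data, **new_columns}


def predict(batch: Dict[str, List]) -> Dict[str, List]:
    # Stand-in for a model: "predict" 1 when the input is positive.
    return {"pred": [int(x > 0) for x in batch["input"]]}


table = {"input": [-2, 3, 0, 5, -1], "label": [0, 1, 0, 1, 0]}
updated = batched_update(table, predict, batch_size=2)
```

This sketch processes batches sequentially in one process; the multiprocessed loading described above is not modeled.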

DataPanel is extendable. Meerkat makes it easy for you to make custom column types for your data. The easiest way to do this is by subclassing AbstractCell. Subclasses of AbstractCell are meant to represent one element in one column of a DataPanel. For example, say we want our DataPanel to include a column of videos we have stored on disk. We want these videos to be lazily loaded using scikit-video, so we implement a VideoCell class as follows:

from typing import Collection

import meerkat as mk
import skvideo.io

class VideoCell(mk.AbstractCell):
    
    # What information will we eventually need to materialize the cell?
    def __init__(self, filepath: str):
        super().__init__()
        self.filepath = filepath
    
    # How do we actually materialize the cell?
    def get(self):
        return skvideo.io.vread(self.filepath)
    
    # What attributes should be written to disk on `VideoCell.write`?
    @classmethod
    def _state_keys(cls) -> Collection:
        return {"filepath"}

# We don't need to define a `VideoColumn` class and can instead just
# create a CellColumn from a list of `VideoCell`
vid_column = mk.CellColumn(map(VideoCell, ["vid1.mp4", "vid2.mp4", "vid3.mp4"]))

Supported Columns

Meerkat ships with a number of core column types and the list is growing.

Core Columns

| Column | Description |
| --- | --- |
| ListColumn | Flexible and can hold any type of data. |
| NumpyArrayColumn | np.ndarray behavior for vectorized operations. |
| TensorColumn | torch.tensor behavior for vectorized operations on the GPU. |
| ImageColumn | Holds images stored on disk (e.g. as PNG or JPEG). |
| VideoColumn | Holds videos stored on disk (e.g. as MP4). |
| MedicalVolumeColumn | Optimized for medical images stored in DICOM or NIfTI format. |
| SpacyColumn | Holds processed text in spaCy Doc objects. |
| EmbeddingColumn | Holds embeddings and provides utility methods like umap and build_faiss_index. |
| ClassificationOutputColumn | Holds classifier predictions. |
| CellColumn | Like ListColumn, but optimized for AbstractCell objects. |

Contributed Columns

| Column | Supported | Description |
| --- | --- | --- |
| WILDSInputColumn | Yes | Build DataPanels for the WILDS benchmark. |

About

Meerkat is being developed at Stanford's Hazy Research Lab. Please reach out to kgoel [at] cs [dot] stanford [dot] edu if you would like to use or contribute to Meerkat.

Comments
  • [BUG] from_pandas without reset_index

    When using Meerkat's from_pandas, things break if you just ran a filter and did not call reset_index(). You get an ambiguous key error when calling from_pandas. I would add a check with a better error message for the case where a user passes a DataFrame with a non-sequential index.
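
A check along the lines suggested above could look like the following sketch. It is an assumption about what such a check might do, not code from Meerkat: `check_sequential_index` is a hypothetical helper that takes the index labels as a plain sequence of integers, and the error message points at pandas' `reset_index(drop=True)` as the fix.

```python
from typing import Sequence


def check_sequential_index(index: Sequence[int]) -> None:
    """Raise a descriptive error unless `index` is exactly 0, 1, ..., n-1."""
    if list(index) != list(range(len(index))):
        raise ValueError(
            "The DataFrame index is not sequential (0..n-1), which can happen "
            "after filtering rows. Call `df.reset_index(drop=True)` before "
            "`from_pandas`."
        )


check_sequential_index([0, 1, 2])      # passes silently

try:
    check_sequential_index([0, 2, 5])  # index left over from a filter
    error_message = None
except ValueError as err:
    error_message = str(err)
```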

    opened by seyuboglu 3
  • [WIP] Implement `BlockManager` backend

    Overhaul the internals of the Meerkat DataPanel. The changes seek to enable:

    1. Vectorized row-wise operations (e.g. slicing, reduction)
    2. Simplified I/O and improved latency
    3. Clarified view vs. copy behavior
      • We introduce a new spec detailing when users should expect to get views vs. copies (similar to this resource for NumPy) – I'm working on enforcing this spec throughout the codebase.

    The new internals are based primarily off the BlockManager class, a dict-like object meant to replace the dictionary we were storing the DataPanel's columns in before. The BlockManager manages links between a DataPanel's columns and data blocks (AbstractBlock, NumpyBlock) where the data is actually stored. It implements consolidate, which takes columns of similar type in a DataPanel and stores their data together in a block, and apply which applies row-wise operations (e.g. getitem) to the blocks in a vectorized fashion. Other important classes:

    • BlockRef objects link a block with the BlockManager. These are critical to the functioning of the BlockManager and are the primary type of object passed between the blocks and the block manager. They consist of two things:
      1. A reference to the block (self.block)
      2. A set of columns in the BlockManager whose data live in the Block
    • BlockableMixin - a mixin used with AbstractColumn that holds references to a column's block and the column's index in the block
    • BlockView - a simple DataClass holding a block and an index into the block. It is typical for new columns to be created from BlockView
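
As a rough illustration of the consolidate idea described above, here is a dependency-free sketch. It does not reproduce the PR's classes: `Block`, `BlockRef`, and `consolidate` below are simplified, hypothetical stand-ins in which columns are plain lists and a block stores the rows of all same-typed columns together, so a single slice serves every column in the block.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Block:
    """Stores the data of several same-typed columns together, row by row."""
    rows: List[tuple]
    column_order: List[str]

    def getitem(self, index: slice) -> "Block":
        # One slice serves every column stored in the block.
        return Block(self.rows[index], self.column_order)


@dataclass
class BlockRef:
    """Links a block to the names of the columns whose data live in it."""
    block: Block
    columns: Set[str] = field(default_factory=set)


def consolidate(columns: Dict[str, list]) -> List[BlockRef]:
    """Group columns by element type and store each group in a single block."""
    groups: Dict[type, List[str]] = {}
    for name, values in columns.items():
        groups.setdefault(type(values[0]), []).append(name)
    refs = []
    for names in groups.values():
        rows = list(zip(*(columns[name] for name in names)))
        refs.append(BlockRef(Block(rows, names), set(names)))
    return refs


refs = consolidate({"a": [0, 1, 2], "b": [3, 4, 5], "text": ["x", "y", "z"]})
int_ref = next(ref for ref in refs if ref.columns == {"a", "b"})
sliced = int_ref.block.getitem(slice(0, 2))
```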

    Note: I marked this as a WIP because there are still a few more things to be done on this front.

    1. Make concat BlockManager aware

    Other major changes:

    • Removed visible_rows from AbstractColumn
    • Removed _cloneable_kwargs in favor of a unified _clone, _copy, and _view module (cloneable.py)
    opened by seyuboglu 3
  • Make `DataPanel(dp)` return some shallow copied version of the original `dp`.

    Issue

    It is very natural for users (and developers) to construct new DataPanel objects from existing ones via DataPanel(dp).

    Important Aside

    An unexpected consequence of this issue is finding a good way to stratify which attributes should be recomputed and which should simply be shallow copied over.

    As an example, two attributes that every DataPanel has are _data and _identifier. _data is typically large and heavy-weight, so we will almost always want to shallow copy it. _identifier is quite lightweight and may be unique to different DataPanels, so maybe this is a property we recompute each time in __init__. Note this is just an example; we may want the identifier to persist.

    This is especially relevant for subclassing DataPanel. As of PR #57, self.from_batch() is used to construct new DataPanel containers from existing ones with shared underlying data. However, as the PR mentions, self.from_batch() is called by many other ops (_get, merge, concat, etc.), and none of these methods have a seamless way of passing arguments other than data to __init__.

    An example of this is EntityDataPanel, where the index_column should be passed from the current instance to the newly constructed instance. Because there is no way to plumb that information through different calls, the initializer of EntityDataPanel gets called with EntityDataPanel(index_column=None) even if the current instance has an index column. This results in a new column "_ent_index" being added to the new EntityDataPanel.

    Proposed Solution 1

    Implement a private instance method called _clone(data=None, visible_columns=None...) -> DataPanel/subclass which implements the default functionality for how to construct a new DataPanel with the relevant arguments to plumb from current instance to new instance. We can then call self._clone(data=data, visible_columns-optional) instead of self.from_batch() in ops like _get, merge, concat, etc.

    Let's consider the EntityDataPanel case. We want to plumb self.index_column from a current EntityDataPanel to all EntityDataPanels constructed in its image. ._clone will look something like

    class EntityDataPanel:
        def _clone(self, data=None) -> "EntityDataPanel":
            if data is None:
                data = self.data
            return EntityDataPanel(data, identifier=self.identifier, index_column=self.index_column)
    

    We can then have ops like DataPanel._get() for example use self._clone() instead of self.from_batch(). For example

    class DataPanel:
        def _get(self, index, materialize=False):
            ...
            # example cases where `index` returns a datapanel
            elif isinstance(index, slice):
                # slice index => multiple row selection (DataPanel)
                # return self.from_batch(
                #    {
                #        k: self._data[k]._get(index, materialize=materialize)
                #        for k in self.visible_columns
                #    })
                return self._clone({
                    k: self._data[k]._get(index, materialize=materialize)
                    for k in self.visible_columns
                })
            ...
    

    Proposed Solution 2

    Instead of having developers reimplement ._clone(), we can have them implement something like _state_keys() but for init args. Something like ._clone_kwargs():

    class DataPanel:
        def _clone_kwargs(self) -> dict:
            # Default init arguments carried over from this instance to a clone
            return {"data": self.data, "identifier": self.identifier}

        def _clone(self, **kwargs):
            default_kwargs = self._clone_kwargs()
            if kwargs:
                default_kwargs.update(kwargs)
            return self.__class__(**default_kwargs)

    class EntityDataPanel(DataPanel):
        def _clone_kwargs(self) -> dict:
            # Extend the defaults with the subclass-specific init argument
            default_kwargs = super()._clone_kwargs()
            default_kwargs.update({"index_column": self.index_column})
            return default_kwargs
    
    opened by ad12 3
  • [BUG] Indexing into DataPanel changes custom column type

    Bug Description When indexing to get a subset of rows from a DataPanel with a complex custom column type, the type of that column is being changed to a ListColumn in the new subset DataPanel

    To Reproduce May be difficult to reproduce as it's only occurring for one custom column type that we have.

    1. Create complex custom column type (ours is a column where each cell is a time series with categorical values and subclasses mk.CellColumn)
    2. Create a DataPanel instance (dp) that has the above column and some data inside of it
    3. Index into the DataPanel (dp_subset = dp[0:1])
    4. The column type for that specific column in dp_subset has changed to a ListColumn

    System Information

    • OS: MacOS
    opened by dhatcher8 2
  • Add args, kwargs to ColumnIOMixin._read_data

    @krandiash enable this code to run without errors:

    import meerkat as mk
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc1 = nlp("Apple is looking at buying U.K. startup for $1 billion")
    doc2 = nlp("Hello there")

    dp = mk.DataPanel({
        # 'text': ['The quick brown fox.', 'Jumped over.'],
        # 'spacy': mk.SpacyColumn([doc1, doc2]),
        'list': [{}, {}]
    })

    dp.write('meerkat.dataset')
    dp2 = dp.read('meerkat.dataset', nlp=nlp)

    opened by jessevig 2
  • [FEATURE] Sort DataPanel by a column

    Add a sort function that can be used to sort the DataPanel by values in a column.

    dp = mk.DataPanel({'a': [1, 3, 2], 'b': ['a', 'c', 'b']})
    dp.sort('a') # sorted view into the dp
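
One way to think about the requested behavior is as an argsort over the chosen column followed by reindexing every column with that order. The sketch below assumes a plain dict of lists rather than a DataPanel, and `sort_by` is a hypothetical helper; unlike the "sorted view" requested above, it returns copies of the columns.

```python
from typing import Dict, List


def sort_by(data: Dict[str, List], by: str, ascending: bool = True) -> Dict[str, List]:
    """Return a new dict of columns with rows reordered by column `by`."""
    # argsort: row positions ordered by the values in the sort column
    order = sorted(range(len(data[by])), key=lambda i: data[by][i], reverse=not ascending)
    return {name: [col[i] for i in order] for name, col in data.items()}


table = {"a": [1, 3, 2], "b": ["a", "c", "b"]}
sorted_table = sort_by(table, "a")
```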
    
    opened by krandiash 2
  • Remove `visible_columns` from `DataPanel`

    DataPanels no longer rely on visible_columns to create views. This PR removes visible_columns entirely.

    Other changes:

    • Improve code coverage
      • Reactivate provenance tests
      • DataPanel batch tests
      • Concat tests
      • Merge tests
    • Remove Identifiers, Splits and Info from DataPanel and AbstractColumn
    opened by seyuboglu 2
  • [BUG] Appending along columns not working without suffix argument

    Appending to a DataPanel along columns does not work without suffix argument even when the column names do not overlap.

    dp = ms.DataPanel({
        'text': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
        'label': [0, 1, 0]
    })
    dp2 = ms.DataPanel({
        'string': ['The quick brown fox.', 'Jumped over.', 'The lazy dog.'],
        'target': [0, 1, 0]
    })
    dp.append(dp2, axis=1)
    

    This code throws ValueError. It works when I provide any suffix, although they are not used.

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-18-5f32282aa054> in <module>()
    ----> 1 dp.append(dp2, axis=1)
    
    1 frames
    /usr/local/lib/python3.7/dist-packages/mosaic/datapanel.py in append(self, dp, axis, suffixes, overwrite)
        422             if not overwrite and shared:
        423                 if suffixes is None:
    --> 424                     raise ValueError()
        425                 left_suf, right_suf = suffixes
        426                 data = {
    
    ValueError:
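
The behavior the report expects can be sketched as follows: suffixes are required only when column names actually collide, and the error says which columns are involved. This is an illustration on plain dicts of lists, not the library's `append`; `append_columns` is a hypothetical helper.

```python
from typing import Dict, List, Optional, Tuple


def append_columns(
    left: Dict[str, List],
    right: Dict[str, List],
    suffixes: Optional[Tuple[str, str]] = None,
) -> Dict[str, List]:
    """Combine two column dicts, renaming only the columns that collide."""
    shared = set(left) & set(right)
    if shared and suffixes is None:
        raise ValueError(
            f"Columns {sorted(shared)} exist in both inputs; pass `suffixes`."
        )
    left_suf, right_suf = suffixes if suffixes is not None else ("", "")
    result = {(k + left_suf if k in shared else k): v for k, v in left.items()}
    result.update({(k + right_suf if k in shared else k): v for k, v in right.items()})
    return result


dp = {"text": ["The quick brown fox."], "label": [0]}
dp2 = {"string": ["The quick brown fox."], "target": [0]}
combined = append_columns(dp, dp2)  # no overlap, so no suffixes needed
```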
    
    opened by Priya2698 2
  • V1 Entity Data Panel

    Adds entity data panel in pipelines folder. Core ideas:

    • Data panel that has zero or more embedding columns
    • Data panel has index panel for functions like iget and icontains for the unique entity id
    • Supports appending, from_datapanel, and other data panel methods
    • Supports embedding-based functions (e.g., cosine nearest neighbors) that return the metadata.
    opened by lorr1 2
  • Dean/174 rename

    This PR has two parts:

    1. Implementing the DataFrame.rename function. This is an out of place operation that accepts a Dict or Callable mapper argument similar to Pandas. We only support renaming columns since renaming indexes is not applicable for meerkat.

    2. Fixing some issues with the Colab notebook. These mostly had to do with the way imagenette was being downloaded. There are still some minor issues here:

      • The imagenette.csv file has image paths such as train/n02979186/n02979186_9036.JPEG instead of imagenette2-160/train/n02979186/n02979186_9036.JPEG. The solution should probably be to go through the .csv file and add "imagenette2-160" before every line (I believe the file paths are coming straight from imagenette download?).
      • Should the files in the downloads folder be deleted after being extracted?
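
For the rename operation described in part 1, an out-of-place rename that accepts either a dict or a callable mapper can be sketched as below. This is a simplified illustration on a plain dict of lists, not the PR's implementation; `rename` and `Mapper` are hypothetical names.

```python
from typing import Callable, Dict, List, Union

Mapper = Union[Dict[str, str], Callable[[str], str]]


def rename(data: Dict[str, List], mapper: Mapper) -> Dict[str, List]:
    """Return a new dict of columns with names changed according to `mapper`."""
    if callable(mapper):
        new_name = mapper
    else:
        # Names missing from the dict are left unchanged.
        new_name = lambda name: mapper.get(name, name)
    return {new_name(name): column for name, column in data.items()}


table = {"text": ["a", "b"], "label": [0, 1]}
renamed_with_dict = rename(table, {"label": "target"})
renamed_with_callable = rename(table, str.upper)
```

Because a new dict is built, the original column names are untouched, which matches the out-of-place semantics mentioned above.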
    opened by dastratakos 1
  • [BUG] Downloading imagenet does not work

    Describe the bug Downloading imagenet like this does not work:

    dp = mk.datasets.get(
        "imagenet", 
        dataset_dir="/home/ec2-user/data/imagenet1k",
        download=True,
    )
    

    Fails with FileNotFoundError: [Errno 2] No such file or directory: '/home/ec2-user/data/imagenet1k/ILSVRC/ImageSets/CLS-LOC/train_cls.txt'

    To Reproduce Steps and code snippet that reproduce the behavior:

    See above

    Expected behavior Should download imagenet1k to the specified dataset_dir.

    System Information

    • OS: Linux
    • Versions for RG and relevant dependencies meerkat-ml (latest)

    opened by MaxFBurg 1
  • [BUG] Quickstart is not working - No module named 'meerkat.contrib'

    Describe the bug Cannot run the Quickstart successfully. The import produces the error: ModuleNotFoundError: No module named 'meerkat.contrib'

    To Reproduce

    import meerkat as mk
    from meerkat.contrib.imagenette import download_imagenette
    

    Expected behavior Quickstart runs successfully

    System Information

    • OS: Ubuntu 18.04.5 LTS
    • meerkat-ml (v0.2.5)
    opened by butterkaffee 0
  • [BUG] deepcopy corrupts block manager

    A call to copy.deepcopy on a datapanel corrupts _block_index of the columns:

    dp = mk.DataPanel({
        "a": pd.Series([0,1,2,3]),
        "b": pd.Series([0,1,2,3]),
        "c": pd.Series([0,1,2,3]),
    })
    dp.consolidate()
    print(dp["a"]._block_index)
    
    import copy
    
    dp = copy.deepcopy(dp)
    print(dp["a"]._block_index)
    
    opened by seyuboglu 0
  • [BUG] Check for empty examples in AudioSet

    There are some examples in AudioSet whose start time and end time are outside of the length of the video. For example,

    balanced_train_segments/YTID=kKf9OprN9nw_st=400.0_et=410.wav

    When creating the AudioSet DataPanel we should check for this and remove those rows.
    opened by seyuboglu 0
  • [FEATURE] Add caching functionality to LambdaColumn

    What I'm envisioning is something in between a map and a LambdaColumn, where the computation happens lazily but is cached once it's computed. Right now, either you do it all up front or you don't get caching.

    This idea was raised by @ANarayan, who pointed out that it would be helpful for caching feature preprocessing in NLP pipelines.
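
The "lazy but cached" behavior described above can be sketched with a small wrapper class. This is only an illustration of the idea, not a proposal for the actual LambdaColumn API: `CachedLambdaColumn` and `tokenize` are hypothetical, and the cache is a per-row in-memory dict.

```python
from typing import Any, Callable, Dict, List


class CachedLambdaColumn:
    """Computes `fn(item)` on first access and reuses the result afterwards."""

    def __init__(self, data: List[Any], fn: Callable[[Any], Any]):
        self.data = data
        self.fn = fn
        self._cache: Dict[int, Any] = {}

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, index: int) -> Any:
        if index not in self._cache:
            # Lazy: the function runs only when a row is first requested.
            self._cache[index] = self.fn(self.data[index])
        return self._cache[index]


calls: List[str] = []


def tokenize(sentence: str) -> List[str]:
    calls.append(sentence)  # record each real computation
    return sentence.lower().split()


column = CachedLambdaColumn(["The quick brown fox.", "Jumped over."], tokenize)
first = column[0]
again = column[0]  # served from the cache; `tokenize` is not called again
```

Rows that are never accessed are never computed, and rows accessed repeatedly are computed once.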

    opened by seyuboglu 0
Releases(v0.2.5)
  • v0.2.5(Jul 22, 2022)

    What's Changed

    • Release: v0.2.2 by @krandiash in https://github.com/robustness-gym/meerkat/pull/191
    • Release: v0.2.3 by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/200
    • Audioset DataPanel by @Priya2698 in https://github.com/robustness-gym/meerkat/pull/229
    • Fix issue where old datapanels are missing have formatter state by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/233
    • Make AudioSet DataPanels relational by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/235
    • Add coco, mir, and pascal by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/239
    • Make write only write columns in datapanel by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/240
    • Enforce contiguous index in pandas columns by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/244
    • Fix issue where ray pickle fails on lazy loader by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/245
    • Feature/groupby basegroupby by @sam-randall in https://github.com/robustness-gym/meerkat/pull/242
    • Reorganize the implementation of datasets by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/246
    • Add support for persistent configuration by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/247
    • Implement sort for data panel and columns by @hannahkim24 in https://github.com/robustness-gym/meerkat/pull/237
    • Add emb module by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/249
    • Reorganize ops code by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/250
    • Add sample by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/251
    • Add several HAPI datasets by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/252
    • Update styling of docs by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/253
    • Bump to version Release: vx.y.z by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/254

    New Contributors

    • @sam-randall made their first contribution in https://github.com/robustness-gym/meerkat/pull/242
    • @hannahkim24 made their first contribution in https://github.com/robustness-gym/meerkat/pull/237

    Full Changelog: https://github.com/robustness-gym/meerkat/compare/v0.2.4...v0.2.5

  • v0.2.4(Feb 17, 2022)

    What's Changed

    • Update contributing to support new dev main structure by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/203
    • Add args, kwargs to ColumnIOMixin._read_data by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/204
    • Fix from_huggingface and add tests by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/205
    • Minor fix by @khaledsaab in https://github.com/robustness-gym/meerkat/pull/206
    • Add downloader to ImageColumn by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/207
    • Remove default addition of index by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/208
    • Add DEW contrib to registry by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/209
    • Catch ConnectionResetError by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/210
    • Add inaturalist to contrib by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/211
    • Fix issue where arraycolumns can't be saved with jsonlines by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/214
    • Update the docs and add user guide. by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/215
    • Add contrib for enron email dataset by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/217
    • Fix PIL attribute error on list and lambda column representations by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/218
    • mmap path bug fix by @khaledsaab in https://github.com/robustness-gym/meerkat/pull/219
    • Downgrade pytorch dependency bound by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/220
    • Fix issue with subclassing DataPanel._state_keys by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/224
    • Use multiple slices instead of pa.Table.take in ArrowBlock by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/226
    • Fix issue where boolean list can't index by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/227
    • Add support for AudioColumn by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/222
    • Add waterbirds contrib by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/228
    • Add guide to indexing and slicing by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/225
    • Docs/build fix by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/230
    • Bump version by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/231

    Full Changelog: https://github.com/robustness-gym/meerkat/compare/v0.2.3...v0.2.4

  • v0.2.3(Nov 19, 2021)

    What's Changed

    • Release: v0.2.1 by @krandiash in https://github.com/robustness-gym/meerkat/pull/171
    • Bump version to 0.2.2 by @krandiash in https://github.com/robustness-gym/meerkat/pull/190
    • Delete nn by @krandiash in https://github.com/robustness-gym/meerkat/pull/192
    • Add support for loading train and test set in cifar10" by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/193
    • Fix issue where tensor columns can't be indexed with pandas series by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/195
    • Update cifar10 to support test set too by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/196
    • Fix bacckwards compat issue with base_dir and GCSImageColumn by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/197
    • Support backwards compatibility with nn by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/198
    • Bump version by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/199

    Full Changelog: https://github.com/robustness-gym/meerkat/compare/v0.2.2...v0.2.3

  • v0.2.2(Nov 12, 2021)

    What's Changed

    • Release: v0.2.0 by @krandiash in https://github.com/robustness-gym/meerkat/pull/120
    • Callbacks by @Priya2698 in https://github.com/robustness-gym/meerkat/pull/168
    • Add support for ArrowArrayColumns by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/173
    • Add dataset registry by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/176
    • Make logging initialization robust to permissions by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/179
    • Fix datasets download bug by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/180
    • Add support for datasets.names and datasets.catalog by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/181
    • Update celeba download by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/183
    • Add support for base_dir in image column by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/184
    • Add meerkatloader for loading meerkat modules from yaml file by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/185
    • Fix issue where datapanel visualizations only show floats by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/186
    • Move readme assets by @seyuboglu in https://github.com/robustness-gym/meerkat/pull/187
    • Update Spacy column by @krandiash in https://github.com/robustness-gym/meerkat/pull/189

    Full Changelog: https://github.com/robustness-gym/meerkat/compare/v0.2.1...v0.2.2

Owner
Robustness Gym
Building tools for evaluating and repairing ML models.