Python Package for DataHerb: create, search, and load datasets.

Overview


Markdownify
The Python Package for DataHerb

A DataHerb Core Service to Create and Load Datasets.

Install

pip install dataherb

Documentation: dataherb.github.io/dataherb-python

The DataHerb Command-Line Tool

Requires Python 3

The DataHerb cli provides tools to create dataset metadata, validate metadata, search dataset in flora, and download dataset.

Search and Download

Search by keyword

dataherb search covid19
# Shows the minimal metadata

Search by dataherb id

dataherb search -i covid19_eu_data
# Shows the full metadata

Download dataset by dataherb id

dataherb download covid19_eu_data
# Downloads this dataset: http://dataherb.io/flora/covid19_eu_data

Create Dataset Using Command Line Tool

We provide a template for dataset creation.

Within a dataset folder where the data files are located, use the following command line tool to create the metadata template.

dataherb create

Upload dataset to remote

Within the dataset folder, run

dataherb upload

UI for all the datasets in a flora

dataherb serve

Use DataHerb in Your Code

Load Data into DataFrame

# Load the package
from dataherb.flora import Flora

# Initialize Flora service
# The Flora service holds all the dataset metadata
use_flora = "path/to/my/flora.json"
dataherb = Flora(flora=use_flora)

# Search datasets with keyword(s)
geo_datasets = dataherb.search("geo")
print(geo_datasets)

# Get a specific file from a dataset and load as DataFrame
tz_df = pd.read_csv(
  dataherb.herb(
      "geonames_timezone"
  ).get_resource(
      "dataset/geonames_timezone.csv"
  )
)
print(tz_df)

The DataHerb Project

What is DataHerb

DataHerb is an open-source data discovery and management tool.

  • A DataHerb or Herb is a dataset. A dataset comes with the data files, and the metadata of the data files.
  • A Herb Resource or Resource is a data file in the DataHerb.
  • A Flora is the combination of all the DataHerbs.

In many data projects, finding the right datasets to enhance your data is one of the most time consuming part. DataHerb adds flavor to your data project. By creating metadata and manage the datasets systematically, locating an dataset is much easier.

Currently, dataherb supports sync dataset between local and S3/git. Each dataset can have its own remote location.

What is DataHerb Flora

We desigined the following workflow to share and index open datasets.

DataHerb Workflow

The repo dataherb-flora is a demo flora that lists some datasets and demonstrated on the website https://dataherb.github.io. At this moment, the whole system is being renovated.

Development

  1. Create a conda environment.
  2. Install requirements: pip install -r requirements.txt

Documentation

The source of the documentation for this package is located at docs.

References and Acknolwedgement

  • dataherb uses datapackage in the core. datapackage is a python library for the data-package standard. The core schema of the dataset is essentially the data-package standard.
Comments
  • would you like to take a look at our api?

    would you like to take a look at our api?

    I come across this repo and found it very similar to our API, though much more mature. https://github.com/Glacier-Ice/data-sci-api

    we have problems in creating a standard of dataset collection and API documentation for end-users

    is there a way we can collaborate?

    opened by Stockard 4
  • Format search results for better ux

    Format search results for better ux

    The current search result shows too much information. It would be good to format the result into a way that is easier to read and get the id if needed.

    enhancement 
    opened by emptymalei 1
  • use rapidfuzz instead of fuzzywuzzy

    use rapidfuzz instead of fuzzywuzzy

    FuzzyWuzzy is GPLv2 licensed which would force you to licence the whole project under GPLv2. I had the same problem on one of my projects and so I wrote rapidfuzz which is implementing the same algorithm but is based on a version of fuzzywuzzy that was MIT Licensed and is therefor MIT Licensed aswell, so it can be used in here without forcing a License change. As a nice bonus it is fully implemented in C++ and comes with a few Algorithmic improvements making it faster than FuzzyWuzzy.

    opened by maxbachmann 1
  • Use One File for Each Herb in Flora

    Use One File for Each Herb in Flora

    Is it better to have one file for each herb in flora?

    Situition

    Currently, the flora is defined in a single json file.

    • It becomes hard to read. This is not fitting into the human-readable principle.
    • It becomes hard to manage. We are currently sorting everything in the big file. When we have a problem, the whole flora will be unusable.

    Solution

    Use separate files for herbs.

    Simply Copy dataherb.json

    • Copy dataherb.json to workdir/{id}/dataherb.json or {id}.json will work.

      • Using folders allows us to put in more files. For example, we can take datapackage content out to make it more managable.
    • Build the flora from all these files.

    • [x] Implement this new structure.

    Ready for a Demo repo of flora

    In this way, we can put up a repo for open datasets easily and allow users to add more easily.

    Possible creating process

    • Create package directly on GitHub by uploading the dataherb.json file.

      • But there should be a validation process to avoid duplicate id.
    • [ ] Setup a demo repo as demo flora.

    enhancement 
    opened by emptymalei 0
  • Overhaul: New Core Management, Local Indexing Webpage, Flexible Flora Database

    Overhaul: New Core Management, Local Indexing Webpage, Flexible Flora Database

    This is a completely new era of Dataherb.

    New Stuff

    • Supporting S3 as source
    • Serve whole flora as webpages with search
    • User config for flora
    • Multiple flora on one machine

    We also redesigned the core.

    opened by emptymalei 0
  • Add dataset using the URL of a remote repo

    Add dataset using the URL of a remote repo

    We don't only upload datasets, we might also want to load datasets from remote.

    Here we propose to add the option to add datasets using the URL.

    • Build a Herb from remote data
    • Option to add metadata only or download everything.
      • Adding metadata only will only add data to the flora
      • Thus we can not find the dataset folder with the corresponding id.
      • This can be used to decide if a dataset is metadata only or fully downloaded.
    opened by emptymalei 0
  • Sync Flora Metafolder

    Sync Flora Metafolder

    Managing flora using command line

    Version control of the flora is not really hard. We just get into the folder and use git.

    But it would be much easier if we can simply run dataherb sync flora


    Approaches:

    enhancement 
    opened by emptymalei 0
Releases(0.1.6)
  • 0.1.6(Feb 10, 2022)

    Fixed

    • Command line tool dataherb configure -l now only opens the folder.
    • Command line too dataherb download will also display where the dataset is downloaded to. This makes it easier for the user to find the downloaded dataset.
    Source code(tar.gz)
    Source code(zip)
  • 0.1.5(Aug 12, 2021)

    Using Dedicated Folders for Herbs

    In the previous versions, we can only use a single file to host all the flora metadata. It will become unmanageable and hard to read as the number of herbs grows. (#14)

    In this version, we introduce a new structure for the flora metadata. Each herb is getting its own folder! This structure makes it easier for us to read and manage by hand. It is also better for version-controling your flora.

    (๐ŸŒฑ Best wishes to your herbs in their own pots. )

    Source code(tar.gz)
    Source code(zip)
  • 0.1.4(Aug 7, 2021)

  • 0.1.3(Aug 7, 2021)

  • 0.0.5(Mar 14, 2020)

  • 0.0.3(Feb 23, 2020)

    dataherb command line tool now automatically finds the data files and generate part of the metadata based on the files. CSV files are automatically parsed.

    Source code(tar.gz)
    Source code(zip)
Owner
DataHerb
Get datasets in a blink of an eye | Experimenting with simple modular small dataset discovery
DataHerb
Methylation/modified base calling separated from basecalling.

Remora Methylation/modified base calling separated from basecalling. Remora primarily provides an API to call modified bases for basecaller programs s

Oxford Nanopore Technologies 72 Jan 05, 2023
CRISP: Critical Path Analysis of Microservice Traces

CRISP: Critical Path Analysis of Microservice Traces This repo contains code to compute and present critical path summary from Jaeger microservice tra

Uber Research 110 Jan 06, 2023
Program that predicts the NBA mvp based on data from previous years.

NBA MVP Predictor A machine learning model using RandomForest Regression that predicts NBA MVP's using player data. Explore the docs ยป View Demo ยท Rep

Muhammad Rabee 1 Jan 21, 2022
Tokyo 2020 Paralympics, Analytics

Tokyo 2020 Paralympics, Analytics Thanks for checking out my app! It was built entirely using matplotlib and Tokyo 2020 Paralympics data. This applica

Petro Ivaniuk 1 Nov 18, 2021
Anomaly Detection with R

AnomalyDetection R package AnomalyDetection is an open-source R package to detect anomalies which is robust, from a statistical standpoint, in the pre

Twitter 3.5k Dec 27, 2022
Meltano: ELT for the DataOps era. Meltano is open source, self-hosted, CLI-first, debuggable, and extensible.

Meltano is open source, self-hosted, CLI-first, debuggable, and extensible. Pipelines are code, ready to be version c

Meltano 625 Jan 02, 2023
PrimaryBid - Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift

Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift This project is composed of two parts: Part1 and Part2

Emmanuel Boateng Sifah 1 Jan 19, 2022
A Python package for the mathematical modeling of infectious diseases via compartmental models

A Python package for the mathematical modeling of infectious diseases via compartmental models. Originally designed for epidemiologists, epispot can be adapted for almost any type of modeling scenari

epispot 12 Dec 28, 2022
A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

This tutorial's purpose is to introduce Pythonistas to methods for scaling their data science and machine learning work to larger datasets and larger models, using the tools and APIs they know and lo

Coiled 102 Nov 10, 2022
API>local_db>AWS_RDS - Disclaimer! All data used is for educational purposes only.

APIlocal_dbAWS_RDS Disclaimer! All data used is for educational purposes only. ETL pipeline diagram. Aim of project By creating a fully working pipe

0 Apr 25, 2022
fds is a tool for Data Scientists made by DAGsHub to version control data and code at once.

Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc

DAGsHub 359 Dec 22, 2022
follow-analyzer helps GitHub users analyze their following and followers relationship

follow-analyzer follow-analyzer helps GitHub users analyze their following and followers relationship by providing a report in html format which conta

Yin-Chiuan Chen 2 May 02, 2022
Weather analysis with Python, SQLite, SQLAlchemy, and Flask

Surf's Up Weather analysis with Python, SQLite, SQLAlchemy, and Flask Overview The purpose of this analysis was to examine weather trends (precipitati

Art Tucker 1 Sep 05, 2021
2019 Data Science Bowl

Kaggle-2019-Data-Science-Bowl-Solution - Here i present my solution to kaggle 2019 data science bowl and how i improved it to win a silver medal in that competition.

Deepak Nandwani 1 Jan 01, 2022
Python package for analyzing behavioral data for Brain Observatory: Visual Behavior

Allen Institute Visual Behavior Analysis package This repository contains code for analyzing behavioral data from the Allen Brain Observatory: Visual

Allen Institute 16 Nov 04, 2022
Synthetic data need to preserve the statistical properties of real data in terms of their individual behavior and (inter-)dependences

Synthetic data need to preserve the statistical properties of real data in terms of their individual behavior and (inter-)dependences. Copula and functional Principle Component Analysis (fPCA) are st

32 Dec 20, 2022
Monitor the stability of a pandas or spark dataframe โš™๏ธŽ

Population Shift Monitoring popmon is a package that allows one to check the stability of a dataset. popmon works with both pandas and spark datasets.

ING Bank 403 Dec 07, 2022
Spaghetti: an open-source Python library for the analysis of network-based spatial data

pysal/spaghetti SPAtial GrapHs: nETworks, Topology, & Inference Spaghetti is an open-source Python library for the analysis of network-based spatial d

Python Spatial Analysis Library 203 Jan 03, 2023
Python implementation of Principal Component Analysis

Principal Component Analysis Principal Component Analysis (PCA) is a dimension-reduction algorithm. The idea is to use the singular value decompositio

Ignacio Darago 1 Nov 06, 2021
Detailed analysis on fraud claims in insurance companies, gives you information as to why huge loss take place in insurance companies

Insurance-Fraud-Claims Detailed analysis on fraud claims in insurance companies, gives you information as to why huge loss take place in insurance com

1 Jan 27, 2022