Python Implementation of Scalable In-Memory Updatable Bitmap Indexing

Related tags

Data AnalysisPyUpBit
Overview

Contributors Forks Stargazers Issues LinkedIn


PyUpBit

CS490 Large Scale Data Analytics — Implementation of Updatable Compressed Bitmap Indexing
Paper

Table of Contents
  1. About The Project
  2. Usage
  3. Contact
  4. Acknowledgements

About The Project

Bitmaps are common data structures used in database implemen- tations due to having fast read performance. Often they are used in applications in need of common equality and selective range queries. Essentially, they store a bit-vector for each value in the domain of each attribute to keep track of large scale data files. How- ever, the main drawbacks associated with bitmap indexes are its encoding and decoding performances of bit-vectors. Currently the state of art update-optimized bitmap index, update conscious bitmaps, are able to support extremely efficient deletes and have improved update speeds by treating updates as delete then insert. Update conscious bitmaps make use of an additional bit-vector, called the existence bit-vector, to keep track of whether or not a value has been updated. By initializing all values of the existence bit-vector to 1, the data for each attribute associated with each row in the existence bit-vector is validated and presented. If a value needs to be deleted, the corresponding row in the existence bit-vector gets changed to 0, invalidating any data associated with that row. This new method in turn allows for very efficient deletes. To add on, updates are then performed as a delete operation, then an insert operation in to the end of the bit-vector. However, update conscious bitmaps do not scale well with more data. As more and more data gets updated and inserted, the run time increases significantly as well. Because update queries are out-of- place and increase size of vectors, read queries become increasingly expensive and time consuming. Furthermore, as the number of updates and deletes increases, the bit-vector becomes less and less compressible. This brings us to updateable Bitmaps (UpBit). According to the paper, UpBit: Scalable In-Memory Updatable Bitmap Indexing, re- searchers Manos Athanassoulis, Zheng Yan, and Stratos Idreos developed a new bitmap structure that improved the write per- formance of bitmaps without sacrificing read performance. The main differentiating point of UpBit is its use of an update bit vector for every value in the domain of an attribute that keeps track of updated values. This allows for faster write performance without sacrificing read performance. Based on this paper, we implemented UpBit and compared it to our implementation of update conscious bitmaps to compare and test the performances of both methods.

Usage

We used PyCharm to conduct our tests, /ucb, /upbit for algorithms, /tests for running testing scripts, and rest of the files for compression for memory usage improvement as well as creating and visualizing data.

Contact

Daniel Park - @h1yung - [email protected]

Acknowledgements

  • Original Paper
  • Winston Chen
  • Gregory Chininis
  • Daniel Hooks
  • Michael Lee
Owner
Hyeong Kyun (Daniel) Park
I like coding
Hyeong Kyun (Daniel) Park
Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis. You write a high level configuration file specifying your in

Blue Collar Bioinformatics 917 Jan 03, 2023
Approximate Nearest Neighbor Search for Sparse Data in Python!

Approximate Nearest Neighbor Search for Sparse Data in Python! This library is well suited to finding nearest neighbors in sparse, high dimensional spaces (like text documents).

Meta Research 906 Jan 01, 2023
PyClustering is a Python, C++ data mining library.

pyclustering is a Python, C++ data mining library (clustering algorithm, oscillatory networks, neural networks). The library provides Python and C++ implementations (C++ pyclustering library) of each

Andrei Novikov 1k Jan 05, 2023
Intake is a lightweight package for finding, investigating, loading and disseminating data.

Intake: A general interface for loading data Intake is a lightweight set of tools for loading and sharing data in data science projects. Intake helps

Intake 851 Jan 01, 2023
Codes for the collection and predictive processing of bitcoin from the API of coinmarketcap

Codes for the collection and predictive processing of bitcoin from the API of coinmarketcap

Teo Calvo 5 Apr 26, 2022
WithPipe is a simple utility for functional piping in Python.

A utility for functional piping in Python that allows you to access any function in any scope as a partial.

Michael Milton 1 Oct 26, 2021
Show you how to integrate Zeppelin with Airflow

Introduction This repository is to show you how to integrate Zeppelin with Airflow. The philosophy behind the ingtegration is to make the transition f

Jeff Zhang 11 Dec 30, 2022
SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

SNV Pipeline SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

East Genomics 1 Nov 02, 2021
Statistical & Probabilistic Analysis of Store Sales, University Survey, & Manufacturing data

Statistical_Modelling Statistical & Probabilistic Analysis of Store Sales, University Survey, & Manufacturing data Statistical Methods for Decision Ma

Avnika Mehta 1 Jan 27, 2022
pipeline for migrating lichess data into postgresql

How Long Does It Take Ordinary People To "Get Good" At Chess? TL;DR: According to 5.5 years of data from 2.3 million players and 450 million games, mo

Joseph Wong 182 Nov 11, 2022
statDistros is a Python library for dealing with various statistical distributions

StatisticalDistributions statDistros statDistros is a Python library for dealing with various statistical distributions. Now it provides various stati

1 Oct 03, 2021
DenseClus is a Python module for clustering mixed type data using UMAP and HDBSCAN

DenseClus is a Python module for clustering mixed type data using UMAP and HDBSCAN. Allowing for both categorical and numerical data, DenseClus makes it possible to incorporate all features in cluste

Amazon Web Services - Labs 53 Dec 08, 2022
Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

2 Nov 20, 2021
Get mutations in cluster by querying from LAPIS API

Cluster Mutation Script Get mutations appearing within user-defined clusters. Usage Clusters are defined in the clusters dict in main.py: clusters = {

neherlab 1 Oct 22, 2021
Pipeline to convert a haploid assembly into diploid

HapDup (haplotype duplicator) is a pipeline to convert a haploid long read assembly into a dual diploid assembly. The reconstructed haplotypes

Mikhail Kolmogorov 50 Jan 05, 2023
DataPrep — The easiest way to prepare data in Python

DataPrep — The easiest way to prepare data in Python

SFU Database Group 1.5k Dec 27, 2022
General Assembly's 2015 Data Science course in Washington, DC

DAT8 Course Repository Course materials for General Assembly's Data Science course in Washington, DC (8/18/15 - 10/29/15). Instructor: Kevin Markham (

Kevin Markham 1.6k Jan 07, 2023
Bamboolib - a GUI for pandas DataFrames

Community repository of bamboolib bamboolib is joining forces with Databricks. For more information, please read our announcement. Please note that th

Tobias Krabel 863 Jan 08, 2023
Python implementation of Principal Component Analysis

Principal Component Analysis Principal Component Analysis (PCA) is a dimension-reduction algorithm. The idea is to use the singular value decompositio

Ignacio Darago 1 Nov 06, 2021
CINECA molecular dynamics tutorial set

High Performance Molecular Dynamics Logging into CINECA's computer systems To logon to the M100 system use the following command from an SSH client ss

J. W. Dell 0 Mar 13, 2022