MongoDB utility to inflate the contents of a small collection into a new, larger collection

Overview

MongoDB Data Inflater ("data-inflater")

The data-inflater tool is a MongoDB utility to automate the creation of a new large database collection using data sourced from an existing smaller database collection.

By default, the utility will use the Atlas 'sample data set' database collection sample_mflix.movies as the source collection. However, most users will provide parameters to the utility to specify the use of their own database and source collection. If you do want to use the Atlas sample data set, see the sample data manual page for more information.

The data-inflater utility issues multiple concurrent aggregation processes, each copying batches of records in parallel for increased performance. The resulting collection will contain documents with duplicated data but with new unique _id field values. The spread of field values in the new collection will approximately mirror the spread in the source collection. Therefore, you should ensure the source collection contains at least a few different documents, and ideally a few hundred or a few thousand.
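The copy approach described above can be sketched roughly as follows. This is a minimal illustration, not the utility's actual code: the function names (build_copy_pipeline, plan_batches, run_copies) are hypothetical, and the pipeline shape ($limit, then $unset of _id, then $merge) is an assumption about one way to duplicate documents while letting the server assign fresh _id values. The source argument is anything with an aggregate() method, such as a PyMongo collection.

```python
from concurrent.futures import ThreadPoolExecutor


def build_copy_pipeline(target_coll, batch_size):
    """Build a pipeline that copies up to batch_size source documents into target_coll."""
    return [
        {"$limit": batch_size},
        # With _id removed, the server generates a new unique _id for each inserted copy.
        {"$unset": "_id"},
        {"$merge": {"into": target_coll}},
    ]


def plan_batches(source_count, target_count):
    """Return the size of each copy pass needed to reach target_count documents."""
    if source_count < 1 or target_count < 1:
        raise ValueError("both counts must be at least 1")
    full_passes, remainder = divmod(target_count, source_count)
    sizes = [source_count] * full_passes
    if remainder:
        sizes.append(remainder)
    return sizes


def run_copies(source, target_coll, sizes, workers=4):
    """Run one aggregation per batch concurrently and return the total requested size."""
    def copy(size):
        # A $merge pipeline yields no result documents; list() just drains the cursor.
        list(source.aggregate(build_copy_pipeline(target_coll, size)))
        return size

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(copy, sizes))
```

For instance, plan_batches(3, 10) yields the pass sizes [3, 3, 3, 1], so a three-document source would be copied in full three times and partially once.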

If you are running a sharded cluster, the utility will ensure the target collection is sharded with a shard key and, where it can, will pre-split the chunks to avoid needless balancer overhead later. For example, if you specify the --shardkey parameter to reference a field (e.g. product_name) as the range-based shard key, the utility will introspect the spread of values for that field before creating the target collection. It will then create pre-split chunks in the new, empty target collection before any data is copied to it, to maximise performance.
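To illustrate the idea of deriving pre-split points from the spread of shard key values, here is a small sketch. It assumes the key values have already been sampled into a Python list; the function name choose_split_points and the evenly spaced selection are illustrative assumptions, and the utility's own introspection logic may differ.

```python
def choose_split_points(sampled_keys, num_chunks):
    """Pick up to num_chunks - 1 boundary values that divide the sorted sample evenly."""
    if num_chunks < 2 or not sampled_keys:
        return []
    ordered = sorted(sampled_keys)
    points = []
    for i in range(1, num_chunks):
        candidate = ordered[(i * len(ordered)) // num_chunks]
        # Skip repeats and the minimum value: neither makes a useful chunk boundary.
        if candidate != ordered[0] and (not points or candidate != points[-1]):
            points.append(candidate)
    return points
```

Each returned value could then be supplied as the middle document of MongoDB's split admin command (for example {"product_name": value}) against the empty target collection. A heavily skewed sample yields fewer split points than requested, because repeated values are skipped.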

How To Run

In a running MongoDB cluster (self-managed or running in Atlas), ensure you have created and populated a source collection with at least one sample record in it (ideally more with varying values for the fields across the different documents to reflect the shape and variance you desire).
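If you do not yet have a source collection, a few varied documents can be generated along these lines. This is only a sketch: the function name make_sample_docs and the field names and values (product_name, price, in_stock) are hypothetical placeholders for your own data shape.

```python
import random


def make_sample_docs(count, seed=0):
    """Generate count documents whose field values vary from one document to the next."""
    rng = random.Random(seed)
    names = ["widget", "gadget", "sprocket", "gizmo"]  # hypothetical example values
    return [
        {
            "product_name": f"{rng.choice(names)}-{i}",
            "price": round(rng.uniform(1, 500), 2),
            "in_stock": rng.random() < 0.8,
        }
        for i in range(count)
    ]
```

With PyMongo installed, the resulting list can be passed to insert_many() on the source collection, for example MongoClient(url)["mydb"]["mySrcColl"].insert_many(make_sample_docs(1000)).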

Ensure Python 3 (version 3.8 or greater) and the MongoDB Python Driver (PyMongo) are already installed on your workstation. For example, to install PyMongo:

pip3 install --user pymongo

Ensure the .py script is executable and then execute the following to view the utility's help instructions and the full list of parameters that you can provide:

./data-inflater.py -h

Execute the following to connect to a locally running single server database (default port) to copy and expand the data from an existing source collection, mydb.mySrcColl, to a new collection, mydb.myDestColl, which will contain 1 million records:

./data-inflater.py --url 'mongodb://localhost:27017' -d 'mydb' -c 'mySrcColl' -t 'myDestColl' -s 1000000

Execute the following to connect to an Atlas cluster (ensure you've already loaded the Atlas sample data set), to inflate the data from the source movies collection to the new movies_big collection, which will contain 100 million records (note, first change the URL username, password and hostname shown, to match the URL of your Atlas cluster):

./data-inflater.py --url 'mongodb+srv://usr:[email protected]/'
Owner
Paul Done