Python script for transferring data between three drives in two separate stages

Last update: Nov 10, 2021

Overview

Waterlock

Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs hash verification and persistently tracks data transfer progress using SQLite.

I am not responsible for any lost data. This was an evening coding project. Use at your own discretion.

Use Case & Features

Waterlock was designed for moving files from one computer (e.g. your home server) to an intermediary drive (e.g. a portable hard drive), and then from that drive to another computer (e.g. an offsite backup server).

It will fill the intermediary drive with as many files as it can, aside from a user-configurable amount of reserve-space.
It computes a BLAKE2 checksum for every file copy and compares it with the initial hash value stored in the SQLite database to ensure that the data has not been corrupted.
It uses a SQLite database to track what data has been moved. As a result, you can incrementally move data from one location to another with minimal user input.
Every time Waterlock is run on the source location, it will check for any files that have been recently modified (based on timestamp, not hash). Any modified files will have their hash & modification timestamps updated in the database, in addition to being marked as unmoved such that they are transferred again and updated. Note that Waterlock does not version files. Nevertheless, silently corrupted files should theoretically not be transferred over unless their modification timestamp has been adjusted.
Every time Waterlock is run on the source location, it will check for any files that were previously moved to the intermediary drive but did not reach the destination. If these files are no longer on the intermediary drive (due to accidental deletion, for instance), Waterlock will copy them to the intermediary drive again.
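The first-stage behaviour described above (fill the drive up to the reserve, hash each copy, record progress in SQLite, re-copy files whose timestamp changed) can be sketched roughly as follows. This is a minimal illustration and not Waterlock's actual code: the table layout and the names blake2_hash, open_tracking_db and load_cargo are assumptions made for the example, and it only covers the source-to-intermediary stage.

```python
import hashlib
import shutil
import sqlite3
from pathlib import Path


def blake2_hash(path, chunk_size=1024 * 1024):
    """Return the BLAKE2b hex digest of a file, reading it in chunks."""
    digest = hashlib.blake2b()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def open_tracking_db(db_path):
    """Open (or create) a hypothetical tracking table; not Waterlock's schema."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, hash TEXT, mtime REAL, moved INTEGER DEFAULT 0)"
    )
    return conn


def load_cargo(conn, source_dir, cargo_dir, reserved_bytes):
    """Copy new or modified files into cargo_dir without touching the reserve."""
    source_dir, cargo_dir = Path(source_dir), Path(cargo_dir)
    cargo_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir).as_posix()
        stat = src.stat()
        row = conn.execute(
            "SELECT mtime, moved FROM files WHERE path = ?", (rel,)
        ).fetchone()
        # Change detection is timestamp-based, as in the description above.
        if row is not None and row[0] == stat.st_mtime and row[1] == 1:
            continue
        # Skip files that would eat into the reserved free space.
        if shutil.disk_usage(cargo_dir).free - stat.st_size < reserved_bytes:
            continue
        expected = blake2_hash(src)
        dst = cargo_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        # Verify the copy against the hash taken from the source file.
        if blake2_hash(dst) != expected:
            dst.unlink()
            raise IOError(f"hash mismatch after copying {rel}")
        conn.execute(
            "INSERT OR REPLACE INTO files (path, hash, mtime, moved) "
            "VALUES (?, ?, ?, 1)",
            (rel, expected, stat.st_mtime),
        )
        conn.commit()
        copied.append(rel)
    return copied
```

Committing after every file means an interrupted run can resume where it stopped, which is the point of tracking progress in a database rather than in memory.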

Example Use Case: I use Waterlock to transfer files that are too large to send over the network to an offsite backup location at a relative's house. Each time I visit, I run the script on my home server to load the external drive, then run it again on the offsite backup server.

Usage

Change the settings at the top of the script, using absolute file paths. While relative paths may work, they are more error prone due to string formatting issues. Store the script on the intermediary drive itself and run it from there. It will automatically create waterlock.db and a cargo folder where the data will be stored. Note that after the final transfer to the destination, Waterlock will not delete data on the intermediary drive.

python waterlock.py

If you are familiar with Python, you can also fully verify all the files on the middle or destination drives to ensure that their hashes match what is stored in the database. This is done using two additional methods of the Waterlock class, verify_middle() and verify_destination(). The code to verify files on the destination would be as follows:

if __name__ == "__main__":
    wl = Waterlock(
        source_directory=source_directory,
        end_directory=end_directory,
        reserved_space=reserved_space,
    )
    wl.start()
    # Re-hash every file on the destination and compare it with waterlock.db
    wl.verify_destination()
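For readers curious what such a verification pass involves, here is a rough sketch of the idea: re-hash each tracked file and compare the result with the hash stored in SQLite. It is not the body of verify_destination(); the files table with path and hash columns and the names file_blake2 and verify_directory are assumptions made for the example.

```python
import hashlib
import sqlite3
from pathlib import Path


def file_blake2(path, chunk_size=1024 * 1024):
    """Return the BLAKE2b hex digest of a file, reading it in chunks."""
    digest = hashlib.blake2b()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_directory(conn, root_dir):
    """Return tracked relative paths that are missing or fail the hash check.

    Assumes a hypothetical table files(path TEXT, hash TEXT) in the database.
    """
    root = Path(root_dir)
    failures = []
    rows = conn.execute("SELECT path, hash FROM files ORDER BY path").fetchall()
    for rel, expected in rows:
        target = root / rel
        if not target.is_file() or file_blake2(target) != expected:
            failures.append(rel)
    return failures
```

An empty list means every tracked file is present and matches its stored hash. A full pass reads every byte on the drive, so it can take a while on large backups.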

Why 'Waterlock'?

It is named Waterlock after marine locks used to move ships through waterways of different water levels in multiple stages.

Owner

David Swanlund
