A Python module and command line utility for working with web archive data using the WACZ format specification

Overview

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.

Install

Use pip to install the module and a command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz 
   

   

The resulting myfile.wacz should be loadable via ReplayWeb.page.

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the wacz being created

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates pages.jsonl page index with a full-text index, must be run in conjunction with --detect-pages. Will have no effect if run alone

wacz create tests/fixtures/example-collection.warc -t

--detect-pages

Generates pages.jsonl page index without a full-text index

wacz create tests/fixtures/example-collection.warc --detect-pages

-p --pages

Overrides the pages index generation with the passed jsonl pages.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl

-t --text

You can add a full text index by including the --text tag

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text

--ts

Overrides the ts metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the titles metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --desc DESC

--hash-type

Allows the user to specify the hash type used: (sha256 or md5):

wacz create tests/fixtures/example-collection.warc --hash-type md5

Validate

You can also validate an existing WACZ file by running:

wacz validate myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f tests/fixtures/example-collection.warc

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests
Owner
Webrecorder
Webrecorder Project
Webrecorder
Dart Version Manager CLI implemented with Python and Typer.

Dart Version Manager CLI implemented with Python and Typer.

EducUp 6 Jun 26, 2022
🗃️ Fileio-cli wrapper for fileioapi.py with fire.py, inspiration DOS

🗃️ File.io File.io simply upload a file, share the link, and after it is downloaded, the file is completely deleted. An API wrapper for the file.io w

nkot56297 2 May 12, 2022
Command line util for grep.app - Search across a half million git repos

grepgithub Command line util for grep.app - Search across a half million git repos Grepgithub uses grep.app API to search GitHub repositories, providi

Nenad Popovic 18 Dec 28, 2022
gcptree - Like the unix tree command but for GCP Org Heirarchy

gcptree Like the unix tree command but for GCP Org Heirarchy. For a note on coloring, the org node is green, folders and blue, and projects that are n

Ryan Canty 25 Sep 06, 2022
Apple Silicon 'top' CLI

asitop pip install asitop What A nvtop/htop style/inspired command line tool for Apple Silicon (aka M1) Macs. Note that it requires sudo to run due to

Timothy Liu 1.2k Dec 31, 2022
py-image-dedup is a tool to sort out or remove duplicates within a photo library

py-image-dedup is a tool to sort out or remove duplicates within a photo library. Unlike most other solutions, py-image-dedup intentionally uses an approximate image comparison to also detect duplica

Markus Ressel 96 Jan 02, 2023
Bear-Shell is a shell based in the terminal or command prompt.

Bear-Shell is a shell based in the terminal or command prompt. You can navigate files, run python files, create files via the BearUtils text editor, and a lot more coming up!

MichaelBear 6 Dec 25, 2021
A CLI tool to disable and enable security standards controls in AWS Security Hub

Security Hub Controls CLI A CLI tool to disable and enable security standards controls in AWS Security Hub. It is designed to work together with AWS S

AWS Samples 4 Nov 14, 2022
Un module simple pour demander l'accord de l'utilisateur dans une CLI.

Demande de confirmation utilisateur pour CLI Présentation ask_lib est un module pour le langage Python proposant une seule fonction; ask(). Le but pri

CallMePixelMan 7 May 09, 2022
CPOST is a CLI tool to assist with the proper sizing of Clara Deploy pipelines

CPOST (Clara Pipeline Operator Sizing Tool) Tool to measure resource usage of Clara Platform pipeline operators Cpost is a tool that will help you run

NVIDIA Corporation 5 Sep 27, 2021
Command-line interface to PyPI Stats API to get download stats for Python packages

pypistats Python 3.6+ interface to PyPI Stats API to get aggregate download statistics on Python packages on the Python Package Index without having t

Hugo van Kemenade 140 Jan 03, 2023
Tiny command-line utility for mapping broken keys to other positions.

brokenkey Tiny command-line utility for mapping broken keys to other positions. Installation Clone this repository using git: git clone https://github

0 Oct 04, 2021
Colors in Terminal - Python Lang

🎨 Colorate - Python 🎨 About Colorate is an Open Source project that makes it easy to use Python color coding in your projects. After downloading the

0110 Henrique 1 Dec 01, 2021
A Python-based command prompt concept which includes windows command emulation.

PythonCMD A Python-based command prompt concept which includes windows command emulation. Current features: echo: Input your message and it will be cl

1 Feb 05, 2022
A simple automation script that logs into your kra account and files your taxes with one command

EASY_TAX A simple automation script that logs into your kra account and files your taxes with one command Currently works for Chrome users. Will creat

leon koech 13 Sep 23, 2021
A command line application, written in Python, for interacting with Spotify.

spotify-py-cli A command line application, written in Python, for interacting with Spotify. The primary purpose behind developing this app was to gain

Drew Loukusa 0 Oct 07, 2021
A Python-based Wordle solver and CLI player

Wordle A Python-based Wordle solver and CLI player This was created using Python 3.9.7. SPOILER ALERT: the data directory contains spoilers for upcomi

Will Fitzgerald 1 Jul 24, 2022
Present - A terminal-based presentation tool with colors and effects.

present A terminal-based presentation tool with colors and effects. You can also play a codio (pre-recorded code block) on a slide. present is built o

Vinayak Mehta 4.2k Jan 03, 2023
Use case: quick JSON processing/restructuring with jq without terminal

alfred-jq Alfred workflow to conveniently process JQ on clipboard based on a jq query Also available at: packal/jq Use case: quick JSON processing/res

T on Meta Mode 5 Sep 30, 2022