A Python module and command line utility for working with web archive data using the WACZ format specification

Overview

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.

Install

Use pip to install the module and a command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz 
   

   

The resulting myfile.wacz should be loadable via ReplayWeb.page.

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the wacz being created

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates pages.jsonl page index with a full-text index, must be run in conjunction with --detect-pages. Will have no effect if run alone

wacz create tests/fixtures/example-collection.warc -t

--detect-pages

Generates pages.jsonl page index without a full-text index

wacz create tests/fixtures/example-collection.warc --detect-pages

-p --pages

Overrides the pages index generation with the passed jsonl pages.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl

-t --text

You can add a full text index by including the --text tag

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text

--ts

Overrides the ts metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the titles metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --desc DESC

--hash-type

Allows the user to specify the hash type used: (sha256 or md5):

wacz create tests/fixtures/example-collection.warc --hash-type md5

Validate

You can also validate an existing WACZ file by running:

wacz validate myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f tests/fixtures/example-collection.warc

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests
Owner
Webrecorder
Webrecorder Project
Webrecorder
๐Ÿ—ƒ๏ธ Fileio-cli wrapper for fileioapi.py with fire.py, inspiration DOS

๐Ÿ—ƒ๏ธ File.io File.io simply upload a file, share the link, and after it is downloaded, the file is completely deleted. An API wrapper for the file.io w

nkot56297 2 May 12, 2022
Simple Tool To Grab Like-Card Coupon

Simple Tool To Grab Like-Card Coupon

Soud 10 Jan 30, 2022
Stephen's Obsessive Note-Storage Engine.

Latest Release ยท PyPi Package ยท Issues ยท Changelog ยท License # Get Sonse and tell it where your notes are... $ pip install sonse $ export SONSE="$HOME

Stephen Malone 23 Jun 10, 2022
Gamma ion pump QPC ethernet Python library & CLI utility

Unofficial Gamma ion pump ethernet control CLI utility and library This is a mini Python 3 library and utility that exposes some of the functions of t

2 Jul 18, 2022
A simple CLI tool for tracking Pikud Ha'oref alarms.

Pikud Ha'oref Alarm Tracking A simple CLI tool for tracking Pikud Ha'oref alarms. Polls the unofficial API endpoint every second for incoming alarms.

Yuval Adam 24 Oct 10, 2022
This a simple tool to query the awesome ippsec.rocks website from your terminal

ippsec-cli This a simple tool to query the awesome ippsec.rocks website from your terminal Installation and usage cd /opt git clone https://github.com

stark0de 5 Nov 26, 2022
dotfilery, configuration, environment settings, automation, etc.

โ”Œโ”ฌโ”โ”Œโ”€โ”โ”Œโ”€โ”โ”Œโ”€โ”โ”ฌ โ”ฌโ”Œโ”ฌโ”โ”ฌ โ”ฌโ”ฌโ”Œโ”€โ” โ”‚โ”‚โ”‚โ”œโ”ค โ”‚ โ”ฌโ”œโ”€โ”คโ”‚ โ”‚ โ”‚ โ”œโ”€โ”คโ”‚โ”‚ :: bits & bobs, dots & things. โ”ด โ”ดโ””โ”€โ”˜โ””โ”€โ”˜โ”ด โ”ดโ”ดโ”€โ”˜โ”ด โ”ด โ”ด โ”ดโ”ดโ””โ”€โ”˜ @megalithic ๐Ÿš€ Instal

Seth Messer 89 Dec 25, 2022
AthenaCLI is a CLI tool for AWS Athena service that can do auto-completion and syntax highlighting.

Introduction AthenaCLI is a command line interface (CLI) for the Athena service that can do auto-completion and syntax highlighting, and is a proud me

dbcli 192 Jan 07, 2023
CPOST is a CLI tool to assist with the proper sizing of Clara Deploy pipelines

CPOST (Clara Pipeline Operator Sizing Tool) Tool to measure resource usage of Clara Platform pipeline operators Cpost is a tool that will help you run

NVIDIA Corporation 5 Sep 27, 2021
Doing set operations on files considered as sets of lines

CLI tool that can be used to do set operations like union on files considering them as a set of lines. Notes It ignores all empty lines with whitespac

Partho 11 Sep 06, 2022
Command-line parsing library for Python 3.

Command-line parsing library for Python 3.

36 Dec 15, 2022
grungegirl is the hacker's drug encyclopedia. programmed in python for maximum modularity and ease of configuration.

grungegirl. cli-based drug search for girls. welcome. grungegirl is aiming to be the premier drug culture application. it is the hacker's encyclopedia

Eristava 10 Oct 02, 2022
Display Images in your terminal with python

A python library to display images in the terminal

Pranav Baburaj 57 Dec 30, 2022
A command-line tool to flash python code to Codey Rocky without having to use the online mblock5 IDE.

What? A command-line tool to flash python code to Codey Rocky without having to use the online mblock5 IDE. Description This is a very low-effort proj

1 Dec 29, 2021
A command line tool that creates a super timeline from SentinelOne's Deep Visibility data

S1SuperTimeline A command line tool that creates a super timeline from SentinelOne's Deep Visibility data What does it do? The script accepts a S1QL q

Juan Ortega 2 Feb 08, 2022
Automaton - python script to execute bash command based on changes in size of a file.

automaton python script to execute given command = everytime size of a given file changes,hence everytime a file is modified.(almost) download automa

asrar bhat 1 Jan 03, 2022
Convert ACSM files to DRM-free EPUB files with one command on Linux

Knock Convert ACSM files to DRM-free EPUB files using one command. This software does not utilize Adobe Digital Editions nor Wine. It is completely fr

Benton Edmondson 622 Dec 09, 2022
Todo - You could use terminal to set your todo

Python Tutorial You can learn how to build a terminal application(CLI applicatio

29 Jun 29, 2022
uploadgram uses your Telegram account to upload files up to 2GiB, from the Terminal.

uploadgram uploadgram uses your Telegram account to upload files up to 2GiB, from the Terminal. Heavily inspired by the telegram-upload Installing: pi

Shrimadhav U K 97 Jan 06, 2023
split-manga-pages: a command line utility written in Python that converts your double-page layout manga to single-page layout.

split-manga-pages split-manga-pages is a command line utility written in Python that converts your double-page layout manga (or any images in double p

Christoffer Aakre 3 May 24, 2022