A Python module and command line utility for working with web archive data using the WACZ format specification

Overview

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.
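
Because a WACZ file is an ordinary ZIP archive, its contents can be inspected with Python's standard zipfile module. The following is a minimal sketch, not part of the wacz module itself: the helper names list_wacz_entries and read_datapackage are hypothetical, and it assumes the datapackage.json metadata file sits at the root of the archive.

```python
import json
import zipfile


def list_wacz_entries(wacz_path):
    """Return the names of all files stored in the WACZ (ZIP) archive."""
    with zipfile.ZipFile(wacz_path) as zf:
        return zf.namelist()


def read_datapackage(wacz_path):
    """Load datapackage.json from the archive.

    Assumption: the metadata file is stored at the top level of the ZIP.
    """
    with zipfile.ZipFile(wacz_path) as zf:
        with zf.open("datapackage.json") as fh:
            return json.load(fh)
```

Calling list_wacz_entries("myfile.wacz") is a quick way to see what a package actually contains before loading it elsewhere.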

Install

Use pip to install the module and the command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz WARC_FILE

The resulting myfile.wacz should be loadable via ReplayWeb.page.
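
If you want to drive the command line utility from a script, one option is to call it through Python's subprocess module. This is a sketch under stated assumptions rather than an official API: build_create_command and create_wacz are hypothetical helper names, and the wacz executable is assumed to be on your PATH.

```python
import subprocess


def build_create_command(warc_path, output_path, extra_args=()):
    """Assemble the argument list for `wacz create`."""
    return ["wacz", "create", warc_path, "-o", output_path, *extra_args]


def create_wacz(warc_path, output_path, extra_args=()):
    """Run `wacz create`.

    subprocess.run(..., check=True) raises CalledProcessError if the
    command exits with a non-zero status.
    """
    command = build_create_command(warc_path, output_path, extra_args)
    subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. Additional flags can be supplied through extra_args, for example extra_args=["--detect-pages"].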

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the WACZ file being created.

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates a pages.jsonl page index with a full-text index. This option must be used together with --detect-pages and has no effect on its own.

wacz create tests/fixtures/example-collection.warc --detect-pages -t

--detect-pages

Generates a pages.jsonl page index without a full-text index.

wacz create tests/fixtures/example-collection.warc --detect-pages
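
The .jsonl extension suggests the page index is a JSON Lines file, with one JSON value per line. Assuming that convention holds, a pages.jsonl file on disk could be read with a sketch like the one below. The helper name read_jsonl is hypothetical, and no particular field names are assumed.

```python
import json


def read_jsonl(path):
    """Return one parsed JSON value per non-empty line of the file."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```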

-p --pages

Overrides page index generation and uses the pages from the passed JSONL file instead.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl

-t --text

The --text flag can also be combined with --pages to add a full-text index:

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text

--ts

Overrides the ts metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the title metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --desc DESC

--hash-type

Specifies the hash type to use, either sha256 or md5:

wacz create tests/fixtures/example-collection.warc --hash-type md5
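
To illustrate what the two hash types compute, the sketch below hashes a file with Python's standard hashlib module. It does not show how wacz itself calculates or stores digests. The helper name file_digest and the chunk size are assumptions made for the example.

```python
import hashlib


def file_digest(path, hash_type="sha256", chunk_size=65536):
    """Return the hex digest of a file using sha256 or md5."""
    if hash_type not in ("sha256", "md5"):
        raise ValueError(f"unsupported hash type: {hash_type}")
    digest = hashlib.new(hash_type)
    with open(path, "rb") as fh:
        # Read in chunks so large WARC files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```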

Validate

You can also validate an existing WACZ file by running:

wacz validate myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f myfile.wacz

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests