A Python module and command line utility for working with web archive data using the WACZ format specification

Overview

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.
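Because a WACZ file is an ordinary ZIP archive, its contents can be inspected with Python's standard library alone. The sketch below is illustrative only and does not use the wacz module. The helper names `list_wacz_members` and `build_demo_wacz` are hypothetical, and the member path `pages/pages.jsonl` is an assumed layout used just for the demo.

```python
import io
import zipfile


def list_wacz_members(wacz_file):
    """Return the sorted member names of a WACZ (ZIP) file.

    wacz_file may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(wacz_file) as zf:
        return sorted(zf.namelist())


def build_demo_wacz():
    """Build a tiny in-memory ZIP standing in for a real WACZ (demo only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", "{}")
        # Assumed location of the page index inside the package.
        zf.writestr("pages/pages.jsonl", "")
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name in list_wacz_members(build_demo_wacz()):
        print(name)
```

To look inside a real package, pass its path instead, for example `list_wacz_members("myfile.wacz")`.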

Install

Use pip to install the module and the command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz tests/fixtures/example-collection.warc

The resulting myfile.wacz should be loadable via ReplayWeb.page.

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the WACZ file being created.

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates a pages.jsonl page index that includes a full-text index. Must be used together with --detect-pages; it has no effect on its own.

wacz create tests/fixtures/example-collection.warc --detect-pages -t

--detect-pages

Generates a pages.jsonl page index without a full-text index.

wacz create tests/fixtures/example-collection.warc --detect-pages

-p --pages

Uses the passed JSONL pages file in place of the generated pages index.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl

-t --text

When passing pages with -p, you can add a full-text index by including the --text flag:

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text
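A pages file passed with -p is in JSON Lines form: one JSON object per line. As a rough sketch, the snippet below serializes page records with the standard library. The helper name `pages_to_jsonl` is hypothetical, and the header line and the `id`, `url`, `title` and `ts` fields are assumptions based on the WACZ specification, so check the specification for the exact fields wacz expects.

```python
import json


def pages_to_jsonl(pages, title="All Pages"):
    """Serialize page records as JSON Lines text.

    The first line is a header object; each following line is one page.
    The header fields used here are an assumption, not confirmed by wacz.
    """
    header = {"format": "json-pages-1.0", "id": "pages", "title": title}
    lines = [json.dumps(header)]
    lines.extend(json.dumps(page) for page in pages)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical page record for illustration only.
    demo_pages = [
        {
            "id": "1",
            "url": "https://example.com/",
            "title": "Example",
            "ts": "2021-01-01T00:00:00Z",
        },
    ]
    print(pages_to_jsonl(demo_pages), end="")
```

Writing the returned text to a file such as passed_pages.jsonl gives something that can be handed to the -p option.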

--ts

Overrides the ts metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the title metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file.

wacz create tests/fixtures/example-collection.warc --desc DESC

--hash-type

Specifies the hash type to use (sha256 or md5):

wacz create tests/fixtures/example-collection.warc --hash-type md5
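To make the two hash types concrete, here is a minimal sketch of computing either digest over a stream with Python's hashlib. It is not taken from the wacz source: the helper name `stream_digest` is hypothetical, and the `type:hexdigest` output form is an assumption about how such hashes are commonly recorded.

```python
import hashlib
import io

SUPPORTED_HASH_TYPES = ("sha256", "md5")


def stream_digest(fileobj, hash_type="sha256", chunk_size=65536):
    """Hash a binary file-like object in chunks and return 'type:hexdigest'."""
    if hash_type not in SUPPORTED_HASH_TYPES:
        raise ValueError(f"unsupported hash type: {hash_type}")
    hasher = hashlib.new(hash_type)
    # Read in fixed-size chunks so large WARC files need not fit in memory.
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        hasher.update(chunk)
    return f"{hash_type}:{hasher.hexdigest()}"


if __name__ == "__main__":
    data = b"example WARC bytes"
    for hash_type in SUPPORTED_HASH_TYPES:
        print(stream_digest(io.BytesIO(data), hash_type))
```

For a file on disk, open it in binary mode and pass the handle, for example `stream_digest(open("myfile.warc", "rb"), "md5")`.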

Validate

You can also validate an existing WACZ file by running:

wacz validate myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f myfile.wacz
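One kind of check a validator can make is comparing the hashes recorded in datapackage.json with the bytes actually stored in the ZIP. The sketch below illustrates that idea only; it is not the implementation of wacz validate, which may check other things. The `resources` list with `path` and `hash` entries is an assumed datapackage.json layout, and both function names are hypothetical.

```python
import hashlib
import io
import json
import zipfile


def check_resource_hashes(wacz_file):
    """Return paths of resources whose recorded hash does not match the data."""
    mismatches = []
    with zipfile.ZipFile(wacz_file) as zf:
        datapackage = json.loads(zf.read("datapackage.json"))
        for resource in datapackage.get("resources", []):
            # Assumed hash form: "<hash type>:<hex digest>".
            hash_type, _, expected = resource["hash"].partition(":")
            actual = hashlib.new(hash_type, zf.read(resource["path"])).hexdigest()
            if actual != expected:
                mismatches.append(resource["path"])
    return mismatches


def build_demo_wacz(tamper=False):
    """Build an in-memory stand-in for a WACZ, optionally with altered data."""
    payload = b'{"format": "json-pages-1.0"}\n'
    recorded = "sha256:" + hashlib.sha256(payload).hexdigest()
    datapackage = {"resources": [{"path": "pages/pages.jsonl", "hash": recorded}]}
    stored = (payload + b"extra") if tamper else payload
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("datapackage.json", json.dumps(datapackage))
        zf.writestr("pages/pages.jsonl", stored)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(check_resource_hashes(build_demo_wacz()))
    print(check_resource_hashes(build_demo_wacz(tamper=True)))
```

An empty list means every recorded hash matched; any listed path points at data that changed after the hash was recorded.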

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests

Owner

Webrecorder (Webrecorder Project)