A Python module and command line utility for working with web archive data using the WACZ format specification

Overview

py-wacz

The py-wacz repository contains a Python module and command line utility for working with web archive data using the WACZ format specification. Web Archive Collection Zipped (WACZ) allows web archives to be shared and distributed by providing a predictable way of packaging up web archive data and metadata as a ZIP file. The wacz command line utility supports converting any WARC files into WACZ files, and optionally generating full-text search indices of pages.

Install

Use pip to install the module and a command line utility:

pip install wacz

Once installed you can use the wacz command line utility to create and validate WACZ files.

Create

To create a WACZ package you can point wacz at a WARC file and tell it where to write the WACZ with the -o option:

wacz create -o myfile.wacz 
   

   

The resulting myfile.wacz should be loadable via ReplayWeb.page.

wacz accepts the following options for customizing how the WACZ file is assembled.

-f --file

Explicitly declare the file being passed to the create function.

wacz create -f tests/fixtures/example-collection.warc

-o --output

Explicitly declare the name of the wacz being created

wacz create tests/fixtures/example-collection.warc -o mywacz.wacz

-t --text

Generates pages.jsonl page index with a full-text index, must be run in conjunction with --detect-pages. Will have no effect if run alone

wacz create tests/fixtures/example-collection.warc -t

--detect-pages

Generates pages.jsonl page index without a full-text index

wacz create tests/fixtures/example-collection.warc --detect-pages

-p --pages

Overrides the pages index generation with the passed jsonl pages.

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl

-t --text

You can add a full text index by including the --text tag

wacz create tests/fixtures/example-collection.warc -p passed_pages.jsonl --text

--ts

Overrides the ts metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --ts TIMESTAMP

--url

Overrides the url metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --url URL

--title

Overrides the titles metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --title TITLE

--desc

Overrides the desc metadata value in the datapackage.json file

wacz create tests/fixtures/example-collection.warc --desc DESC

--hash-type

Allows the user to specify the hash type used: (sha256 or md5):

wacz create tests/fixtures/example-collection.warc --hash-type md5

Validate

You can also validate an existing WACZ file by running:

wacz validate myfile.wacz

-f --file

Explicitly declare the file being passed to the validate function.

wacz validate -f tests/fixtures/example-collection.warc

Testing

If you are developing wacz you can run the unit tests with pytest:

pytest tests
Owner
Webrecorder
Webrecorder Project
Webrecorder
Custom 64 bit shellcode encoder that evades detection and removes some common badchars (\x00\x0a\x0d\x20)

x64-shellcode-encoder Custom 64 bit shellcode encoder that evades detection and removes some common badchars (\x00\x0a\x0d\x20) Usage Using a generato

Cole Houston 2 Jan 26, 2022
Command line interface for unasync

CLI for unasync Command line interface for unasync Getting started Install Run the following command to install the package with pip: pip install unas

Leynier Gutiérrez González 3 Apr 04, 2022
Tstock - Check stocks from the terminal

tstock - Check stocks from the terminal! 📈 tstock is a tool to easily generate stock charts from the command line. Just type tstock aapl to get a 3 m

Gabe Banks 502 Dec 30, 2022
A terminal UI dashboard to monitor requests for code review across Github and Gitlab repositories.

A terminal UI dashboard to monitor requests for code review across Github and Gitlab repositories.

Kyle Harrison 150 Dec 14, 2022
QueraToCSV is a simple python CLI project to convert the Quera results file into CSV files.

Quera is an Iranian Learning management system (LMS) that has an online judge for programming languages. Some Iranian universities use it to automate the evaluation of programming assignments.

Amirmahdi Namjoo 16 Nov 11, 2022
Command-line tool for downloading and extending the RedCaps dataset.

Command-line tool for downloading and extending the RedCaps dataset.

RedCaps dataset 33 Dec 14, 2022
Trans is a dependency-free CLI for Google Translate

Trans is a dependency-free CLI for Google Translate

11 Jan 04, 2022
Urial (URI Addition tooL) intelligently updates URIs stored in Finder comments of macOS files

Urial Urial (URI addition tool) is a simple but intelligent command-line tool to add or replace URIs found inside macOS Finder comments. Table of cont

Mike Hucka 3 Sep 14, 2022
Salesforce object access auditor

Salesforce object access auditor Released as open source by NCC Group Plc - https://www.nccgroup.com/ Developed by Jerome Smith @exploresecurity (with

NCC Group Plc 90 Sep 19, 2022
A command-line tool to flash python code to Codey Rocky without having to use the online mblock5 IDE.

What? A command-line tool to flash python code to Codey Rocky without having to use the online mblock5 IDE. Description This is a very low-effort proj

1 Dec 29, 2021
Command-line tool to use LNURL with your LND instance

Sprint planner Sprint planner is a Python script for planning your Jira tasks based on your calendar availability. Installation Use the package manage

Djuri Baars 6 Jan 14, 2022
A cd command that learns - easily navigate directories from the command line

NAME autojump - a faster way to navigate your filesystem DESCRIPTION autojump is a faster way to navigate your filesystem. It works by maintaining a d

William Ting 14.5k Jan 03, 2023
ctree - command line christmas tree

ctree ctree - command line christmas tree It is small python script that prints a christmas tree in terminal. It is colourful and always gives you a d

15 Aug 15, 2022
tox-server is a command line tool which runs tox in a loop and calls it with commands from a remote CLI.

Tox Server tox-server is a command line tool which runs tox in a loop and calls it with commands from a remote CLI. It responds to commands via ZeroMQ

Alexander Rudy 3 Jan 10, 2022
Python-based implementation and comparison of strategies to guess words at Wordle

Solver and comparison of strategies for Wordle Motivation The goal of this repository is to compare, in terms of performance, strategies that minimize

Ignacio L. Ibarra 4 Feb 16, 2022
A powerful, colorful, beautiful command-line-interface for pypi.org

pypi-command-line pypi-command-line is a colorful, powerful, and beautiful command line interface for pypi.org that is actively maintained Detailed Do

Wasi Master 32 Jun 23, 2022
Set of scripts & tools for converting between numbers and major system encoded words.

major-system-converter Set of scripts & tools for converting between numbers and major system encoded words. Uses phonetics instead of letters to conv

4 Aug 09, 2022
This project contains the ClonedPerson dataset and code described in our paper "Cloning Outfits from Real-World Images to 3D Characters for Generalizable Person Re-Identification".

ClonedPerson This is the official repository for the ClonedPerson project, which contains the ClonedPerson dataset and code described in our paper "Cl

Yanan Wang 55 Dec 27, 2022
A CLI Application to detect plagiarism in Source Code Files.

Plag Description A CLI Application to detect plagiarism in Source Code Files. Features Compare source code files for plagiarism. Extract code features

default=dev 2 Nov 10, 2022
argofloats: Simple CLI for ArgoVis and Argofloats

argofloats: Simple CLI for ArgoVis and Argofloats Argo is an international program that collects information from inside the ocean using a fleet of ro

Samapriya Roy 2 Feb 13, 2022