hop

A simple archive format designed for quickly reading some files without extracting the entire archive. It may be used in Bun.

25x faster than unzip and 10x faster than tar at reading individual files (uncompressed)

Format	Random access	Fast extraction	Fast archiving	Compression	Encryption	Append
hop	✅	✅	✅	❌	❌	❌
tar	❌	✅	✅	❌	❌	✅
zip	✅ (when small)	❌	❌	✅	✅	✅

Features:

Faster at printing individual files than tar & zip (compression disabled)
Faster extraction than zip, comparable to tar (compression disabled)
Faster archiving than zip, comparable to tar (compression disabled)

Anti-features:

Single-threaded (but doesn't need to be)
I wrote it in about 3 hours and there are no tests
No checksums yet. Probably not a good idea to use this for untrusted data until that's fixed.
Ignores symlinks
Can't be larger than 4 GB
Archives are read-only and file names are not normalized across platforms

Usage

Download the binary from /releases

To create an archive:

hop ./path-to-folder

To extract an archive:

hop archive.hop

To print one file from the archive:

hop archive.hop package.json

Why?

Why can't software read many tiny files with performance characteristics similar to those of reading individual files?

Reading and writing lots of tiny files incurs significant syscall overhead, and (npm) packages often have lots of tiny files. Zip files are unacceptably slow to read from as if they were a directory. Tar files extract quickly, but are slow at non-sequential access.
Reading directory entries (ls) in large directory trees is slow.

Some benchmarks

On macOS 12 with an M1X

Using tigerbeetle github repo as an example

Extracting:

Archiving:

On an Ubuntu AMD64 server

Extracting a node_modules folder

Why faster?

It stores an array of hashes, one per file path, and the list of files is sorted lexicographically. This makes non-sequential access faster than tar, but can make creating new archives slower.
Does not store directories, only files
.hop files are read-only (more precisely, one could append but would have to rewrite all metadata)
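Uses copy_file_range, so file contents are copied between file descriptors inside the kernel instead of passing through userspace buffers.
Packed structs make serialization and deserialization very fast because there is very little encoding or decoding to do.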
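
The hash-array idea above can be sketched roughly as follows. This is an illustration in Python, not hop's actual code: the hash function (CRC-32 from zlib), the linear scan, and the helper names build_index and find_file are all assumptions made for the example. The point it shows is that a lookup only touches a compact array of 32-bit integers and compares a full path only when a hash matches.

```python
import zlib


def name_hash(path: str) -> int:
    # Stand-in hash (assumption): hop's real hash function is not stated here.
    return zlib.crc32(path.encode("utf-8")) & 0xFFFFFFFF


def build_index(paths):
    # Sort file paths lexicographically and keep a parallel array of hashes.
    names = sorted(paths)
    hashes = [name_hash(p) for p in names]
    return names, hashes


def find_file(names, hashes, wanted):
    # Scan the compact hash array; compare the full name only on a hash match,
    # which also guards against hash collisions.
    h = name_hash(wanted)
    for i, candidate in enumerate(hashes):
        if candidate == h and names[i] == wanted:
            return i
    return None


names, hashes = build_index(["src/main.zig", "package.json", "README.md"])
print(find_file(names, hashes, "package.json"))  # prints 1
print(find_file(names, hashes, "missing.txt"))   # prints None
```

Sorting happens once at archive creation, which is consistent with the note that creating archives can be slower while reads get faster.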

How does it work?

File contents go at the top of the archive, and file metadata goes at the bottom.
This is the metadata it currently stores:

package Hop;

struct StringPointer {
    uint32 off;
    uint32 len;
}

struct File {
    StringPointer name;
    uint32 name_hash;
    uint32 chmod;
    uint32 mtime;
    uint32 ctime;
    StringPointer data;
}

message Archive {
    uint32 version = 1;
    uint32 content_offset = 2;
    File[] files = 3;
    uint32[] name_hashes = 4;
    byte[] metadata = 5;
}
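
To make the layout concrete, here is a small Python sketch of a fixed-size File record and a contents-first archive. It assumes each File is eight little-endian uint32 values with no padding (32 bytes); the real wire encoding produced by the schema tooling may differ. The helper names pack_file and unpack_file, the toy file set, and the zeroed name_hash and timestamps are placeholders for the example.

```python
import struct

# Assumed layout: name.off, name.len, name_hash, chmod, mtime, ctime,
# data.off, data.len as eight little-endian uint32 values, no padding.
FILE_RECORD = struct.Struct("<8I")
FIELDS = ("name_off", "name_len", "name_hash", "chmod",
          "mtime", "ctime", "data_off", "data_len")


def pack_file(**fields):
    # Serialize one record; missing fields default to 0.
    return FILE_RECORD.pack(*(fields.get(k, 0) for k in FIELDS))


def unpack_file(records: bytes, index: int) -> dict:
    # Fixed-size records allow jumping straight to entry `index`.
    values = FILE_RECORD.unpack_from(records, index * FILE_RECORD.size)
    return dict(zip(FIELDS, values))


# Toy archive: file contents first, then names and records as metadata.
files = {"package.json": b'{"name":"demo"}', "hello.txt": b"hello\n"}
contents, name_blob, records = b"", b"", b""
for name in sorted(files):
    encoded = name.encode("utf-8")
    records += pack_file(name_off=len(name_blob), name_len=len(encoded),
                         chmod=0o644,
                         data_off=len(contents), data_len=len(files[name]))
    name_blob += encoded
    contents += files[name]

# Read the second entry without touching any other file's bytes.
rec = unpack_file(records, 1)
rec_name = name_blob[rec["name_off"]:rec["name_off"] + rec["name_len"]].decode()
rec_data = contents[rec["data_off"]:rec["data_off"] + rec["data_len"]]
print(rec_name, rec_data)
```

Because every offset and length in the schema is a uint32, this layout is presumably also where the 4 GB archive size limit comes from.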
