Arxiv harvester - Poor man's simple harvester for arXiv resources

Overview

Poor man's simple harvester for arXiv resources

This modest Python script takes advantage of arXiv resources hosted by Kaggle to harvest arXiv metadata and PDF, without using the AWS requester paid buckets.

The harvester performs the following tasks:

  • parse the full JSON arXiv metadata file available at Kaggle

  • parallel download PDF located at the public access bucket gs://arxiv-dataset and store them (also in parallel) on a cloud storage, AWS S3 and Swift OpenStack supported, or on the local file system

  • store the metadata of the uploaded article along with the PDF in JSON format

To save storage space, only the most recent available version of the PDF for an article is harvested, not every available versions.

Resuming interrupted and incremental update are automatically supported.

In case an article is only available in postcript, it will be converted into PDF too - but it is extremely rare (and usually when it happens the conversion fails because the PostSript is corrupted...).

Install

The tool is supposed to work on a POSIX environment. External call to the following command lines are used: gzip, gunzip and ps2pdf.

First, download the full arXiv metadata JSON file available at https://www.kaggle.com/Cornell-University/arxiv (1GB compressed). It's actually a JSONL file (one JSON document per line), currently named arxiv-metadata-oai-snapshot.json.zip. You can also generate yourself this file with arxiv-public-dataset OAI harvester using the arXiv OAI-PMH service.

Get this github repo:

git clone https://github.com/kermitt2/arxiv_harvester
cd arxiv_harvester

Setup a virtual environment:

virtualenv --system-site-packages -p python3.8 env
source env/bin/activate

Install the dependencies:

pip3 install -r requirements.txt

Finally install the project in editable state:

pip3 install -e .

Usage

First check the configuration file:

  • set the parameters according to your selected storage (AWS S3, SWIFT OepnStack or local storage), see below for more details,
  • the default batch_size for parallel download/upload is 10, change it as you wish and dare,
  • by default gzip compression of files on the target storage is selected.
arXiv harvester

optional arguments:
  -h, --help           show this help message and exit
  --config CONFIG      path to the config file, default is ./config.json
  --reset              ignore previous processing states and re-init the harvesting process from
                       the beginning
  --metadata METADATA  arXiv metadata json file
  --diagnostic         produce a summary of the harvesting

For example, to harvest articles from a metadata snapshot file:

python3 arxiv_harvester/harvester.py --metadata arxiv-metadata-oai-snapshot.json.zip --config config.json

To reset an existing harvesting and starts the harvesting again from scratch, add the --reset argument:

python3 arxiv_harvester/harvester.py --metadata arxiv-metadata-oai-snapshot.json.zip --config config.json --reset

Note that with --reset, no actual stored PDF file is removed - only the harvesting process is reinitialized.

Interrupted harvesting / Incremental update

Launching the harvesting command on an interrupted harvesting will resume the harvesting automatically where it stopped.

If the arXiv metadata file has been updated to a newer version (downloaded from https://www.kaggle.com/Cornell-University/arxiv or generated with arxiv-public-dataset OAI harvester), launching the harvesting command on the updated metadata file will harvest only the new and updated articles (new most recent PDF version).

Resource file organization

The organization of harvested files permits a direct access to the PDF based on the arxiv identifier. More particularly, the Open Access link given for an arXiv resource by Unpaywall is enough to create a direct access path. It also avoids storing too many files in the same directory for performance reasons.

The stored PDF is always the most recent version. There is no need to know what is the exact latest version (an information that we don't have with the Unpaywall arXiv full text links for example). The local metadata file for the article gives the version number of the stored PDF.

For example, to get access path from the identifiers or Unpaywall OA url:

  • post-2007 arXiv identifiers (pattern arXiv:YYMM.numbervV or commonly YYMM.numbervV):

    • 1501.00001v1 -> $root/arXiv/1501/1501.00001/1501.00001.pdf (most recent version of the PDF), $root/arXiv/1501/1501.00001/1501.00001.json (arXiv metadata for the article)
    • Unpaywall link http://arxiv.org/pdf/1501.00001 -> $root/arXiv/1501/1501.00001/1501.00001.pdf, $root/arXiv/1501/1501.00001/1501.00001.json
  • pre-2007 arXiv identifiers (pattern archive.subject_call/YYMMnumber):

    • quant-ph/0602109 -> $root/quant-ph/0602/0602109/0602109.pdf (most recent version of the PDF), $root/quant-ph/0602/0602109/0602109.json (arXiv metadata for the article)

    • Unpaywall link https://arxiv.org/pdf/quant-ph/0602109 -> $root/quant-ph/0602/0602109/0602109.pdf, $root/quant-ph/0602/0602109/0602109.json

If the compression option is set to True in the configuration file config.json, all the resources have an additional .gz extension.

$root in the above examples should be adapted to the storage of choice, as configured in the configuration file config.json. For instance with AWS S3: https://bucket_name.s3.amazonaws.com/arXiv/1501/1501.00001/1501.00001.pdf (if access rights are appropriate). The same applies to a SWIFT object storage based on the container name indicated in the config file.

AWS S3 and SWIFT configuration

For a local storage, just indicate the path where to store the PDF with the parameter data_path in the configuration file config.json.

The configuration for a S3 storage uses the following parameters:

{
    "aws_access_key_id": "",
    "aws_secret_access_key": "",
    "bucket_name": "",
    "region": ""
}

If you are not using a S3 storage, remove these keys or leave these values empty.

The configuration for a SWIFT object storage uses the following parameters:

{
    "swift": {},
    "swift_container": ""
}

If you are not using a SWIFT storage, remove these keys or leave these above values empty.

The "swift" key will contain the account and authentication information, typically via Keystone, something like this:

{
    "swift": {
        "auth_version": "3",
        "auth_url": "https://auth......./v3",
        "os_username": "user-007",
        "os_password": "1234",
        "os_user_domain_name": "Default",
        "os_project_domain_name": "Default",
        "os_project_name": "myProjectName",
        "os_project_id": "myProjectID",
        "os_region_name": "NorthPole",
        "os_auth_url": "https://auth......./v3"
    },
    "swift_container": "my_arxiv_harvesting"
}

Limitations

Source files (LaTeX sources) are not available via the Kaggle dataset and thus via this modest harvester. The LaTeX source files are available via AWS S3 Bulk Source File Access.

There are 44 articles only available in HTML format. These articles will not be harvested.

Acknowledgements

Kaggle arXiv dataset relies on arxiv-public-datasets:

Clement, C. B., Bierbaum, M., O'Keeffe, K. P., & Alemi, A. A. (2019). On the Use of ArXiv as a Dataset. arXiv preprint arXiv:1905.00075.

License and contact

This modest tool is distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

If you contribute to this Open Source project, you agree to share your contribution following this license.

Kaggle dataset arXiv Metadata is distributed under CC0 1.0 license. Note that most articles on arXiv are submitted with the default arXiv license, which does usually not allow redistribution. See here about the possible usage of the harvested PDF.

Main author and contact: Patrice Lopez ([email protected])

Owner
Patrice Lopez
Patrice Lopez
Discriminative Region Suppression for Weakly-Supervised Semantic Segmentation

Discriminative Region Suppression for Weakly-Supervised Semantic Segmentation (AAAI 2021) Official pytorch implementation of our paper: Discriminative

Beom 74 Dec 27, 2022
3rd place solution for the Weather4cast 2021 Stage 1 Challenge

weather4cast2021_Stage1 3rd place solution for the Weather4cast 2021 Stage 1 Challenge Dependencies The code can be executed from a fresh environment

5 Aug 14, 2022
Random Forests for Regression with Missing Entries

Random Forests for Regression with Missing Entries These are specific codes used in the article: On the Consistency of a Random Forest Algorithm in th

Irving Gómez-Méndez 1 Nov 15, 2021
Deepfake Scanner by Deepware.

Deepware Scanner (CLI) This repository contains the command-line deepfake scanner tool with the pre-trained models that are currently used at deepware

deepware 110 Jan 02, 2023
OCRA (Object-Centric Recurrent Attention) source code

OCRA (Object-Centric Recurrent Attention) source code Hossein Adeli and Seoyoung Ahn Please cite this article if you find this repository useful: For

Hossein Adeli 2 Jun 18, 2022
A New Approach to Overgenerating and Scoring Abstractive Summaries

We provide the source code for the paper "A New Approach to Overgenerating and Scoring Abstractive Summaries" accepted at NAACL'21. If you find the code useful, please cite the following paper.

Kaiqiang Song 4 Apr 03, 2022
DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation

DynaTune: Dynamic Tensor Program Optimization in Deep Neural Network Compilation This repository is the implementation of DynaTune paper. This folder

4 Nov 02, 2022
Code for generating a single image pretraining dataset

Single Image Pretraining of Visual Representations As shown in the paper A critical analysis of self-supervision, or what we can learn from a single i

Yuki M. Asano 12 Dec 19, 2022
PyTorch implementation for our AAAI 2022 Paper "Graph-wise Common Latent Factor Extraction for Unsupervised Graph Representation Learning"

deepGCFX PyTorch implementation for our AAAI 2022 Paper "Graph-wise Common Latent Factor Extraction for Unsupervised Graph Representation Learning" Pr

Thilini Cooray 4 Aug 11, 2022
Clustering with variational Bayes and population Monte Carlo

pypmc pypmc is a python package focusing on adaptive importance sampling. It can be used for integration and sampling from a user-defined target densi

45 Feb 06, 2022
[ACM MM 2021] Yes, "Attention is All You Need", for Exemplar based Colorization

Transformer for Image Colorization This is an implemention for Yes, "Attention Is All You Need", for Exemplar based Colorization, and the current soft

Wang Yin 30 Dec 07, 2022
hipCaffe: the HIP port of Caffe

Caffe Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Cent

ROCm Software Platform 126 Dec 05, 2022
Neural Nano-Optics for High-quality Thin Lens Imaging

Neural Nano-Optics for High-quality Thin Lens Imaging Project Page | Paper | Data Ethan Tseng, Shane Colburn, James Whitehead, Luocheng Huang, Seung-H

Ethan Tseng 39 Dec 05, 2022
A tensorflow model that predicts if the image is of a cat or of a dog.

Quick intro Hello and thank you for your interest in my project! This is the backend part of a two-repo application. The other part can be found here

Tudor Matei 0 Mar 08, 2022
DeepFaceEditing: Deep Face Generation and Editing with Disentangled Geometry and Appearance Control

DeepFaceEditing: Deep Face Generation and Editing with Disentangled Geometry and Appearance Control One version of our system is implemented using the

260 Nov 28, 2022
Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System

Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System Authors: Yixuan Su, Lei Shu, Elman Mansimov, Arshit Gupta, Deng Cai, Yi-An Lai

Amazon Web Services - Labs 123 Dec 23, 2022
PyTorch implementation of PSPNet segmentation network

pspnet-pytorch PyTorch implementation of PSPNet segmentation network Original paper Pyramid Scene Parsing Network Details This is a slightly different

Roman Trusov 532 Dec 29, 2022
Official implementation of "GS-WGAN: A Gradient-Sanitized Approach for Learning Differentially Private Generators" (NeurIPS 2020)

GS-WGAN This repository contains the implementation for GS-WGAN: A Gradient-Sanitized Approach for Learning Differentially Private Generators (NeurIPS

46 Nov 09, 2022
Python Fanduel API (2021) - Lineup Automation

Southpaw is a python package that provides access to the Fanduel API. Optimize your DFS experience by programmatically updating your lineups, analyzin

Brandin Canfield 13 Jan 04, 2023
LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT

LightHuBERT LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT | Github | Huggingface | SUPER

WangRui 46 Dec 29, 2022