Arxiv harvester - Poor man's simple harvester for arXiv resources

Overview

This modest Python script takes advantage of arXiv resources hosted by Kaggle to harvest arXiv metadata and PDFs without using the AWS requester-pays buckets.

The harvester performs the following tasks:

  • parse the full JSON arXiv metadata file available at Kaggle

  • download in parallel the PDFs located in the public access bucket gs://arxiv-dataset and store them (also in parallel) on a cloud storage (AWS S3 and Swift OpenStack are supported) or on the local file system

  • store the metadata of the uploaded article along with the PDF in JSON format
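
The parallel download and upload steps listed above could be organized roughly as follows. This is only a minimal sketch: the function name `process_in_batches` and the injected `worker` callable are hypothetical and are not taken from the harvester's code. The only element borrowed from this README is the idea of a `batch_size` controlling the degree of parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def process_in_batches(items, worker, batch_size=10):
    """Apply `worker` to every item, running at most `batch_size` calls at once.

    `worker` is a hypothetical callable standing in for "download one PDF
    and store it"; it is injected so the sketch stays free of network code.
    """
    results = []
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for start in range(0, len(items), batch_size):
            batch = items[start:start + batch_size]
            # pool.map preserves the input order of the batch
            results.extend(pool.map(worker, batch))
    return results


if __name__ == "__main__":
    identifiers = ["1501.00001", "1501.00002", "quant-ph/0602109"]
    print(process_in_batches(identifiers, lambda ident: ident.upper(), batch_size=2))
```

Threads are a reasonable fit here because the work is dominated by network and storage I/O rather than computation.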

To save storage space, only the most recent available version of the PDF for an article is harvested, not every available version.
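
As an illustration of selecting the most recent version, the sketch below picks the highest version label from one metadata record. It assumes that each record carries a `versions` list whose entries have a `version` label such as `v1`, which is how the Kaggle snapshot is commonly laid out; treat the field names as an assumption and check them against your copy of the file. The helper name `latest_version` is hypothetical.

```python
import json


def latest_version(record):
    """Return the highest version label (e.g. 'v3') of a metadata record.

    Assumes record["versions"] is a list of dicts with a "version" key;
    returns None when no version information is present.
    """
    versions = record.get("versions", [])
    if not versions:
        return None
    # Compare numerically so that 'v10' ranks above 'v9'
    return max(
        (entry["version"] for entry in versions),
        key=lambda label: int(label.lstrip("v")),
    )


line = '{"id": "1501.00001", "versions": [{"version": "v1"}, {"version": "v2"}, {"version": "v10"}]}'
record = json.loads(line)
print(record["id"], latest_version(record))
```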

Resuming an interrupted harvesting and incremental updates are supported automatically.

If an article is only available in PostScript, it is converted into PDF as well, but this is extremely rare (and when it happens, the conversion usually fails because the PostScript is corrupted).

Install

The tool is intended to run in a POSIX environment. It makes external calls to the following command-line tools: gzip, gunzip and ps2pdf.
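
Before starting a long harvesting run, it can be useful to check that these tools are on the PATH. The following is a small sketch using only the standard library; the helper name `missing_tools` is hypothetical and the harvester itself may not perform such a check.

```python
import shutil

# Command-line tools named in this README
REQUIRED_TOOLS = ("gzip", "gunzip", "ps2pdf")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing command-line tools:", ", ".join(missing))
    else:
        print("All required command-line tools are available.")
```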

First, download the full arXiv metadata JSON file available at https://www.kaggle.com/Cornell-University/arxiv (1GB compressed). It is actually a JSONL file (one JSON document per line), currently named arxiv-metadata-oai-snapshot.json.zip. You can also generate this file yourself with the arxiv-public-datasets OAI harvester, which uses the arXiv OAI-PMH service.
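
Since the snapshot is a zipped JSONL file, it can be read one record at a time without unpacking it to disk. The sketch below shows one way to do this with the standard library. It assumes the archive contains a single JSONL member; the function name `iter_metadata` is hypothetical and this is not necessarily how the harvester reads the file.

```python
import io
import json
import zipfile


def iter_metadata(zip_path):
    """Yield one dict per non-empty JSONL line of the first archive member."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive holds exactly one JSONL file
        member = archive.namelist()[0]
        with archive.open(member) as raw:
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    import sys

    # Usage: python iter_metadata.py arxiv-metadata-oai-snapshot.json.zip
    if len(sys.argv) > 1:
        for count, record in enumerate(iter_metadata(sys.argv[1])):
            print(record.get("id"))
            if count >= 4:
                break
```

Streaming matters here because the uncompressed snapshot is several times larger than the 1GB download.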

Clone this GitHub repository:

git clone https://github.com/kermitt2/arxiv_harvester
cd arxiv_harvester

Set up a virtual environment:

virtualenv --system-site-packages -p python3.8 env
source env/bin/activate

Install the dependencies:

pip3 install -r requirements.txt

Finally, install the project in editable mode:

pip3 install -e .

Usage

First check the configuration file:

  • set the parameters according to your selected storage (AWS S3, SWIFT OpenStack or local storage), see below for more details,
  • the default batch_size for parallel download/upload is 10, change it as you wish and dare,
  • by default, gzip compression of files on the target storage is enabled.

The command-line help lists the available options:

arXiv harvester

optional arguments:
  -h, --help           show this help message and exit
  --config CONFIG      path to the config file, default is ./config.json
  --reset              ignore previous processing states and re-init the harvesting process from
                       the beginning
  --metadata METADATA  arXiv metadata json file
  --diagnostic         produce a summary of the harvesting
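
For readers who want to script around the tool, the options above correspond to an argument parser along the following lines. This is a reconstruction from the help text only, not the project's actual source: the option names, help strings and the `./config.json` default come from the help output, while the function name `build_parser` and the `None` default for `--metadata` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options shown in the help text."""
    parser = argparse.ArgumentParser(description="arXiv harvester")
    parser.add_argument("--config", default="./config.json",
                        help="path to the config file, default is ./config.json")
    parser.add_argument("--reset", action="store_true",
                        help="ignore previous processing states and re-init "
                             "the harvesting process from the beginning")
    # Assumption: no default metadata file is defined
    parser.add_argument("--metadata", default=None,
                        help="arXiv metadata json file")
    parser.add_argument("--diagnostic", action="store_true",
                        help="produce a summary of the harvesting")
    return parser


args = build_parser().parse_args(
    ["--metadata", "arxiv-metadata-oai-snapshot.json.zip", "--reset"]
)
print(args.config, args.metadata, args.reset, args.diagnostic)
```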

For example, to harvest articles from a metadata snapshot file:

python3 arxiv_harvester/harvester.py --metadata arxiv-metadata-oai-snapshot.json.zip --config config.json

To reset an existing harvesting and start again from scratch, add the --reset argument:

python3 arxiv_harvester/harvester.py --metadata arxiv-metadata-oai-snapshot.json.zip --config config.json --reset

Note that with --reset, no actual stored PDF file is removed - only the harvesting process is reinitialized.

Interrupted harvesting / Incremental update

Launching the harvesting command again after an interruption automatically resumes the harvesting where it stopped.

If the arXiv metadata file has been updated to a newer version (downloaded from https://www.kaggle.com/Cornell-University/arxiv or generated with the arxiv-public-datasets OAI harvester), launching the harvesting command on the updated metadata file will harvest only the new and updated articles (those with a newer most recent PDF version).
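
The decision behind resuming and incremental updates can be pictured as a comparison between the version already stored and the latest version listed in the metadata. The sketch below is an assumption about the logic, not the harvester's implementation: the `stored_versions` mapping stands in for whatever processing state the tool keeps, and the function name `needs_harvesting` is hypothetical.

```python
def needs_harvesting(article_id, latest, stored_versions):
    """Decide whether an article should be (re)harvested.

    `latest` is the most recent version label in the metadata (e.g. 'v3').
    `stored_versions` is a hypothetical mapping from article identifier to
    the version label of the PDF already stored.
    """
    stored = stored_versions.get(article_id)
    if stored is None:
        # Never harvested: new article or interrupted before reaching it
        return True
    # Harvest again only if the metadata lists a newer version
    return int(latest.lstrip("v")) > int(stored.lstrip("v"))


state = {"1501.00001": "v1", "quant-ph/0602109": "v2"}
print(needs_harvesting("1501.00001", "v2", state))        # updated article
print(needs_harvesting("quant-ph/0602109", "v2", state))  # already up to date
print(needs_harvesting("1501.00002", "v1", state))        # new article
```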

Resource file organization

The organization of harvested files allows direct access to a PDF based on its arXiv identifier. In particular, the Open Access link given by Unpaywall for an arXiv resource is enough to build a direct access path. The layout also avoids storing too many files in the same directory, for performance reasons.

The stored PDF is always the most recent version, so there is no need to know the exact latest version number (information that is not available from the Unpaywall arXiv full-text links, for example). The local metadata file for the article gives the version number of the stored PDF.

For example, the access paths are derived from the identifiers or Unpaywall OA URLs as follows:

  • post-2007 arXiv identifiers (pattern arXiv:YYMM.numbervV or commonly YYMM.numbervV):

    • 1501.00001v1 -> $root/arXiv/1501/1501.00001/1501.00001.pdf (most recent version of the PDF), $root/arXiv/1501/1501.00001/1501.00001.json (arXiv metadata for the article)
    • Unpaywall link http://arxiv.org/pdf/1501.00001 -> $root/arXiv/1501/1501.00001/1501.00001.pdf, $root/arXiv/1501/1501.00001/1501.00001.json
  • pre-2007 arXiv identifiers (pattern archive.subject_class/YYMMnumber):

    • quant-ph/0602109 -> $root/quant-ph/0602/0602109/0602109.pdf (most recent version of the PDF), $root/quant-ph/0602/0602109/0602109.json (arXiv metadata for the article)

    • Unpaywall link https://arxiv.org/pdf/quant-ph/0602109 -> $root/quant-ph/0602/0602109/0602109.pdf, $root/quant-ph/0602/0602109/0602109.json

If the compression option is set to True in the configuration file config.json, all the resources have an additional .gz extension.

$root in the above examples should be adapted to the storage of choice, as configured in the configuration file config.json. For instance with AWS S3: https://bucket_name.s3.amazonaws.com/arXiv/1501/1501.00001/1501.00001.pdf (if access rights are appropriate). The same applies to a SWIFT object storage based on the container name indicated in the config file.
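
The mapping described in this section can be sketched in a few lines of Python. The function below reproduces the four examples given above. The function name `resource_paths` and its parameters are hypothetical, and the handling of an `arXiv:` prefix and of pre-2007 identifiers carrying a subject class is an assumption, since this README does not show such cases.

```python
import re

_URL_PREFIX = re.compile(r"^https?://arxiv\.org/pdf/")
_VERSION_SUFFIX = re.compile(r"v\d+$")


def resource_paths(identifier, root="$root", compression=False):
    """Return (pdf_path, json_path) for an arXiv identifier or Unpaywall URL."""
    ident = _URL_PREFIX.sub("", identifier.strip())
    if ident.lower().startswith("arxiv:"):
        ident = ident[len("arxiv:"):]
    # The stored PDF is always the latest version, so drop any vN suffix
    ident = _VERSION_SUFFIX.sub("", ident)
    if "/" in ident:
        # Pre-2007 form, e.g. quant-ph/0602109
        archive, number = ident.split("/", 1)
    else:
        # Post-2007 form, e.g. 1501.00001
        archive, number = "arXiv", ident
    base = f"{root}/{archive}/{number[:4]}/{number}/{number}"
    suffix = ".gz" if compression else ""
    return base + ".pdf" + suffix, base + ".json" + suffix


print(resource_paths("1501.00001v1"))
print(resource_paths("https://arxiv.org/pdf/quant-ph/0602109", compression=True))
```

Replace `root` with the local data path or the bucket or container URL, as configured in config.json.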

AWS S3 and SWIFT configuration

For local storage, just indicate the path where the PDFs should be stored with the data_path parameter in the configuration file config.json.

The configuration for S3 storage uses the following parameters:

{
    "aws_access_key_id": "",
    "aws_secret_access_key": "",
    "bucket_name": "",
    "region": ""
}

If you are not using S3 storage, remove these keys or leave their values empty.

The configuration for a SWIFT object storage uses the following parameters:

{
    "swift": {},
    "swift_container": ""
}

If you are not using SWIFT storage, remove these keys or leave their values empty.

The "swift" key contains the account and authentication information, typically for Keystone, for example:

{
    "swift": {
        "auth_version": "3",
        "auth_url": "https://auth......./v3",
        "os_username": "user-007",
        "os_password": "1234",
        "os_user_domain_name": "Default",
        "os_project_domain_name": "Default",
        "os_project_name": "myProjectName",
        "os_project_id": "myProjectID",
        "os_region_name": "NorthPole",
        "os_auth_url": "https://auth......./v3"
    },
    "swift_container": "my_arxiv_harvesting"
}
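
Given the rule that unused storage keys may be removed or left empty, the choice of storage backend can be pictured as below. This is a sketch under stated assumptions: the function name `select_storage` is hypothetical, and the precedence (S3 first, then SWIFT, then local) is an arbitrary choice made for the example, not documented behavior of the harvester.

```python
def select_storage(config):
    """Return 's3', 'swift' or 'local' depending on which keys are filled in.

    Missing keys and empty values are treated the same way, matching the
    rule that unused settings can be removed or left empty.
    """
    if config.get("bucket_name"):
        return "s3"
    if config.get("swift") and config.get("swift_container"):
        return "swift"
    return "local"


local_config = {"data_path": "./data", "bucket_name": "", "swift": {}, "swift_container": ""}
swift_config = {"swift": {"auth_version": "3"}, "swift_container": "my_arxiv_harvesting"}
print(select_storage(local_config), select_storage(swift_config))
```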

Limitations

Source files (LaTeX sources) are not available via the Kaggle dataset and therefore cannot be harvested with this tool. The LaTeX source files are available via AWS S3 Bulk Source File Access.

There are 44 articles only available in HTML format. These articles will not be harvested.

Acknowledgements

Kaggle arXiv dataset relies on arxiv-public-datasets:

Clement, C. B., Bierbaum, M., O'Keeffe, K. P., & Alemi, A. A. (2019). On the Use of ArXiv as a Dataset. arXiv preprint arXiv:1905.00075.

License and contact

This modest tool is distributed under the Apache 2.0 license. The dependencies used in the project are either distributed under the Apache 2.0 license themselves or under a compatible license.

If you contribute to this Open Source project, you agree to share your contribution following this license.

The Kaggle dataset arXiv Metadata is distributed under the CC0 1.0 license. Note that most articles on arXiv are submitted with the default arXiv license, which usually does not allow redistribution. See here about the possible usage of the harvested PDF.

Main author and contact: Patrice Lopez ([email protected])
