A Python client for the Softcite software mention recognizer server

Overview

Softcite software mention recognizer client

Python client for using the Softcite software mention recognition service. It can be applied to

  • individual PDF files

  • recursively to a local directory, processing all the encountered PDF

  • to a collection of documents harvested by biblio-glutton-harvester and article-dataset-builder, with the benefit of re-using the collection manifest for injectng metadata and keeping track of progress. The collection can be stored locally or on a S3 storage.

Requirements

The client has been tested with Python 3.5-3.7.

The client requires a working Softcite software mention recognition service. Service host and port can be changed in the config.json file of the client.

Install

cd software_mention_client/

It is advised to setup first a virtual environment to avoid falling into one of these gloomy python dependency marshlands:

virtualenv --system-site-packages -p python3 env

source env/bin/activate

Install the dependencies, use:

pip3 install -r requirements.txt

Usage and options

usage: software_mention_client.py [-h] [--repo-in REPO_IN] [--file-in FILE_IN]
                                  [--file-out FILE_OUT]
                                  [--data-path DATA_PATH] [--config CONFIG]
                                  [--reprocess] [--reset] [--load]
                                  [--diagnostic] [--scorched-earth]

Softcite software mention recognizer client

optional arguments:
  -h, --help            show this help message and exit
  --repo-in REPO_IN     path to a directory of PDF files to be processed by
                        the Softcite software mention recognizer
  --file-in FILE_IN     a single PDF input file to be processed by the
                        Softcite software mention recognizer
  --file-out FILE_OUT   path to a single output the software mentions in JSON
                        format, extracted from the PDF file-in
  --data-path DATA_PATH
                        path to the resource files created/harvested by
                        biblio-glutton-harvester
  --config CONFIG       path to the config file, default is ./config.json
  --reprocess           reprocessed failed PDF
  --reset               ignore previous processing states and re-init the
                        annotation process from the beginning
  --load                load json files into the MongoDB instance, the --repo-
                        in parameter must indicate the path to the directory
                        of resulting json files to be loaded
  --diagnostic          perform a full count of annotations and diagnostic
                        using MongoDB regarding the harvesting and
                        transformation process
  --scorched-earth      remove a PDF file after its successful processing in
                        order to save storage space, careful with this!

The logs are written by default in a file ./client.log, but the location of the logs can be changed in the configuration file (default ./config.json).

Processing local PDF files

For processing a single file., the resulting json being written as file at the indicated output path:

python3 software_mention_client.py --file-in toto.pdf --file-out toto.json

For processing recursively a directory of PDF files, the results will be:

  • written to a mongodb server and database indicated in the config file

  • and in the directory of PDF files, as json files, together with each processed PDF

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/

The default config file is ./config.json, but could also be specified via the parameter --config:

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/ --config ./my_config.json

Processing a collection of PDF harvested by biblio-glutton-harvester

biblio-glutton-harvester and article-dataset-builder creates a collection manifest as a LMDB database to keep track of the harvesting of large collection of files. Storage of the resource can be located on a local file system or on a AWS S3 storage. The software-mention client will use the collection manifest to process these harvested documents.

  • locally:

python3 software_mention_client.py --data-path /mnt/data/biblio-glutton-harvester/data/

--data-path indicates the path to the repository of data harvested by biblio-glutton-harvester.

The resulting JSON files will be enriched by the metadata records of the processed PDF and will be stored together with each processed PDF in the data repository.

If the harvested collection is located on a S3 storage, the access information must be indicated in the configuration file of the client config.json. The extracted software mention will be written in a file with extension .software.json, for example:

-rw-rw-r-- 1 lopez lopez 1.1M Aug  8 03:26 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.pdf
-rw-rw-r-- 1 lopez lopez  485 Aug  8 03:41 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.software.json

If a MongoDB server access information is indicated in the configuration file config.json, the extracted information will additionally be written in MongoDB.

License and contact

Distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

Main author and contact: Patrice Lopez ([email protected])

Python Client for Yandex Cloud Logging

Python Client for Yandex Cloud Logging Installation pip3 install python-yandex-cloud-logging Creating a Yandex Cloud Logging Group yc logging group c

MCode 0 Dec 08, 2021
Simple integrate of API udemy.com with python

Pyudemy Simple integrate of API udemy.com with python Quick start $ pip install pyudemy or $ python setup.py install Authentication To make any calls

Hudson Brendon 30 Jan 02, 2023
The implementation of Learning Instance and Task-Aware Dynamic Kernels for Few Shot Learning

INSTA: Learning Instance and Task-Aware Dynamic Kernels for Few Shot Learning This repository provides the implementation and demo of Learning Instanc

11 Jan 02, 2023
A fast, easy to set up telegram userbot running Python 3 which uses fork of the Telethon Library.

forked from friendly-telegram/friendly-telegram Friendly Telegram Userbot A fast, easy to set up telegram userbot running Python 3 which uses fork of

GeekTG 75 Jan 04, 2023
An API Client package to access the APIs for NBA.com

nba_api An API Client package to access the APIs for NBA.com Development Version: v1.1.9 nba_api is an API Client for www.nba.com. This package is mea

Swar Patel 1.4k Jan 01, 2023
Pycardano - A lightweight Cardano client in Python

PyCardano PyCardano is a standalone Cardano client written in Python. The librar

151 Dec 31, 2022
Terraform Cloud CLI for Managing Workspace Terraform Versions

Terraform Cloud Version Manager This tiny script makes it easy to update the Terraform Version on all of the Workspaces inside Terraform Cloud. It wil

Robert Hafner 1 Jan 07, 2022
DaProfiler vous permet d'automatiser vos recherches sur des particuliers basés en France uniquement et d'afficher vos résultats sous forme d'arbre.

A but educatif seulement. DaProfiler DaProfiler vous permet de créer un profil sur votre target basé en France uniquement. La particularité de ce prog

Dalunacrobate 73 Dec 21, 2022
A self-hosted Discord music bot.

Cassette A self-hosted Discord music bot. Requirements py-cord pynacl pytube Setup Intended to be hosted on Heroku. Fork or clone this repo. Create a

Lohan 8 Apr 28, 2022
A telegram bot written in Python to fetch random SFW & NSFW anime images

Tsuzumi A telegram bot written in python to fetch both random SFW & NSFW Anime images using nekos.life & waifu.pics API Commands SFW Commands : /

Nisarga Adhikary 3 Oct 12, 2022
Python script to decode the EU Covid-19 vaccine certificate

vacdec Python script to decode the EU Covid-19 vaccine certificate This script takes an image with a QR code of a vaccine certificate as the parameter

Hanno Böck 244 Nov 30, 2022
Utility for converting IP Fabric webhooks into a Teams format

IP Fabric Webhook Integration for Microsoft Teams and/or Slack Setup IP Fabric Setup Go to Settings Webhooks Add webhook Provide a name URL will b

Community Fabric 1 Jan 26, 2022
An API wrapper around the pythonanywhere's API.

pyaww An API wrapper around the pythonanywhere's API. The name stands for pythonanywherewrapper. 100% API coverage Most of the codebase is documented

7 Dec 11, 2022
A python wrapper for interacting with the LabArchives API.

LabArchives API wrapper for Python A python wrapper for interacting with the LabArchives API. This very simple package makes it easier to make arbitra

Marek Cmero 3 Aug 01, 2022
Keypirinha plugin to install packages via Chocolatey

Keypiriniha Chocolatey This is a package for the fast keystroke launcher keypirinha (http://keypirinha.com/) It allows you to search & install package

Shadab Zafar 4 Nov 26, 2022
PRNT.sc Image Grabber

PRNTSender PRNT.sc Image Grabber PRNTSender is a script that takes images posted on PRNT.sc and sends them to a Discord webhook, if you want to know h

neox 2 Dec 10, 2021
The Most Simple yet Powerful and Advanced Google Colab Notebook for Zip, Unzip, Tar, UnTar, RaR, UnRaR Files in Google Drive

The Most Simple yet Powerful and Advanced Google Colab Notebook for Zip, Unzip, Tar, UnTar, RaR, UnRaR Files in Google Drive

Dr.Caduceus 15 Aug 16, 2022
This is a Anti Channel Ban Robots

AntiChannelBan This is a Anti Channel Ban Robots delete and ban message sent by channels Heroku Deployment 💜 Heroku is the best way to host ur Projec

BᵣₐyDₑₙ 25 Dec 10, 2021
E-Commerce Telegram Bot for UCA Students

ucaStudentStore To buy from and sell to other students Features Register the first time, after that you will always be recognised You can login either

Shukur Sabzaliev 5 Jun 26, 2022
[Multithreading] [Proxy - auto & infile]

Discord-Token-Generator-AutoCheck [Multithreading] [Proxy - auto & infile] How to install? pip install -r requirements.txt run generator.py with pytho

Chakeaw__ 3 Oct 17, 2021