A Python client for the Softcite software mention recognizer server

Overview

Softcite software mention recognizer client

Python client for using the Softcite software mention recognition service. It can be applied to

  • individual PDF files

  • recursively to a local directory, processing all the encountered PDF

  • to a collection of documents harvested by biblio-glutton-harvester and article-dataset-builder, with the benefit of re-using the collection manifest for injectng metadata and keeping track of progress. The collection can be stored locally or on a S3 storage.

Requirements

The client has been tested with Python 3.5-3.7.

The client requires a working Softcite software mention recognition service. Service host and port can be changed in the config.json file of the client.

Install

cd software_mention_client/

It is advised to setup first a virtual environment to avoid falling into one of these gloomy python dependency marshlands:

virtualenv --system-site-packages -p python3 env

source env/bin/activate

Install the dependencies, use:

pip3 install -r requirements.txt

Usage and options

usage: software_mention_client.py [-h] [--repo-in REPO_IN] [--file-in FILE_IN]
                                  [--file-out FILE_OUT]
                                  [--data-path DATA_PATH] [--config CONFIG]
                                  [--reprocess] [--reset] [--load]
                                  [--diagnostic] [--scorched-earth]

Softcite software mention recognizer client

optional arguments:
  -h, --help            show this help message and exit
  --repo-in REPO_IN     path to a directory of PDF files to be processed by
                        the Softcite software mention recognizer
  --file-in FILE_IN     a single PDF input file to be processed by the
                        Softcite software mention recognizer
  --file-out FILE_OUT   path to a single output the software mentions in JSON
                        format, extracted from the PDF file-in
  --data-path DATA_PATH
                        path to the resource files created/harvested by
                        biblio-glutton-harvester
  --config CONFIG       path to the config file, default is ./config.json
  --reprocess           reprocessed failed PDF
  --reset               ignore previous processing states and re-init the
                        annotation process from the beginning
  --load                load json files into the MongoDB instance, the --repo-
                        in parameter must indicate the path to the directory
                        of resulting json files to be loaded
  --diagnostic          perform a full count of annotations and diagnostic
                        using MongoDB regarding the harvesting and
                        transformation process
  --scorched-earth      remove a PDF file after its successful processing in
                        order to save storage space, careful with this!

The logs are written by default in a file ./client.log, but the location of the logs can be changed in the configuration file (default ./config.json).

Processing local PDF files

For processing a single file., the resulting json being written as file at the indicated output path:

python3 software_mention_client.py --file-in toto.pdf --file-out toto.json

For processing recursively a directory of PDF files, the results will be:

  • written to a mongodb server and database indicated in the config file

  • and in the directory of PDF files, as json files, together with each processed PDF

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/

The default config file is ./config.json, but could also be specified via the parameter --config:

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/ --config ./my_config.json

Processing a collection of PDF harvested by biblio-glutton-harvester

biblio-glutton-harvester and article-dataset-builder creates a collection manifest as a LMDB database to keep track of the harvesting of large collection of files. Storage of the resource can be located on a local file system or on a AWS S3 storage. The software-mention client will use the collection manifest to process these harvested documents.

  • locally:

python3 software_mention_client.py --data-path /mnt/data/biblio-glutton-harvester/data/

--data-path indicates the path to the repository of data harvested by biblio-glutton-harvester.

The resulting JSON files will be enriched by the metadata records of the processed PDF and will be stored together with each processed PDF in the data repository.

If the harvested collection is located on a S3 storage, the access information must be indicated in the configuration file of the client config.json. The extracted software mention will be written in a file with extension .software.json, for example:

-rw-rw-r-- 1 lopez lopez 1.1M Aug  8 03:26 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.pdf
-rw-rw-r-- 1 lopez lopez  485 Aug  8 03:41 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.software.json

If a MongoDB server access information is indicated in the configuration file config.json, the extracted information will additionally be written in MongoDB.

License and contact

Distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

Main author and contact: Patrice Lopez ([email protected])

A basic implementation of the Battlesnake API in Python

Getting started with Battlesnake and Python This is a basic implementation of the Battlesnake API in Python. It's a great starting point for anyone wa

Gaurav Batra 2 Dec 08, 2021
Google Sheets Python API v4

pygsheets - Google Spreadsheets Python API v4 A simple, intuitive library for google sheets which gets your work done. Features: Open, create, delete

Nithin Murali 1.4k Jan 08, 2023
API RestFull de uma clinica, onde vai efetuar os agendamentos dos pacientes e mostrar o historicos de cada agendamentos

API REstFull O que tem na API Usado para clinicas. Cadastro de pacientes. Agendamentos de pacientes. Históricos dos agendamentos vinculados com a tabe

Lucas Silva 3 Aug 29, 2022
Бот Telegram для Школы в Капотне (ЦО № 1858)

co1858 Telegram Bot Активно разрабатывался в 2015-2016 году как учебный проект, с целью научиться создавать ботов для Telegram. Бот автоматически парс

Ilya Pavlov 4 Aug 30, 2022
Install and manage Proton-GE and Luxtorpeda for Steam and Wine-GE for Lutris with this graphical user interface. Based on AUNaseef's ProtonUp, made with Python 3 and Qt 6.

ProtonUp-Qt Qt-based graphical user interface to install and manage Proton-GE installations for Steam and Wine-GE installations for Lutris. Based on A

638 Jan 02, 2023
Python client library for Postmark API

Postmarker Python client library for Postmark API. Gitter: https://gitter.im/Stranger6667/postmarker Installation Postmarker can be obtained with pip:

Dmitry Dygalo 109 Dec 13, 2022
Hydro Quebec API wrapper.

HydroQC Hydro Quebec API wrapper. This is a package to access some functionalities of Hydro Quebec API that are not documented. Documentation https://

Olivier BEAU 9 Dec 02, 2022
A discord bot with a leveling system (similar to mee6).

Discord.py A discord bot with a leveling system (like mee6) Pre-requisites Knowing how to get create an app/bot via discord's developer portal. Websit

26 Dec 11, 2022
A free sniper bot built to work with PancakeSwap: Router V2

Pancakeswap Sniper Bot PancakeSwap sniper bot. Automated sniping bot to snipe crypto coin launches. How it works The sniping bot can be used in three

89 Aug 06, 2022
Simple software that can send WhatsApp message to a single or multiple users (including unsaved number**)

wp-automation Info: this is a simple automation software that sends WhatsApp message to single or multiple users. Key feature: -Sends message to multi

3 Jan 31, 2022
A discord bot can stress ip addresses with python tool

Python-ddos-bot Coded by Lamp#1442 A discord bot can stress ip addresses with python tool. Warning! DOS or DDOS is illegal, i shared for educational p

IrgyGANS 1 Nov 16, 2021
👾 Telegram Smart Group Assistant 🤖

DarkHelper 🌖 Features ⚡️ Smart anti-apam & anti-NFSW message checker Tag Members , Entertain facility , Welcommer ban , unban , mute , unmute , lock

amirali rajabi 38 Dec 18, 2022
A modified Sequential and NLP based Bot

A modified Sequential and NLP based Bot I improvised this bot a bit with some implementations as a part of my own hobby project :) Note: I do not own

Jay Desale 2 Jan 07, 2022
Automatically Forward files from groups to channel & FSub

Backup & ForceSub Automatically Forward files from groups to channel & Do force sub on members Variables API_ID : Get from my.telegram.org API_HASH :

Arunkumar Shibu 7 Nov 06, 2022
Framework to make using Bottle less time-consuming and easier

A class for the Bottle API to reduce clutter and difficulty while creating a website.

Tygzy 0 Dec 26, 2022
A Python API for Connected 2

connected API for Connected 2 api for the { connected 2 } programmer : api report api follow api check username api forget password api Search api cha

2 Jun 05, 2022
A cracking tool of Xiaomi Dr AI (Archytas / Archimedes)

Archytas Tool 我们强烈抵制闲鱼平台上未经授权的刷机服务! 我对本人之前在程序中为防止违规刷机服务添加未生效的格机代码感到抱歉,在此声明此过激行为与 Crack Mi Dr AI Team 无关,并将程序开源。 A cracking tool of Xiaomi Dr AI (Archy

rponeawa 5 Oct 25, 2022
Media Replay Engine (MRE) is a framework to build automated video clipping and replay (highlight) generation pipelines for live and video-on-demand content.

Media Replay Engine (MRE) is a framework for building automated video clipping and replay (highlight) generation pipelines using AWS services for live

Amazon Web Services - Labs 30 Nov 29, 2022
TON Miner from TON-Pool.com

TON-Pool Miner Miner from TON-Pool.com

21 Nov 18, 2022
Streaming Finance Data with AWS Lambda

A data pipeline consisting of an AWS lambda function reading data from yfinance API, an AWS Kinesis stream to receive & store data in S3 buckets and AWS Glue crawler & Athena to run SQL queries.

Aarif Munwar Jahan 4 Aug 30, 2022