A Python client for the Softcite software mention recognizer server

Overview

Softcite software mention recognizer client

Python client for using the Softcite software mention recognition service. It can be applied to

  • individual PDF files

  • recursively to a local directory, processing all the encountered PDF

  • to a collection of documents harvested by biblio-glutton-harvester and article-dataset-builder, with the benefit of re-using the collection manifest for injectng metadata and keeping track of progress. The collection can be stored locally or on a S3 storage.

Requirements

The client has been tested with Python 3.5-3.7.

The client requires a working Softcite software mention recognition service. Service host and port can be changed in the config.json file of the client.

Install

cd software_mention_client/

It is advised to setup first a virtual environment to avoid falling into one of these gloomy python dependency marshlands:

virtualenv --system-site-packages -p python3 env

source env/bin/activate

Install the dependencies, use:

pip3 install -r requirements.txt

Usage and options

usage: software_mention_client.py [-h] [--repo-in REPO_IN] [--file-in FILE_IN]
                                  [--file-out FILE_OUT]
                                  [--data-path DATA_PATH] [--config CONFIG]
                                  [--reprocess] [--reset] [--load]
                                  [--diagnostic] [--scorched-earth]

Softcite software mention recognizer client

optional arguments:
  -h, --help            show this help message and exit
  --repo-in REPO_IN     path to a directory of PDF files to be processed by
                        the Softcite software mention recognizer
  --file-in FILE_IN     a single PDF input file to be processed by the
                        Softcite software mention recognizer
  --file-out FILE_OUT   path to a single output the software mentions in JSON
                        format, extracted from the PDF file-in
  --data-path DATA_PATH
                        path to the resource files created/harvested by
                        biblio-glutton-harvester
  --config CONFIG       path to the config file, default is ./config.json
  --reprocess           reprocessed failed PDF
  --reset               ignore previous processing states and re-init the
                        annotation process from the beginning
  --load                load json files into the MongoDB instance, the --repo-
                        in parameter must indicate the path to the directory
                        of resulting json files to be loaded
  --diagnostic          perform a full count of annotations and diagnostic
                        using MongoDB regarding the harvesting and
                        transformation process
  --scorched-earth      remove a PDF file after its successful processing in
                        order to save storage space, careful with this!

The logs are written by default in a file ./client.log, but the location of the logs can be changed in the configuration file (default ./config.json).

Processing local PDF files

For processing a single file., the resulting json being written as file at the indicated output path:

python3 software_mention_client.py --file-in toto.pdf --file-out toto.json

For processing recursively a directory of PDF files, the results will be:

  • written to a mongodb server and database indicated in the config file

  • and in the directory of PDF files, as json files, together with each processed PDF

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/

The default config file is ./config.json, but could also be specified via the parameter --config:

python3 software_mention_client.py --repo-in /mnt/data/biblio/pmc_oa_dir/ --config ./my_config.json

Processing a collection of PDF harvested by biblio-glutton-harvester

biblio-glutton-harvester and article-dataset-builder creates a collection manifest as a LMDB database to keep track of the harvesting of large collection of files. Storage of the resource can be located on a local file system or on a AWS S3 storage. The software-mention client will use the collection manifest to process these harvested documents.

  • locally:

python3 software_mention_client.py --data-path /mnt/data/biblio-glutton-harvester/data/

--data-path indicates the path to the repository of data harvested by biblio-glutton-harvester.

The resulting JSON files will be enriched by the metadata records of the processed PDF and will be stored together with each processed PDF in the data repository.

If the harvested collection is located on a S3 storage, the access information must be indicated in the configuration file of the client config.json. The extracted software mention will be written in a file with extension .software.json, for example:

-rw-rw-r-- 1 lopez lopez 1.1M Aug  8 03:26 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.pdf
-rw-rw-r-- 1 lopez lopez  485 Aug  8 03:41 0100a44b-6f3f-4cf7-86f9-8ef5e8401567.software.json

If a MongoDB server access information is indicated in the configuration file config.json, the extracted information will additionally be written in MongoDB.

License and contact

Distributed under Apache 2.0 license. The dependencies used in the project are either themselves also distributed under Apache 2.0 license or distributed under a compatible license.

Main author and contact: Patrice Lopez ([email protected])

The successor of GeoSnipe, a pythonic Minecraft username sniper based on AsyncIO.

OneSnipe The successor of GeoSnipe, a pythonic Minecraft username sniper based on AsyncIO. Documentation View Documentation Features • Mojang & Micros

1 Jan 14, 2022
TwitterBot-ImageCollector - Twitter bot that collects images from likes saves the image

TwitterBot-ImageCollector Bot de Twitter que recolecta imagenes a partir de los

Gx3 Studios 4 Jun 01, 2022
Aplicação dos metodos de classificação em 3 diferentes banco de dados. Usando...

Machine Learning - Métodos de classificação Base de Dados utilizadas: Dados de crédito Dados do Census Métodos de classificação aplicados: Naive Bayes

1 Jan 18, 2022
A quick and dirty script to scan the network, find default credentials on services and post a message to a Slack channel with the results.

A quick and dirty script to scan the network, find default credentials on services and post a message to a Slack channel with the results.

Security Weekly 11 Jun 03, 2022
A Python wrapper for the tesseract-ocr API

tesserocr A simple, Pillow-friendly, wrapper around the tesseract-ocr API for Optical Character Recognition (OCR). tesserocr integrates directly with

Fayez 1.7k Jan 03, 2023
Pinopoly is a tool to remove the "banker" player and replace them with a digitalized system

Pinopoly is a tool to remove the "banker" player and replace them with a digitalized system. It is intended to be used on a Raspberry Pi but can be used in the command line as well.

Alex Overstreet 11 Jul 09, 2022
:snake: Python SDK to query Scaleway APIs.

Scaleway SDK Python SDK to query Scaleway's APIs. Stable release: Development: Installation The package is available on pip. To install it in a virtua

Scaleway 114 Dec 11, 2022
discord token grabber scam - eductional purposes only!

Discord-QR-Scam תופס אסימון תמונה של Discord על אודות סקריפט Python שיוצר אוטומטית קוד QR הונאה של Nitro ותופס את אסימון הדיסקורד בעת סריקה. כלי זה מד

Amit Pinchasi 0 May 22, 2022
A project that automatically sends you a Medium article on a topic of your choosing to your email address daily.

Daily Article from Medium ✏️ About A project that automatically sends you a Medium article on a topic of your choosing to your email address daily. No

Orhan Emre Dikicigil 2 Apr 27, 2022
Clubhouse API written in Python. Standalone client included. For reference and education purposes only.

clubhouse-py is originally developed for the sake of interoperability. Standalone client is also created with very basic features, including but not limited to the audio-chat

1.7k Jan 05, 2023
discord token grabber using python

Discord Token Grabber A Discord token grabber written in Python 3. This version of the grabber only supports Windows. Features No local caching Transf

1 Oct 28, 2021
A wrapper for aqquiring Choice Coin directly through a Python Terminal. Leverages the TinyMan Python-SDK.

CHOICE_TinyMan_Wrapper A wrapper that allows users to acquire Choice Coin directly through their Terminal using ALGO and various Algorand Standard Ass

Choice Coin 16 Sep 24, 2022
A telegram bot writen in python for mirroring files on the internet to Google Drive

owner of this repo :- AYUSH contact me :- AYUSH Slam Mirror Bot This is a telegram bot writen in python for mirroring files on the internet to our bel

Thanusara Pasindu 1 Nov 21, 2021
A python bot that stops muck chains

muck-chains-stopper-bot a bot that stops muck chains this is the source code of u/DaniDevChainBreaker (the main r/DaniDev muck chains breaker) guys th

24 Jan 04, 2023
Discord Crypto Payment Cards Selfbot

A Discord selfbot that serves the purpose of displaying text and QR versions of your BTC, LTC & ETH payment information for easy and simple commercial or personal transactions.

2 Apr 12, 2022
Running Performance Calculator

Running Performance Calculator 👉 Have you ever wondered if you ran 10km at 2000

Davide Liu 6 Oct 26, 2022
Python Capfire API wrapper

General CampfireAPI based on Campfire web. Install pip install Campfire-API Quickstart Use it without login: from campfire_api import CampfireAPI cf

Ghost 0 Jan 03, 2022
A Telegram Bot with(Forwarder Bot + User Bot + More Features )

A Telegram Bot with(Forwarder Bot + User Bot + More Features )

Kaif 3 Feb 16, 2022
Automated endpoint management for Amazon Aurora Global Database

This sample code can be used to manage Aurora global database endpoints. After failover the global database writer endpoints swap from one region to the other. This solution automates creation and ma

AWS Samples 13 Dec 08, 2022