Scans pdfs for links written in plaintext and checks if they are active or returns an error code.

Last update: Nov 21, 2022

Overview

Introduction

linkrot scans PDFs for links written in plain text and checks whether each one is active or returns an error code, then generates a report of its findings. It also extracts references (pdf, url, doi, arxiv) and metadata from a PDF.

Features

Extract references and metadata from a given PDF.
Detect pdf, url, arxiv and doi references.
Check for a valid SSL certificate.
Find broken hyperlinks (using the -c flag).
Output as text or JSON (using the -j flag).
Extract the PDF text (using the --text flag).
Use as a command-line tool or Python package.
Works with local and online PDFs.

Installation

Grab a copy of the code with pip:

pip install linkrot

Usage

linkrot can be used to extract info from a PDF in two ways:

Command line/Terminal tool linkrot
Python library import linkrot

1. Command Line/Terminal tool

linkrot [pdf-file-or-url]

Run linkrot -h to see the help output:

linkrot -h

usage:

linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf

Extract metadata and references from a PDF, and optionally download all referenced PDFs.

Arguments

positional arguments:

pdf (Filename or URL of a PDF file)

optional arguments:

-h, --help            (Show this help message and exit)  
-d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY (Download all referenced PDFs into specified directory)  
-c, --check-links     (Check for broken links)  
-j, --json            (Output infos as JSON (instead of plain text))  
-v, --verbose         (Print all references (instead of only PDFs))  
-t, --text            (Only extract text (no metadata or references))  
-o OUTPUT_FILE, --output-file OUTPUT_FILE (Output to specified file instead of console)  
--version             (Show program's version number and exit)
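
For readers who want to script around the tool, the option list above can be mirrored with the standard-library argparse module. This is a reconstruction from the help text only, not a copy of linkrot's own cli.py; the function name build_parser and the placeholder version string are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the documented linkrot options (a sketch, not the real cli.py)."""
    parser = argparse.ArgumentParser(
        prog="linkrot",
        description="Extract metadata and references from a PDF, "
        "and optionally download all referenced PDFs.",
    )
    parser.add_argument("pdf", help="Filename or URL of a PDF file")
    parser.add_argument("-d", "--download-pdfs", metavar="OUTPUT_DIRECTORY",
                        help="Download all referenced PDFs into specified directory")
    parser.add_argument("-c", "--check-links", action="store_true",
                        help="Check for broken links")
    parser.add_argument("-j", "--json", action="store_true",
                        help="Output infos as JSON (instead of plain text)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print all references (instead of only PDFs)")
    parser.add_argument("-t", "--text", action="store_true",
                        help="Only extract text (no metadata or references)")
    parser.add_argument("-o", "--output-file", metavar="OUTPUT_FILE",
                        help="Output to specified file instead of console")
    # The version string here is a placeholder, not the installed version.
    parser.add_argument("--version", action="version", version="%(prog)s (placeholder)")
    return parser


args = build_parser().parse_args(
    ["https://example.com/example.pdf", "-t", "-o", "pdf-text.txt"]
)
print(args.pdf, args.text, args.output_file)
```

Parsing a sample command line this way is a quick check that a combination of flags is spelled as the help output expects.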

Examples

Extract text to console

linkrot https://example.com/example.pdf -t

Extract text to file

linkrot https://example.com/example.pdf -t -o pdf-text.txt

Check Links

linkrot https://example.com/example.pdf -c

2. Main Python Library

Import the library:

import linkrot

Create an instance of the linkrot class like so:

pdf = linkrot.linkrot("filename-or-url.pdf") #pdf is the instance of the linkrot class

Now the following functions can be used to extract specific data from the pdf:

get_metadata()

Arguments: None

Usage:

metadata = pdf.get_metadata() #pdf is the instance of the linkrot class

Return type: Dictionary

Information Provided: All metadata associated with the PDF, including hidden ("secret") metadata, such as Creation date, Creator, Title, etc.

get_text()

Arguments: None

Usage:

text = pdf.get_text() #pdf is the instance of the linkrot class

Return type: String

Information Provided: The entire content of the PDF in string form.

get_references(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed 
	 values: 'pdf', 'url', 'doi', 'arxiv'. 
	 default: Provides all reference types.

sort: Whether reference should be sorted or not
      values: True or False. 
      default: Is not sorted.

Usage:

references_list = pdf.get_references() #pdf is the instance of the linkrot class

Return type: Set of linkrot.backends.Reference objects.

Each linkrot.backends.Reference object has 3 member variables:
- ref: actual URL/PDF/DOI/ARXIV
- reftype: type of reference
- page: page on which it was referenced

Information Provided: All references with their corresponding type and page number.
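
To make the shape of the result concrete, here is a minimal sketch of filtering and sorting objects that carry the three documented fields. The Ref dataclass is a hypothetical stand-in for linkrot.backends.Reference, and sorting by page and then by ref is an assumption; the library's own sort order is not stated above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in with the three documented member variables."""
    ref: str
    reftype: str
    page: int


def select_refs(refs: Iterable[Ref], reftype: Optional[str] = None,
                sort: bool = False) -> List[Ref]:
    """Keep refs of one type (or all when reftype is None), optionally sorted."""
    chosen = [r for r in refs if reftype is None or r.reftype == reftype]
    if sort:
        # Assumed ordering: by page first, then alphabetically by reference.
        chosen.sort(key=lambda r: (r.page, r.ref))
    return chosen


sample = {
    Ref("https://example.com/b.pdf", "pdf", 2),
    Ref("https://example.com", "url", 1),
    Ref("https://example.com/a.pdf", "pdf", 1),
}
for item in select_refs(sample, reftype="pdf", sort=True):
    print(item.page, item.ref)
```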

get_references_as_dict(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed 
	 values: 'pdf', 'url', 'doi', 'arxiv'. 
	 default: Provides all reference types.

sort: Whether reference should be sorted or not
      values: True or False. 
      default: Is not sorted.

Usage:

references_dict = pdf.get_references_as_dict() #pdf is the instance of the linkrot class

Return type: Dictionary with keys 'pdf', 'url', 'doi', 'arxiv' that each have a list of refs of that type.

Information Provided: All references in their corresponding type list.

download_pdfs(target_dir)

Arguments:

target_dir: The path of the directory to which the reference pdfs should be downloaded

Usage:

pdf.download_pdfs("target-directory") #pdf is the instance of the linkrot class

Return type: None

Information Provided: Downloads all the reference pdfs to specified directory.
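
The methods in this section can be combined into a small report. The sketch below relies only on the documented return types (a dictionary from get_metadata() and a dictionary of lists from get_references_as_dict()); summarize_pdf and FakePdf are hypothetical names, and "Title" as a metadata key is an assumption, since the actual keys depend on the PDF.

```python
from typing import Any, Dict


def summarize_pdf(pdf: Any) -> Dict[str, Any]:
    """Summarize any object exposing the documented linkrot methods."""
    metadata = pdf.get_metadata()                # documented: returns a dictionary
    refs_by_type = pdf.get_references_as_dict()  # documented: reftype -> list of refs
    return {
        # "Title" as a key name is an assumption; real keys depend on the PDF.
        "title": metadata.get("Title"),
        "counts": {reftype: len(refs) for reftype, refs in refs_by_type.items()},
    }


class FakePdf:
    """Hypothetical stand-in so the sketch runs without linkrot installed."""

    def get_metadata(self):
        return {"Title": "Sample", "Creator": "Example"}

    def get_references_as_dict(self):
        return {"pdf": ["https://example.com/a.pdf"], "url": [], "doi": [], "arxiv": []}


print(summarize_pdf(FakePdf()))
```

With linkrot installed, passing linkrot.linkrot("filename-or-url.pdf") in place of FakePdf() should work the same way, provided the methods behave as documented above.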

3. Linkrot downloader functions

Import:

from linkrot.downloader import sanitize_url, get_status_code, check_refs

sanitize_url(url)

Arguments:

url: The url to be sanitized.

Usage:

new_url = sanitize_url(old_url)

Return type: String

Information Provided: Prefixes the URL with 'http://' if it is not already prefixed, and makes sure it is UTF-8 encoded.
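
As an illustration of the behavior described for sanitize_url, here is a standard-library sketch. It is not linkrot's implementation: the name sanitize_url_sketch, the scheme-detection regex, and the choice to accept bytes input are all assumptions.

```python
import re

# Matches a leading scheme such as "http://" or "https://".
_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.\-]*://")


def sanitize_url_sketch(url) -> str:
    """Return url as a str with 'http://' prepended when no scheme is present."""
    if isinstance(url, bytes):
        # Assumed handling: decode bytes as UTF-8, replacing undecodable bytes.
        url = url.decode("utf-8", errors="replace")
    url = url.strip()
    if not _SCHEME.match(url):
        url = "http://" + url
    return url


print(sanitize_url_sketch("example.com/paper.pdf"))
print(sanitize_url_sketch("https://example.com"))
```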

get_status_code(url)

Arguments:

url: The url to be checked for its status.

Usage:

status_code = get_status_code(url)

Return type: String

Information Provided: Checks if the url is active or broken.

check_refs(refs, verbose=True, max_threads=MAX_THREADS_DEFAULT)

Arguments:

refs: set of linkrot.backends.Reference objects
verbose: whether it should print every reference with its code or just the summary of the link checker
max_threads: number of threads for multithreading

Usage:

check_refs(pdf.get_references()) #pdf is the instance of the linkrot class

Return type: None

Information Provided: Prints references with their status code and a summary of all the broken/active links on terminal.
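
The multithreaded checking described here can be sketched with concurrent.futures. This is an illustration under stated assumptions rather than the library's implementation: check_all and summarize are hypothetical names, the default of 5 threads is a guess (the value of MAX_THREADS_DEFAULT is not given above), and treating only "200" as active is a simplification.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def check_all(urls: Iterable[str], check: Callable[[str], str],
              max_threads: int = 5) -> Dict[str, str]:
    """Run check(url) for every URL on a thread pool and map url -> status."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map preserves input order, so zip pairs each URL with its status.
        codes = list(pool.map(check, url_list))
    return dict(zip(url_list, codes))


def summarize(results: Dict[str, str]) -> Dict[str, int]:
    """Count active and broken links; only '200' counts as active here."""
    active = sum(1 for code in results.values() if code == "200")
    return {"active": active, "broken": len(results) - active}


def fake_check(url: str) -> str:
    """Hypothetical checker so the sketch runs without network access."""
    return "404" if "missing" in url else "200"


results = check_all(["http://example.com", "http://example.com/missing"], fake_check)
print(results)
print(summarize(results))
```

In real use, the check argument would be a function that performs the request, such as get_status_code from this module.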

4. Linkrot extractor functions

Import:

from linkrot.extractor import extract_urls, extract_doi, extract_arxiv

Get pdf text:

text = pdf.get_text() #pdf is the instance of the linkrot class

extract_urls(text)

Arguments:

text: String of text to extract urls from

Usage:

urls = extract_urls(text)

Return type: Set of URLs

Information Provided: All URLs in the text
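
To show the general idea of pulling URLs out of plain text, here is a small regex-based sketch. The pattern is a deliberately simple assumption and is not the expression linkrot uses; it only recognizes links that start with http:// or https://.

```python
import re
from typing import Set

# Simplified pattern: a scheme followed by characters that are not
# whitespace, angle brackets, parentheses or square brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+", re.IGNORECASE)


def extract_urls_sketch(text: str) -> Set[str]:
    """Return the set of http(s) URLs found in text, minus trailing punctuation."""
    return {match.rstrip(".,;:") for match in URL_PATTERN.findall(text)}


sample = "See https://example.com/a.pdf, and also http://example.org."
print(sorted(extract_urls_sketch(sample)))
```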

extract_arxiv(text)

Arguments:

text: String of text to extract arxivs from

Usage:

arxiv = extract_arxiv(text)

Return type: Set of arxivs

Information Provided: All arxivs in the text
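
A rough sketch of arXiv detection follows. It assumes identifiers written in the modern "arXiv:YYMM.NNNNN" form with an optional version suffix; older identifier styles are ignored, and the pattern is not taken from linkrot.

```python
import re
from typing import Set

# Modern arXiv identifiers: four digits, a dot, four or five digits,
# optionally followed by a version such as "v2".
ARXIV_PATTERN = re.compile(r"arxiv:\s?(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def extract_arxiv_sketch(text: str) -> Set[str]:
    """Return the set of arXiv identifiers mentioned in text."""
    return set(ARXIV_PATTERN.findall(text))


sample = "Compare arXiv:1706.03762v5 with arXiv: 2001.08361 for details."
print(sorted(extract_arxiv_sketch(sample)))
```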

extract_doi(text)

Arguments:

text: String of text to extract dois from

Usage:

doi = extract_doi(text)

Return type: Set of dois

Information Provided: All dois in the text
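
DOI detection can be sketched in the same way. The pattern below is a common approximation (a "10." prefix, a registrant code of 4 to 9 digits, a slash, then a suffix) and is an assumption, not the expression linkrot uses.

```python
import re
from typing import Set

# Approximate DOI shape: 10.<registrant>/<suffix>, suffix ends at whitespace.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s<>]+")


def extract_doi_sketch(text: str) -> Set[str]:
    """Return the set of DOI-like strings in text, minus trailing punctuation."""
    return {match.rstrip(".,;:") for match in DOI_PATTERN.findall(text)}


sample = "Cited as doi:10.1000/xyz123. See also https://doi.org/10.1038/nphys1170, too."
print(sorted(extract_doi_sketch(sample)))
```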

Code of Conduct

To view our code of conduct please visit our Code of Conduct page.

License

This program is licensed with an MIT License.

Comments

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 55, column 10

Receive this error when I run the file. Traceback below. File Attached.

> Traceback (most recent call last):
>   File "c:\python38\lib\runpy.py", line 193, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "c:\python38\lib\runpy.py", line 86, in _run_code
>     exec(code, run_globals)
>   File "C:\Python38\Scripts\linkrot.exe\__main__.py", line 7, in <module>
>   File "c:\python38\lib\site-packages\linkrot\cli.py", line 182, in main
>     pdf = linkrot.linkrot(args.pdf)
>   File "c:\python38\lib\site-packages\linkrot\__init__.py", line 131, in __init__
>     self.reader = PDFMinerBackend(self.stream)
>   File "c:\python38\lib\site-packages\linkrot\backends.py", line 213, in __init__
>     self.metadata.update(xmp_to_dict(metadata))
>   File "c:\python38\lib\site-packages\linkrot\libs\xmp.py", line 92, in xmp_to_dict
>     return XmpParser(xmp).meta
>   File "c:\python38\lib\site-packages\linkrot\libs\xmp.py", line 41, in __init__
>     self.tree = ET.XML(xmp)
>   File "c:\python38\lib\xml\etree\ElementTree.py", line 1320, in XML
>     parser.feed(text)
> xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 55, column 10

ah-5.pdf
bug help wanted good first issue hacktoberfest python

opened by marshalmiller 11
Remove Python 2 checks and functionality.

Keeping support for Python 2 might be slowing down some of the process. Of more concern, in order to patch vulnerabilities in some libraries that Python 2 depends on, we have had to cut support for some versions of Python 3, specifically 3.6 and 3.7. Python 3.7 is still fairly widely used, and I think I'd prefer to remove Python 2 support and bring back 3.7, even though it's clearly a bigger task.
enhancement help wanted good first issue dependencies python

opened by marshalmiller 10
Move from `requirements.txt`, `requirements_dev.txt`, `setup.cfg`, and `setup.py` to `pyproject.toml`.

Is your feature request related to a problem? Please describe.

Hey @marshalmiller. As you may already know, the use of setup.cfg, setup.py, and requirements.txt files is quite outdated. Because of PEP 517, PEP 660, and PEP 631, the packaging is now being standardized on the usage of the pyproject.toml file.

Describe the solution you'd like

Given the above info, the project packaging should add support for pyproject.toml.

Describe alternatives you've considered

Not available.

Additional context

That's pretty much it. What do you think? Also, I would like to work on this issue.
enhancement hacktoberfest python

opened by wiseaidev 7
(Bug) AttributeError: 'NoneType' object has no attribute 'findall'
Describe the bug

Certain PDFs raise an AttributeError.

To Reproduce

Steps to reproduce the behavior:

Download Research_Ethics.pdf

Open terminal and run:

linkrot <path_to_above_file>

Expected behavior

It should generate the expected linkrot report.

bug help wanted hacktoberfest
opened by aditirao7 7
Add Link Archiving

I'd like to add a feature that takes all links that are verified to be active and adds them to the Internet Archive Wayback Machine to preserve them in time. There is a draft Python script in lib called archive.py. The idea is that when you navigate to https://web.archive.org/save/{url}, the service automatically archives that page. So after verifying that a link returns a valid code, we would just connect to each of those save URLs and it would create a snapshot. I'd love for this to be an optional argument like -a or something. This way it is optional and we don't take more resources than we need. Anyone able to complete this task, please take a stab at it.
enhancement help wanted good first issue hacktoberfest python

opened by marshalmiller 6
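
The archiving idea in this issue can be sketched as follows. Only the save URL pattern comes from the issue text; the function names are hypothetical, treating "200" as the only valid code is an assumption, and the sketch stops at building the save URLs rather than requesting them, so it is not the draft archive.py script.

```python
from typing import Dict, List

WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"


def wayback_save_url(url: str) -> str:
    """Build the Wayback Machine save URL for a link, per the pattern above."""
    return WAYBACK_SAVE_PREFIX + url


def save_urls_for_active(results: Dict[str, str]) -> List[str]:
    """Return save URLs for every link whose status is '200' (assumed 'valid')."""
    return [wayback_save_url(url) for url, code in results.items() if code == "200"]


checked = {"http://example.com": "200", "http://example.com/gone": "404"}
for save_url in save_urls_for_active(checked):
    # Visiting each save URL is what would trigger the snapshot.
    print(save_url)
```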

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 0: character maps to <undefined>

Receiving this error when running the file. Traceback Below. File Attached.

> Traceback (most recent call last):
>   File "c:\python38\lib\runpy.py", line 193, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "c:\python38\lib\runpy.py", line 86, in _run_code
>     exec(code, run_globals)
>   File "C:\Python38\Scripts\linkrot.exe\__main__.py", line 7, in <module>
>   File "c:\python38\lib\site-packages\linkrot\cli.py", line 182, in main
>     pdf = linkrot.linkrot(args.pdf)
>   File "c:\python38\lib\site-packages\linkrot\__init__.py", line 131, in __init__
>     self.reader = PDFMinerBackend(self.stream)
>   File "c:\python38\lib\site-packages\linkrot\backends.py", line 204, in __init__
>     self.metadata[k] = make_compat_str(v)
>   File "c:\python38\lib\site-packages\linkrot\backends.py", line 67, in make_compat_str
>     out_str = in_str.decode(enc["encoding"])
>   File "c:\python38\lib\encodings\cp1254.py", line 15, in decode
>     return codecs.charmap_decode(input,errors,decoding_table)
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 0: character maps to <undefined>

ah-1.pdf

bug help wanted hacktoberfest python

opened by marshalmiller 5
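
One general way to avoid a crash like the one in this traceback is to fall back to a lossy decode when the detected encoding cannot handle the bytes. The sketch below illustrates that technique only; decode_with_fallback is a hypothetical helper, and this is not necessarily how the project fixed the issue.

```python
def decode_with_fallback(raw: bytes, encoding: str) -> str:
    """Decode raw with the given encoding, falling back to lossy UTF-8."""
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Undecodable bytes become U+FFFD instead of raising an exception.
        # LookupError covers an unknown or misdetected encoding name.
        return raw.decode("utf-8", errors="replace")


# Byte 0x8d has no mapping in cp1254, the codec named in the traceback.
print(repr(decode_with_fallback(b"\x8d", "cp1254")))
print(decode_with_fallback(b"Title", "cp1254"))
```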

(Update) documentation for python library usage

The main documentation needs to be updated to include the usage of linkrot as a python library as well. Some of it can be found in the docstrings of this file.
enhancement

opened by aditirao7 5
Separate code from data
Is your feature request related to a problem? Please describe.

The current size of the repo is too big because of pdf data samples:

➜ du -sh * | sort -h
4.0K CONTRIBUTING.md
4.0K LICENSE
4.0K Makefile
4.0K pyproject.toml
4.0K SECURITY.md
8.0K code_of_conduct.md
8.0K README.md
44K branding
68K linkrot
1.7M tests
919M Random PDF Samples

Describe the solution you'd like

I suggest either storing the PDF files in a separate repo or in a cloud provider's bucket.

Describe alternatives you've considered

Not available.

Additional context

That's pretty much it. I am currently working on this issue.
documentation enhancement hacktoberfest
opened by wiseaidev 4
Add Link Check Results to CLI Output

Right now, if you use the -o argument to export the results to a text file, the document metadata and the list of links are the only components listed. I would like to add the results of the link check to this output as well.
enhancement help wanted good first issue hacktoberfest python hacktoberfest-accepted

opened by marshalmiller 4
Displays Page Number Wrong in Results

When it returns the results of links that it tests, it gives a list of the links, along with a page number. The page number would appear to be the page the link was found on but it is actually just the total number of pages in the PDF. It would be extremely helpful if we could get it to display the correct page number.
bug enhancement help wanted hacktoberfest python hacktoberfest-accepted

opened by marshalmiller 4
Update Tests

The tests written for this repo were developed during the very early stages of this project. I don't think they are a great representation of where the project is now. I'd love to have them updated to be more rigorous and keep the quality of the project high.
enhancement help wanted good first issue hacktoberfest python

opened by marshalmiller 2
Update ReadMe to Include Changes from Hacktoberfest.

We have had a lot of great improvements already during Hacktoberfest. I will update the ReadMe with all the changes once the event is over, if not before.
documentation enhancement hacktoberfest

opened by marshalmiller 3
Consider Replacing Threadpool with Redis

Given the performance and timeout issues with the flask app, I am wondering if I should be replacing the current thread pool with a Redis model, as suggested by other forums and Heroku.

https://python-rq.org/
enhancement help wanted dependencies hacktoberfest python

opened by marshalmiller 2

Releases(3.9.5)

3.9.5(Oct 3, 2022)
What's Changed

Add test cases for detecting embedded URLs by @marwansalem in https://github.com/marshalmiller/linkrot/pull/161

rm Random PDF Samples by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/163

updated .gitignore, added mega.py, rm pdfs, cleanups by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/164

cleanup python 2 syntax by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/165

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.4...3.9.5
3.9.4(Oct 2, 2022)
What's Changed

Migrating from setup.py to pyproject.toml by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/149

Upgrade to PyProject by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/156

add missing dependencies by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/158

add missing cli entry point by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/157

handle UnicodeDecode exception by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/159

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.3...3.9.4
3.9.3(Oct 2, 2022)
What's Changed

Resolved Add Link Archiving #102 by @mailtodanish in https://github.com/marshalmiller/linkrot/pull/150

add etree xml_parser to ignore invalid tags by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/155

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.2...3.9.3
3.9.2(Oct 1, 2022)
What's Changed

Fix the page number error, in the link checker by @ajratnam in https://github.com/marshalmiller/linkrot/pull/147

Add Link Check Results to CLI Output #120 by @mailtodanish in https://github.com/marshalmiller/linkrot/pull/145

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.1...3.9.2
3.9.1(Oct 1, 2022)
What's Changed

Bump mypy from 0.971 to 0.981 by @dependabot in https://github.com/marshalmiller/linkrot/pull/142

Bump coverage from 6.4.4 to 6.5.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/143

Resolved Add DOIs to References Summary #128 by @mailtodanish in https://github.com/marshalmiller/linkrot/pull/144

Remove numpy import by @ajratnam in https://github.com/marshalmiller/linkrot/pull/146

New Contributors

@mailtodanish made their first contribution in https://github.com/marshalmiller/linkrot/pull/144

@ajratnam made their first contribution in https://github.com/marshalmiller/linkrot/pull/146

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9...3.9.1
3.9(Sep 25, 2022)
What's Changed

Bump flake8 from 5.0.3 to 5.0.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/131

Bump coverage from 6.4.2 to 6.4.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/132

Bump numpy from 1.23.1 to 1.23.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/133

Bump coverage from 6.4.3 to 6.4.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/134

Bump pylint from 2.14.5 to 2.15.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/135

Bump black from 22.6.0 to 22.8.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/136

Bump pytest from 7.1.2 to 7.1.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/137

Bump pylint from 2.15.0 to 2.15.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/138

Bump numpy from 1.23.2 to 1.23.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/139

Bump pylint from 2.15.2 to 2.15.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/141

Resolve issue130 by @westofwest in https://github.com/marshalmiller/linkrot/pull/140

New Contributors

@westofwest made their first contribution in https://github.com/marshalmiller/linkrot/pull/140

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.8.8...3.9
3.8.8(Aug 2, 2022)

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.8.7...3.8.8
3.8.5(Aug 2, 2022)
What's Changed

Bump flake8 from 5.0.1 to 5.0.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/129

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.8.4...3.8.5
3.5(Jun 1, 2022)
What's Changed

Bump mypy from 0.910 to 0.920 by @dependabot in https://github.com/marshalmiller/linkrot/pull/71

Bump mypy from 0.920 to 0.930 by @dependabot in https://github.com/marshalmiller/linkrot/pull/73

Bump mypy from 0.930 to 0.931 by @dependabot in https://github.com/marshalmiller/linkrot/pull/75

Bump mccabe from 0.6.1 to 0.7.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/76

Bump coverage from 6.2 to 6.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/77

Bump black from 21.12b0 to 22.1.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/78

Bump coverage from 6.3 to 6.3.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/79

Bump pytest from 6.2.5 to 7.0.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/80

Bump pytest from 7.0.0 to 7.0.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/81

Bump coverage from 6.3.1 to 6.3.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/82

Bump pytest from 7.0.1 to 7.1.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/84

Bump mypy from 0.931 to 0.940 by @dependabot in https://github.com/marshalmiller/linkrot/pull/83

Bump mypy from 0.940 to 0.941 by @dependabot in https://github.com/marshalmiller/linkrot/pull/85

Bump pytest from 7.1.0 to 7.1.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/86

Bump pdfminer-six from 20211012 to 20220319 by @dependabot in https://github.com/marshalmiller/linkrot/pull/87

Bump mypy from 0.941 to 0.942 by @dependabot in https://github.com/marshalmiller/linkrot/pull/88

Bump pylint from 2.12.2 to 2.13.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/89

Bump pylint from 2.13.0 to 2.13.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/90

Bump black from 22.1.0 to 22.3.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/91

Bump pylint from 2.13.2 to 2.13.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/92

Bump pylint from 2.13.3 to 2.13.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/93

Bump pylint from 2.13.4 to 2.13.5 by @dependabot in https://github.com/marshalmiller/linkrot/pull/94

Bump pylint from 2.13.5 to 2.13.7 by @dependabot in https://github.com/marshalmiller/linkrot/pull/95

Bump pytest from 7.1.1 to 7.1.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/96

Bump mypy from 0.942 to 0.950 by @dependabot in https://github.com/marshalmiller/linkrot/pull/97

Bump pylint from 2.13.7 to 2.13.8 by @dependabot in https://github.com/marshalmiller/linkrot/pull/98

Bump pdfminer-six from 20220319 to 20220506 by @dependabot in https://github.com/marshalmiller/linkrot/pull/99

Bump coverage from 6.3.2 to 6.3.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/100

Bump pylint from 2.13.8 to 2.13.9 by @dependabot in https://github.com/marshalmiller/linkrot/pull/101

Bump coverage from 6.3.3 to 6.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/103

Bump pdfminer-six from 20220506 to 20220524 by @dependabot in https://github.com/marshalmiller/linkrot/pull/104

Bump mypy from 0.950 to 0.960 by @dependabot in https://github.com/marshalmiller/linkrot/pull/105

A fix for: Exclude Email Addresses #106 by @marwansalem in https://github.com/marshalmiller/linkrot/pull/107

New Contributors

@marwansalem made their first contribution in https://github.com/marshalmiller/linkrot/pull/107

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.4...3.5
3.4(Dec 11, 2021)
What's Changed

Added documentation for library by @aditirao7 in https://github.com/marshalmiller/linkrot/pull/41

fix(downloader.py): change string comparison to use regex by @sousatg in https://github.com/marshalmiller/linkrot/pull/42

Bump flake8 from 4.0.0 to 4.0.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/43

Bump coverage from 6.0.1 to 6.0.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/44

Bump pdfminer-six from 20201018 to 20211012 by @dependabot in https://github.com/marshalmiller/linkrot/pull/46

Bring up to date by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/47

Replace pagenos with a safe default value by @alanyee in https://github.com/marshalmiller/linkrot/pull/48

Staging to Main 10-17-2021 by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/49

Start testing for Python 3.10 by @alanyee in https://github.com/marshalmiller/linkrot/pull/50

Checking the rdftree before parsing the metadata #45 by @rosdyana in https://github.com/marshalmiller/linkrot/pull/51

Staging by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/52

Bump black from 21.9b0 to 21.10b0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/55

Bump coverage from 6.0.2 to 6.1.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/54

Add comments to colorprint.py by @vacom13 in https://github.com/marshalmiller/linkrot/pull/56

Bump coverage from 6.1.1 to 6.1.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/57

Bump black from 21.10b0 to 21.11b0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/58

Add Comments to cli.py by @vacom13 in https://github.com/marshalmiller/linkrot/pull/60

Bump black from 21.11b0 to 21.11b1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/59

Bump pylint from 2.11.1 to 2.12.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/61

Bump coverage from 6.1.2 to 6.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/63

Bump black from 21.11b1 to 21.12b0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/67

Bump pylint from 2.12.1 to 2.12.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/66

New Contributors

@sousatg made their first contribution in https://github.com/marshalmiller/linkrot/pull/42

@alanyee made their first contribution in https://github.com/marshalmiller/linkrot/pull/48

@rosdyana made their first contribution in https://github.com/marshalmiller/linkrot/pull/51

@vacom13 made their first contribution in https://github.com/marshalmiller/linkrot/pull/56

Full Changelog: https://github.com/marshalmiller/linkrot/compare/2.1.1...3.4
2.3(Oct 24, 2021)
What's Changed

Added documentation for library by @aditirao7 in https://github.com/marshalmiller/linkrot/pull/41

fix(downloader.py): change string comparison to use regex by @sousatg in https://github.com/marshalmiller/linkrot/pull/42

Bump flake8 from 4.0.0 to 4.0.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/43

Bump coverage from 6.0.1 to 6.0.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/44

Bump pdfminer-six from 20201018 to 20211012 by @dependabot in https://github.com/marshalmiller/linkrot/pull/46

Bring up to date by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/47

Replace pagenos with a safe default value by @alanyee in https://github.com/marshalmiller/linkrot/pull/48

Staging to Main 10-17-2021 by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/49

Start testing for Python 3.10 by @alanyee in https://github.com/marshalmiller/linkrot/pull/50

Checking the rdftree before parsing the metadata #45 by @rosdyana in https://github.com/marshalmiller/linkrot/pull/51

Staging by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/52

New Contributors

@sousatg made their first contribution in https://github.com/marshalmiller/linkrot/pull/42

@alanyee made their first contribution in https://github.com/marshalmiller/linkrot/pull/48

@rosdyana made their first contribution in https://github.com/marshalmiller/linkrot/pull/51

Full Changelog: https://github.com/marshalmiller/linkrot/compare/2.1.1...2.3

Owner

Marshal Miller
