A modern pure-Python library for reading PDF files

Last update: Apr 06, 2022

Related tags

Overview

pdf

A modern pure-Python library for reading PDF files.

The goal is to have a modern interface to handle PDF files which is consistent with itself and typical Python syntax.

The library should be Python-only (hence no C-extensions), but allow to change the backend. Similar in concept to matplotlib backends and Keras backends.

The default backend could be PyPDF2.

Possible other backends could be PyMuPDF (using MuPDF) and PikePDF (using QPDF).

WARNING: This library is UNSTABLE at the moment! Expect many changes!

Installation

pip install pdffile

Usage

Retrieve Metadata

>>> import pdf

>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> len(doc)
1

>>> doc.metadata
Metadata(
    title=None,
    producer='pdfTeX-1.40.23',
    creator='TeX',
    creation_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
    modification_date=datetime.datetime(2022, 4, 3, 18, 5, 42)
    other={
         '/CreationDate': "D:20220403180542+02'00'",
         '/ModDate': "D:20220403180542+02'00'",
         '/Trapped': '/False',
         '/PTEX.Fullbanner': 'This is pdfTeX, V...'})

Encrypted PDFs

If you have an encrypted PDF, just provide the key:

doc = pdf.PdfFile(pdf_path, password=password)

All following operations work just as described.

Get Outline

>>> import pdf
>>> doc = pdf.PdfFile(pdf_path, password=password)
>>> doc.outline
[
    Links(page=5, text='1 Header'),
    Links(page=5, text='1.1 A section'),
    Links(page=9, text='2 Foobar'),
    Links(page=108, text='References')
]

Extract Text

>>> import pdf
>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> doc[0]
<pdf.PdfPage object at 0x7f72d2b04100>
>>> doc[0].text
'Loremipsumdolorsitamet,consetetursadipscingelitr,seddiamnonumyeirmod\ntemporinviduntutlaboreetdoloremagnaaliquyamerat,seddiamvoluptua.Atvero\neosetaccusametjustoduodoloresetearebum.Stetclitakasdgubergren,noseataki-\nmatasanctusestLoremipsumdolorsitamet.Loremipsumdolorsitamet,consetetur\nsadipscingelitr,seddiamnonumyeirmodtemporinviduntutlaboreetdoloremagna\naliquyamerat,seddiamvoluptua.Atveroeosetaccusametjustoduodoloresetea\nrebum.Stetclitakasdgubergren,noseatakimatasanctusestLoremipsumdolorsit\namet.\n1\n'

Alternatively, you can use doc.text to get the text of all pages.

A modern pure-Python library for reading PDF files

Related tags

Overview

pdf

Installation

Usage

Retrieve Metadata

Encrypted PDFs

Get Outline

Extract Text

Owner

This repository contains code to train and render Mixture of Volumetric Primitives (MVP) models

SmoothGrad implementation in PyTorch

Code for the paper "Generative design of breakwaters usign deep convolutional neural network as a surrogate model"

BoxInst: High-Performance Instance Segmentation with Box Annotations

Neuralnetwork - Basic Multilayer Perceptron Neural Network for deep learning

PyTorch implementation for "Mining Latent Structures with Contrastive Modality Fusion for Multimedia Recommendation"

Official PaddlePaddle implementation of Paint Transformer

1st ranked 'driver careless behavior detection' for AI Online Competition 2021, hosted by MSIT Korea.

Space Invaders For Python

UltraPose: Synthesizing Dense Pose with 1 Billion Points by Human-body Decoupling 3D Model

Implementation of DropLoss for Long-Tail Instance Segmentation in Pytorch

这是一个deeplabv3-plus-pytorch的源码，可以用于训练自己的模型。

Implementation of Barlow Twins paper

Blind Image Super-resolution with Elaborate Degradation Modeling on Noise and Kernel

The King is Naked: on the Notion of Robustness for Natural Language Processing

Consecutive-Subsequence - Simple software to calculate susequence with highest sum

Meaningful titles for tabs and PDF downloads! Also supports tab search.

Count GitHub Stars ⭐

This's an implementation of deepmind Visual Interaction Networks paper using pytorch

EDPN: Enhanced Deep Pyramid Network for Blurry Image Restoration