Extract tables from a PDF and output the data in a JSON-like format

Overview

While developing an RPA project, I needed to extract table contents from PDF files while preserving the table structure. After several days of searching online, I could not find an open-source library that fully met the project's needs, so I ended up combining PyMuPDF and cv2 (OpenCV). PyMuPDF reads the PDF content (it also supports XPS files), while cv2 draws the lines found in the extracted content, computes the table contours, and finally works out which text belongs to which table cell. The project is fairly niche and the code is rather scattered, but I hope it helps anyone who happens to need it.

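Known open-source frameworks

Libraries I tried in this project:

  1. tabula-py: implemented in Java under the hood (see tabula-java). Its PDF table extraction is powerful, but it occasionally raised exceptions while the project was running.
  2. pdfplumber: very convenient to use, but it could not extract the tables in some PDFs.
  3. camelot: owing to my own limited skill, I ran into problems during pip installation and could not get it installed.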

Dependencies

python3
PyMuPDF==1.19.1
cv2==4.5.4

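Implementation

Because the existing open-source projects could not meet the project's requirements, I decided to extract the table information with a computer-vision approach. The rough pipeline is:

  1. Get the content of the current PDF page, such as text and coordinates.
  2. Using the width and height of the current page, create a cv2 Mat canvas of the same size.
  3. Get all the lines on the current page and draw them on the canvas.
  4. Use contour detection to find the bounding-rectangle coordinates of every table cell.
  5. Use the page's get_text_selection method to get the text of each cell.
  6. Extend the table lines until they reach the table edges, so that the number of rows and columns each cell occupies can be detected.
  7. Compute the mapping between text and cells, as well as the range each cell spans.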
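Steps 6 and 7 can be illustrated with a small sketch. It is not the project's actual code: it skips the cv2 drawing and assumes the cell rectangles from step 4 are already available as (x0, y0, x1, y1) tuples. Instead of extending lines on a canvas, it collects every cell edge into a shared grid, which has the same effect as extending each line across the whole table, and then counts how many grid intervals each cell covers. The names cluster, nearest_index, cell_spans and the tol parameter are all hypothetical.

```python
from bisect import bisect_left


def cluster(values, tol=2.0):
    """Merge coordinates that lie within `tol` of each other into one grid line."""
    lines = []
    for v in sorted(values):
        if not lines or v - lines[-1] > tol:
            lines.append(v)
    return lines


def nearest_index(lines, v):
    """Return the index of the grid line closest to coordinate v."""
    i = bisect_left(lines, v)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(lines)]
    return min(candidates, key=lambda j: abs(lines[j] - v))


def cell_spans(cells, tol=2.0):
    """cells: list of (x0, y0, x1, y1). Returns grid position and span per cell."""
    # Every cell edge becomes a grid line, as if it had been extended
    # across the full width or height of the table.
    xs = cluster([c[0] for c in cells] + [c[2] for c in cells], tol)
    ys = cluster([c[1] for c in cells] + [c[3] for c in cells], tol)
    result = []
    for x0, y0, x1, y1 in cells:
        c0, c1 = nearest_index(xs, x0), nearest_index(xs, x1)
        r0, r1 = nearest_index(ys, y0), nearest_index(ys, y1)
        result.append({"row": r0, "col": c0, "rowspan": r1 - r0, "colspan": c1 - c0})
    return result


# A header cell merged across two columns, with two ordinary cells below it.
example_cells = [(0, 0, 100, 20), (0, 20, 50, 40), (50, 20, 100, 40)]
example_spans = cell_spans(example_cells)
for span in example_spans:
    print(span)
```

The tolerance exists because coordinates taken from a rendered page rarely line up exactly; a suitable value depends on the line width and scale of the canvas.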

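Notes

A few details deserve attention:

  1. You can create a single-channel canvas, which avoids the grayscale-conversion and binarization steps.
  2. Use black lines on a white background and flood-fill from the edges, which avoids contour analysis.
  3. If the table lines are double solid lines, morphological opening and closing can be used to remove the double lines.
  4. When using get_text_selection, check whether the text extends beyond the border of the cell box. If it does, only the text inside the border is returned. For tables like this, you can extract the text by checking whether the center of the text region lies inside the cell.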
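The center-point check from note 4 might look like the sketch below. It is only an illustration under assumptions: text blocks are taken to be (x0, y0, x1, y1, text) tuples, a simplified shape chosen for this example, and the function names are hypothetical rather than part of PyMuPDF or this project.

```python
def center(rect):
    """Center point of an (x0, y0, x1, y1) rectangle."""
    x0, y0, x1, y1 = rect
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def contains(cell, point):
    """True if the point lies inside (or on the border of) the cell."""
    x0, y0, x1, y1 = cell
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1


def assign_text(cells, blocks):
    """Map each cell index to the texts whose center falls inside that cell.

    cells:  list of (x0, y0, x1, y1)
    blocks: list of (x0, y0, x1, y1, text)  (assumed input shape)
    """
    result = {i: [] for i in range(len(cells))}
    for *rect, text in blocks:
        point = center(rect)
        for i, cell in enumerate(cells):
            if contains(cell, point):
                result[i].append(text)
                break  # a block belongs to at most one cell
    return result


# The first block overflows the right border of cell 0, but its center
# is still inside cell 0, so the whole text is kept with that cell.
example_cells = [(0, 0, 50, 20), (50, 0, 100, 20)]
example_blocks = [(5, 5, 60, 15, "long text"), (55, 5, 95, 15, "B")]
example_result = assign_text(example_cells, example_blocks)
print(example_result)
```

Compared with clipping to the cell rectangle, this keeps overflowing text whole, at the cost of misplacing a block whose center drifts into a neighboring cell.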

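Step-by-step illustrations

Load the PDF

[Image: screenshot of the original PDF]

Draw the table lines

[Image: table contours]

Crop the table

[Image: cropped table]

Extend the lines

[Image: extended lines]

Tables that fail to parse

[Images: three examples of irregularly formatted tables]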
