Camelot is a Python library that can help you extract tables from PDFs!

Last update: Jan 03, 2023

Camelot: PDF Table Extraction for Humans

Note: You can also check out Excalibur, the web interface to Camelot!

Here's how you can extract tables from PDFs. You can check out the PDF used in this example here.

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html, markdown, sqlite
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html, to_markdown, to_sqlite
>>> tables[0].df # get a pandas DataFrame!




Cycle Name  KI (1/km)  Distance (mi)  Percent Fuel Savings
                                      Improved Speed  Decreased Accel  Eliminate Stops  Decreased Idle
2012_2      3.30       1.3            5.9%            9.5%             29.2%            17.4%
2145_1      0.68       11.2           2.4%            0.1%             9.5%             2.7%
4234_1      0.59       58.7           8.5%            1.3%             8.5%             3.3%
2032_2      0.17       57.8           21.7%           0.3%             2.7%             1.2%
4171_1      0.07       173.9          58.1%           1.6%             2.1%             0.5%

Camelot also comes packaged with a command-line interface!
Note: Camelot only works with text-based PDFs and not scanned documents. (As Tabula explains, "If you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based".)
You can check out some frequently asked questions here.

Why Camelot?

Configurability: Camelot gives you control over the table extraction process with tweakable settings.
Metrics: You can discard bad tables based on metrics like accuracy and whitespace, without having to manually look at each table.
Output: Each table is extracted into a pandas DataFrame, which integrates seamlessly into ETL and data analysis workflows. You can also export tables to multiple formats, including CSV, JSON, Excel, HTML, Markdown, and SQLite.
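
As a rough illustration of the Metrics point, here is a minimal sketch of discarding tables by their parsing report. The helper name filter_tables and both threshold values are assumptions made for this example, not part of Camelot; only the 'accuracy' and 'whitespace' keys come from the parsing_report shown earlier. The stand-in objects (the second one carries made-up numbers) let the sketch run without a PDF.

```python
from types import SimpleNamespace


def filter_tables(tables, min_accuracy=90.0, max_whitespace=30.0):
    """Keep tables whose parsing_report meets both (hypothetical) thresholds."""
    kept = []
    for table in tables:
        report = table.parsing_report
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace:
            kept.append(table)
    return kept


# Stand-in objects that only mimic the parsing_report attribute.
# The first report is the one shown above; the second is invented for contrast.
sample = [
    SimpleNamespace(parsing_report={"accuracy": 99.02, "whitespace": 12.24, "order": 1, "page": 1}),
    SimpleNamespace(parsing_report={"accuracy": 41.5, "whitespace": 63.0, "order": 2, "page": 1}),
]

good = filter_tables(sample)
print([t.parsing_report["order"] for t in good])  # prints [1]
```

With Camelot installed, the same function should accept the result of camelot.read_pdf('foo.pdf') in place of sample, assuming the returned list can be iterated table by table.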

See comparison with similar libraries and tools.

Support the development
If Camelot has helped you, please consider supporting its development with a one-time or monthly donation on OpenCollective.

Installation

Using conda
The easiest way to install Camelot is with conda, which is a package manager and environment management system for the Anaconda distribution.
$ conda install -c conda-forge camelot-py


Using pip
After installing the dependencies (tk and ghostscript), you can also just use pip to install Camelot:
$ pip install "camelot-py[base]"


From the source code
After installing the dependencies, clone the repo using:
$ git clone https://www.github.com/camelot-dev/camelot

and install Camelot using pip:
$ cd camelot
$ pip install ".[base]"
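
After any of the installation routes above, a quick sanity check can be sketched as follows. The function name check_camelot_setup is hypothetical, and the sketch assumes a Unix-like system where the Ghostscript executable is called gs (it may be named differently on other platforms). It only checks that the module and the binary can be found, not that they work.

```python
import importlib.util
import shutil


def check_camelot_setup(module="camelot", gs_executable="gs"):
    """Report whether the module is importable and the executable is on PATH."""
    return {
        # find_spec returns None when the top-level module cannot be located.
        "camelot_importable": importlib.util.find_spec(module) is not None,
        # shutil.which returns None when no matching executable is on PATH.
        "ghostscript_on_path": shutil.which(gs_executable) is not None,
    }


if __name__ == "__main__":
    print(check_camelot_setup())
```

If either value is False, revisiting the dependency step for the chosen installation method is a reasonable first move.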


Documentation
The documentation is available at http://camelot-py.readthedocs.io/.

Wrappers

camelot-php provides a PHP wrapper on Camelot.


Contributing
The Contributor's Guide has detailed information about contributing issues, documentation, code, and tests.

Versioning
Camelot uses Semantic Versioning. For the available versions, see the tags on this repository. For the changelog, you can check out HISTORY.md.

License
This project is licensed under the MIT License; see the LICENSE file for details.
