Data, model training, and evaluation code for "PubTables-1M: Towards a universal dataset and metrics for training and evaluating table extraction models".

Last update: Jan 04, 2023

Overview

PubTables-1M

This repository contains training and evaluation code for the paper "PubTables-1M: Towards a universal dataset and metrics for training and evaluating table extraction models".

The goal of PubTables-1M is to create a large, detailed, high-quality dataset for training and evaluating a wide variety of models for the tasks of table detection, table structure recognition, and functional analysis. It contains:

460,589 annotated document pages containing tables for table detection.
947,642 fully annotated tables including text content and complete location (bounding box) information for table structure recognition and functional analysis.
Full bounding boxes in both image and PDF coordinates for all table rows, columns, and cells (including blank cells), as well as other annotated structures such as column headers and projected row headers.
Rendered images of all tables and pages.
Bounding boxes and text for all words appearing in each table and page image.
Additional cell properties not used in the current model training.

Additionally, cells in the headers are canonicalized and we implement multiple quality control steps to ensure the annotations are as free of noise as possible. For more details, please see our paper.
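As an illustration of how the bounding-box annotations might be consumed, here is a minimal Python sketch that reads one annotation. It assumes the standard PASCAL VOC XML layout (object elements with a name and a bndbox holding xmin, ymin, xmax, ymax), which the archive names suggest but this README does not spell out. The function name read_voc_boxes, the label "table", and the sample XML are hypothetical and not taken from the dataset.

```python
import xml.etree.ElementTree as ET


def read_voc_boxes(xml_text):
    """Return a list of (label, [xmin, ymin, xmax, ymax]) from PASCAL VOC XML.

    Assumes the standard PASCAL VOC layout. Coordinates are parsed as floats
    so that both integer and fractional values are accepted.
    """
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        if label is None or bndbox is None:
            continue  # skip malformed entries rather than guessing
        coords = [
            float(bndbox.findtext(tag))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        ]
        boxes.append((label, coords))
    return boxes


# Hypothetical sample, not taken from the dataset.
SAMPLE_XML = """
<annotation>
  <object>
    <name>table</name>
    <bndbox><xmin>10.5</xmin><ymin>20</ymin><xmax>300</xmax><ymax>150.25</ymax></bndbox>
  </object>
</annotation>
"""

if __name__ == "__main__":
    for label, coords in read_voc_boxes(SAMPLE_XML):
        print(label, coords)
```

To read a file from disk, pass its contents, for example read_voc_boxes(open(path).read()). Check the class names and coordinate conventions against the actual files before relying on this.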

News

10/21/2021: The full PubTables-1M dataset has been officially released on Microsoft Research Open Data.

Getting the Data

PubTables-1M is available for download from Microsoft Research Open Data.

It comes in 5 tar.gz files:

PubTables-1M-Image_Page_Detection_PASCAL_VOC.tar.gz
PubTables-1M-Image_Page_Words_JSON.tar.gz
PubTables-1M-Image_Table_Structure_PASCAL_VOC.tar.gz
PubTables-1M-Image_Table_Words_JSON.tar.gz
PubTables-1M-PDF_Annotations_JSON.tar.gz

To download from the command line:

Visit the dataset home page with a web browser and click Download in the top left corner. This will create a link to download the dataset from Azure with a unique access token for you that looks like https://msropendataset01.blob.core.windows.net/pubtables1m?[SAS_TOKEN_HERE].
You can then use the command line tool azcopy to download all of the files with the following command:

azcopy copy "https://msropendataset01.blob.core.windows.net/pubtables1m?[SAS_TOKEN_HERE]" "/path/to/your/download/folder/" --recursive

Then unzip each of the archives from the command line using:

tar -xzvf yourfile.tar.gz
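If you prefer to unpack all of the archives in one pass, the following Python sketch does the same job as running the tar command once per file. It uses only the standard library; the function name extract_archives and the idea of extracting everything into a single output folder are assumptions for illustration, not part of this repository.

```python
import tarfile
from pathlib import Path


def extract_archives(download_dir, output_dir):
    """Extract every .tar.gz archive found in download_dir into output_dir.

    Returns the names of the archives that were extracted, in sorted order.
    Only use this on archives from a source you trust.
    """
    download_dir = Path(download_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(download_dir.glob("*.tar.gz")):
        # "r:gz" opens a gzip-compressed tar archive for reading.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(output_dir)
        extracted.append(archive.name)
    return extracted


if __name__ == "__main__":
    # Placeholder paths; replace with your own folders.
    for name in extract_archives("/path/to/your/download/folder", "/path/to/extracted"):
        print("extracted", name)
```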

Code Installation

Create a conda environment from environment.yml and activate it:

conda env create -f environment.yml
conda activate tables-detr

Model Training

The code trains models for 2 different sets of table extraction tasks:

Table Detection
Table Structure Recognition + Functional Analysis

For a detailed description of these tasks and the models, please refer to the paper.

Sample training commands:

cd src
python main.py --data_root_dir /path/to/detection --data_type detection
python main.py --data_root_dir /path/to/structure --data_type structure

GriTS metric evaluation

Once you have trained a model, you can evaluate it with the GriTS metrics proposed in the paper. The command below uses the model trained in the previous step and calculates all 4 GriTS variations presented in the paper; you can choose which variation to report depending on your model. The table words directory is not required for every variation, but we pass it here because PubTables-1M includes this information.

python main.py --data_root_dir /path/to/structure --model_load_path /path/to/model --table_words_dir /path/to/table/words --mode grits
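For scripted evaluation runs, the command above can be assembled in Python. This is a sketch under stated assumptions: the helper names build_grits_command and run_grits are hypothetical, the flags are exactly those shown in the command above, and the script is assumed to be run from the src directory as in the training commands. The table words directory is treated as optional, reflecting the note that not every GriTS variation needs it.

```python
import subprocess
import sys


def build_grits_command(data_root_dir, model_load_path, table_words_dir=None):
    """Build the argument list for the GriTS evaluation command.

    table_words_dir is optional here because the text notes that it is not
    required for every GriTS variation.
    """
    command = [
        sys.executable, "main.py",
        "--data_root_dir", str(data_root_dir),
        "--model_load_path", str(model_load_path),
        "--mode", "grits",
    ]
    if table_words_dir is not None:
        command += ["--table_words_dir", str(table_words_dir)]
    return command


def run_grits(data_root_dir, model_load_path, table_words_dir=None, src_dir="src"):
    """Run the evaluation from src_dir (assumed to hold main.py).

    Raises subprocess.CalledProcessError if the script exits non-zero.
    """
    command = build_grits_command(data_root_dir, model_load_path, table_words_dir)
    subprocess.run(command, cwd=src_dir, check=True)


if __name__ == "__main__":
    # Placeholder paths; this only prints the command, it does not run it.
    print(build_grits_command("/path/to/structure", "/path/to/model",
                              "/path/to/table/words"))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.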

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

Owner

Microsoft
