Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU)

Last update: Jan 06, 2023

DocFormer - PyTorch

DocFormer is a multi-modal, transformer-based architecture for Visual Document Understanding (VDU). It is pre-trained in an unsupervised fashion on carefully designed tasks that encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them with a novel multi-modal self-attention layer. It also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text tokens with visual tokens and vice versa. DocFormer is evaluated on four different datasets, each with strong baselines, and achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in number of parameters).

The official implementation was not released by the authors.

Install

Importing pytesseract can fail if it is not set up. To avoid that, install the Python package and the Tesseract OCR engine first:

pip install pytesseract
sudo apt install tesseract-ocr

Then install DocFormer:

pip install git+https://github.com/shabie/docformer
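
If the import problem persists after these steps, a quick check along the following lines can show which piece is missing. This is a minimal sketch using only the Python standard library; `check_ocr_setup` is a hypothetical helper name and is not part of docformer or pytesseract.

```python
import importlib.util
import shutil


def check_ocr_setup():
    """Report whether the pytesseract package and the tesseract binary are visible.

    Hypothetical helper for illustration; it only inspects the environment
    and does not run any OCR.
    """
    return {
        # True if Python can locate the pytesseract package
        "pytesseract_installed": importlib.util.find_spec("pytesseract") is not None,
        # True if a `tesseract` executable is found on PATH
        "tesseract_on_path": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_ocr_setup().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

If `tesseract_on_path` is reported as missing, the `sudo apt install tesseract-ocr` step most likely did not complete or the binary is installed somewhere outside PATH.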

Usage

from docformer import modeling, dataset
from transformers import BertTokenizerFast


config = {
  "coordinate_size": 96,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "image_feature_pool_shape": [7, 7, 256],
  "intermediate_ff_size_factor": 4,
  "max_2d_position_embeddings": 1000,
  "max_position_embeddings": 512,
  "max_relative_positions": 8,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "shape_size": 96,
  "vocab_size": 30522,
  "layer_norm_eps": 1e-12,
}

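# Path to a scanned document image
fp = "filepath/to/the/image.tif"

# Tokenizer whose vocabulary matches vocab_size=30522 in the config
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
# Build the model inputs from the image
encoding = dataset.create_features(fp, tokenizer)

feature_extractor = modeling.ExtractFeatures(config)
docformer = modeling.DocFormerEncoder(config)
# Visual and text features (v_bar, t_bar) and their spatial counterparts (v_bar_s, t_bar_s)
v_bar, t_bar, v_bar_s, t_bar_s = feature_extractor(encoding)
output = docformer(v_bar, t_bar, v_bar_s, t_bar_s)  # shape (1, 512, 768)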
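
The numbers in `config` are tied together, and the sketch below spells out a few of those relationships. It rests on assumptions that are not confirmed from the docformer source: that the attention heads split `hidden_size` evenly, that `intermediate_ff_size_factor` multiplies `hidden_size` to give the feed-forward width, and that the output is padded to `max_position_embeddings` tokens, which would match the `(1, 512, 768)` shape noted above. `derived_sizes` is a hypothetical helper, not part of the package.

```python
def derived_sizes(config, batch_size=1):
    """Compute sizes implied by a DocFormer-style config (illustrative only)."""
    hidden = config["hidden_size"]
    heads = config["num_attention_heads"]
    # Assumption: each head gets an equal slice of the hidden dimension
    if hidden % heads != 0:
        raise ValueError(
            f"hidden_size={hidden} is not divisible by num_attention_heads={heads}"
        )
    return {
        "head_dim": hidden // heads,
        # Assumption: the factor scales hidden_size to the feed-forward width
        "feed_forward_size": hidden * config["intermediate_ff_size_factor"],
        # Assumption: sequences are padded to max_position_embeddings tokens
        "output_shape": (batch_size, config["max_position_embeddings"], hidden),
    }


# Subset of the config shown above; only the keys this helper reads
example_config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "intermediate_ff_size_factor": 4,
    "max_position_embeddings": 512,
}
sizes = derived_sizes(example_config)
print(sizes)
```

With the values used here, this gives a head dimension of 64 and a feed-forward width of 3072. A check like this can be useful before changing `hidden_size` or `num_attention_heads`, since a mismatch would otherwise only surface when the model is built.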

License

MIT

Citations

@InProceedings{Appalaraju_2021_ICCV,
    author    = {Appalaraju, Srikar and Jasani, Bhavan and Kota, Bhargava Urala and Xie, Yusheng and Manmatha, R.},
    title     = {DocFormer: End-to-End Transformer for Document Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {993-1003}
}
