Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Last update: Oct 28, 2021

Quick and Dirty OCR of Facebook Papers

Gizmodo has been working through the Facebook Papers and releasing the docs that they process and review.

As luck would have it, I had some ugly but functional code lying around that does a first pass at OCR on these docs. That code is in the pdf_to_image.py script. I'd welcome improvements to the code, especially to the image cleanup prior to OCR (lines 92-97, approx). I experimented with cleaning up the images via PIL and cv2, but the results were less accurate, almost certainly due to my lack of familiarity with both libraries.
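
For anyone who wants a feel for the general shape of this kind of pipeline, here is a minimal sketch. To be clear, this is not the pdf_to_image.py script. It assumes the pdf2image and pytesseract libraries (plus the poppler and tesseract binaries they rely on), and the function names, the 300 DPI setting, and the grayscale/autocontrast cleanup are illustrative choices, not what the script actually does.

```python
from pathlib import Path


def text_path_for(pdf_path, out_dir):
    """Map a PDF path to the .txt path its OCR output would be written to."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def ocr_pdf(pdf_path, out_dir, dpi=300):
    """Render each PDF page to an image, OCR it, and save the combined text.

    Hypothetical helper for illustration only.
    """
    # Third-party imports are kept local so text_path_for works without them.
    from pdf2image import convert_from_path  # needs poppler installed
    import pytesseract  # needs the tesseract binary installed
    from PIL import ImageOps

    pages = convert_from_path(str(pdf_path), dpi=dpi)
    chunks = []
    for page in pages:
        # Assumed cleanup step: grayscale plus contrast stretch.
        gray = ImageOps.grayscale(page)
        cleaned = ImageOps.autocontrast(gray)
        chunks.append(pytesseract.image_to_string(cleaned))

    out_path = text_path_for(pdf_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("\n\n".join(chunks), encoding="utf-8")
    return out_path
```

Calling `ocr_pdf("some_doc.pdf", "text_out")` would write `text_out/some_doc.txt`. Whether the cleanup step helps or hurts will depend on the document, so it is worth comparing output with and without it.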

These Facebook Papers are especially challenging from an OCR perspective because many of them are photos taken of a screen, so the base image quality isn't especially good. Because of this, not every document can be processed cleanly, and the documents that do get processed have some cruft in them.
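
One rough way to trim some of that cruft after the fact is to drop lines that are mostly symbols. The sketch below is only an assumption about what might help: the function names and the 0.6 threshold are made up for illustration, and a filter like this will sometimes discard real content (tables, URLs), so treat it as a starting point.

```python
def looks_like_text(line, min_ratio=0.6):
    """Return True if enough of the line is letters, digits, or spaces."""
    stripped = line.strip()
    if not stripped:
        return True  # keep blank lines so paragraph breaks survive
    useful = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return useful / len(stripped) >= min_ratio


def drop_cruft(text, min_ratio=0.6):
    """Remove lines that look like OCR noise rather than readable text."""
    kept = [line for line in text.splitlines() if looks_like_text(line, min_ratio)]
    return "\n".join(kept)
```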

With that said, the text pulled from these files simplifies the process of parsing through a large amount of data for keywords.
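
As an example of that kind of keyword pass, here is a minimal sketch using only the Python standard library. It assumes the OCR output lives in a directory of .txt files; the function name and the directory layout are illustrative, not something this repo defines.

```python
from pathlib import Path


def find_keyword(text_dir, keyword):
    """Return (file name, line number, line) for each line containing keyword.

    Matching is case-insensitive. OCR output can contain odd bytes, so
    undecodable characters are replaced instead of raising an error.
    """
    needle = keyword.lower()
    hits = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((path.name, number, line.strip()))
    return hits
```

Because OCR errors can mangle words, an exact substring match like this will miss some hits, so searching for a few spelling variants of each keyword is probably worthwhile.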

Other (Better) Options

This OCR should be seen as a first step. Text files are generally a decent starting point because they allow for a wide range of follow-on analysis.

Other, better options also exist. For a comprehensive, self-contained analysis, they will almost certainly be a better choice.

Want to help?

If you want to collaborate on this project, let me know!

Owner

Bill Fitzgerald