Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Last update: Oct 28, 2021

Quick and Dirty OCR of Facebook Papers

Gizmodo has been working through the Facebook Papers and releasing the docs that they process and review.

As luck would have it, I had some ugly but functional code lying around that does a first-pass OCR of these docs. That code is in the pdf_to_image.py script. I'd welcome improvements to the code, especially to the image cleanup prior to OCR (lines 92-97, approx). I experimented with cleaning up the images via PIL and cv2, but the results were less accurate, almost certainly due to my lack of familiarity with either library.
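For anyone who wants to experiment with the cleanup step, here is a minimal sketch of one common pre-OCR technique: binarizing a grayscale image with Otsu's threshold. It is not the code in pdf_to_image.py, and it may or may not improve results on photos of screens. The function names otsu_threshold and binarize are hypothetical, and the image is assumed to be a flat list of 0-255 grayscale values so the example stays dependency-free; PIL and cv2 both provide their own grayscale and threshold operations that would replace this in practice.

```python
def otsu_threshold(pixels):
    """Return the threshold (0-255) that maximizes between-class variance.

    `pixels` is assumed to be a flat iterable of integer grayscale
    values in the range 0-255 (an assumption made for this sketch).
    """
    pixels = list(pixels)
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of their intensities
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor.
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [255 if p > threshold else 0 for p in pixels]
```

A typical use would be to compute the threshold once per page image, binarize, and hand the result to the OCR engine, then compare the output against the uncleaned run to see whether accuracy actually improved.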

These Facebook Papers are especially challenging from an OCR perspective because many of them are pictures taken of a screen, so the base image quality isn't especially good. Because of this, not every document can be processed cleanly, and the documents that do get processed have some cruft in them.

With that said, the text pulled from these files simplifies the process of parsing through a large amount of data for keywords.
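As an illustration of that kind of keyword pass, here is a rough sketch that scans a folder of OCR output for a term. It assumes the text files are UTF-8 .txt files sitting in a single directory; the function name find_keyword and the directory layout are assumptions for this example, not something taken from this repo.

```python
import re
from pathlib import Path


def find_keyword(text_dir, keyword):
    """Return (file name, line number, line) for each line containing keyword.

    Matching is case-insensitive and literal (the keyword is escaped),
    which is forgiving of the inconsistent casing OCR tends to produce.
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        # errors="replace" keeps the scan going if OCR left odd bytes behind.
        text = path.read_text(encoding="utf-8", errors="replace")
        for line_no, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((path.name, line_no, line.strip()))
    return hits
```

Calling find_keyword("ocr_text", "misinformation") (with "ocr_text" standing in for wherever the text files live) returns a list of matches that can be printed or written out for review. Because OCR cruft can split or mangle words, a miss here does not prove the term is absent from the original document.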

Other (Better) Options

This OCR should be seen as a first step. Text files are generally a decent starting point because they allow for a wide range of follow-on analysis.

Other, better options also exist. For a comprehensive, self-contained analysis, those options will almost certainly be a better choice.

Want to help?

If you want to collaborate on this project, let me know!

Owner

Bill Fitzgerald