
Quick and Dirty OCR of Facebook Papers

Gizmodo has been working through the Facebook Papers and releasing the docs that they process and review.

As luck would have it, I had some ugly but functional code lying around that would do a first pass of OCR on these docs. The initial release used that code, but based on some helpful feedback I was able to make some improvements; the updated code is in the pdf_scan.py script. As always, suggestions for improvement are welcome.
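
For anyone curious about the general approach, here is a minimal sketch of a first pass like the one pdf_scan.py takes. It assumes the pdf2image and pytesseract libraries (plus local poppler and Tesseract installs); the directory names are placeholders, and this is an illustration of the technique rather than the repository's actual script.

```python
# Illustrative first-pass OCR, NOT the actual pdf_scan.py:
# render each PDF page to an image, run Tesseract on it, and save the text.
from pathlib import Path

import pytesseract                        # assumes Tesseract is installed locally
from pdf2image import convert_from_path   # assumes poppler is installed


def ocr_pdf(pdf_path: Path, out_dir: Path) -> None:
    """Render each page of a PDF and OCR it into a single text file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(str(pdf_path), dpi=300)
    text = "\n".join(pytesseract.image_to_string(page) for page in pages)
    (out_dir / f"{pdf_path.stem}.txt").write_text(text)


if __name__ == "__main__":
    for pdf in Path("pdfs").glob("*.pdf"):      # placeholder input directory
        ocr_pdf(pdf, Path("processed_text"))
```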

The Facebook Papers are especially challenging from an OCR perspective because many of them are photographs of a screen, so the base image quality isn't especially good. Because of this, not every document can be processed cleanly, and the documents that do get processed contain some cruft.

With that said, the text pulled from these files makes it much easier to search a large amount of material for keywords.

Where are the cleaned-up docs?

The processed_text directory contains the cleaned files. Specifically, this directory contains:

  • cleaned up pdfs
  • the text extracted from the pdfs
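
Since the point of the text files is to make keyword searching easy, here is a small example of sweeping the extracted text for terms of interest. The processed_text path follows the description above; the keyword list and the *.txt layout are assumptions for the sake of illustration.

```python
# Example keyword sweep over the extracted text; the keywords are placeholders.
from pathlib import Path

KEYWORDS = ["integrity", "misinformation"]

for txt in Path("processed_text").glob("**/*.txt"):
    content = txt.read_text(errors="ignore").lower()
    hits = [kw for kw in KEYWORDS if kw in content]
    if hits:
        print(f"{txt}: {', '.join(hits)}")
```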

Image cleanup

The text extraction quality is decent, but there are still artifacts and cruft.
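
For context, cleanup on photos of screens usually means something along these lines: grayscale, contrast stretch, and binarization before OCR. This is a rough sketch of that idea using Pillow; the specific steps and threshold are assumptions, not necessarily the pipeline this repo uses.

```python
from PIL import Image, ImageOps


def clean_page(image: Image.Image, threshold: int = 160) -> Image.Image:
    """Illustrative cleanup: grayscale, stretch contrast, then binarize.
    The threshold value is a guess and would need tuning per document."""
    gray = ImageOps.autocontrast(image.convert("L"))
    return gray.point(lambda px: 255 if px > threshold else 0)
```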

Other (Better) Options

This OCR should be seen as a first step. Text files are generally a decent starting point because they allow for a wide range of follow-on analysis.

And other, better options exist. For a more comprehensive, self-contained analysis, those options will almost certainly be a better choice.

Want to help?

If you want to collaborate on this project, let me know!
