Convert tables stored as images to an usable .csv file

Overview

Convert an image of numbers to a .csv file

This Python program aims to convert images of array numbers to corresponding .csv files. It uses OpenCV for Python to process the given image and Tesseract for number recognition.

Output Example

The repository includes:

  • the source code of image2csv.py,
  • the tools.py file where useful functions are implemented,
  • the grid_detector.py file to perform automatic grid detection,
  • a folder with some files used for test.

The code is not well documented nor fully efficient as I'm a beginner in programming, and this project is a way for me to improve my skills, in particular in Python programming.

How to use the program

First of all, the user must install the needed packages:

$ pip install -r requirements.txt   

as well as Tesseract.

Then, in a python terminal, use the command line:

$ python image2csv.py --image path/to/image

There are a few optionnal arguments:

  • --path path/to/output/csv/file
  • --grid [False]/True
  • --visualization [y]/n
  • --method [fast]/denoize

and one can find their usage using the command line:

$ python image2csv.py --help

By default, the program will try to detect a grid automatically. This detection uses OpenCV's Hough transformation and Canny detection, so the user can tweak a few parameters for better processing in the grid_detector.py file.

When then program is running with manual grid detection, the user has to interact with it via its mouse and the terminal :

  1. the image is opened in a window for the user to draw a rectangle around the first (top left) number. As this rectangle is used as a base to create a grid afterward, keep in mind that all the numbers should fit into the box.
  2. A new window is opened showing the image with the drawn rectangle. Press any key to close and continue.
  3. Based on the drawn rectangle, a grid is created to extract each number one by one. This grid is controlled by the user via two "offset" values. The user has to enter those values in the terminal, then the image is opened in a window with the created grid. Press any key to close and continue. If the numbers does not fit into the grid, the user can change the offset values and repeat this step. When the grid matches the user's expectations, he can set both of the offset values to 0 to continue.
  4. The numbers are extracted from the image and the results are shown in the terminal. (be carefoul though, the indicated number of errors represents the number of errors encountered by Tesseract, but Tesseract can identify a wrong number which will not be counted as an error !)
  5. The .csv file is created with the numbers identified by Tesseract. If Tesseract finds an error, it will show up on the .csv file as an infinite value.

Hypothesis and limits

For the program to run correctly, the input image must verify some hypothesis (just a few simple ones):

  • for manual selection, the line and row width must be constants, as the build grid is just a repetition of the initial rectangle with offsets;
  • to use automatic grid detection, a full and clear grid, with external borders, must be visible;
  • it is recommended to have a good input image resolution, to control the offsets more easily.

At last, this program is not perfect (I know you thought so, with its smooth workflow and simple hypothesis, sorry to disappoint...) and does not work with decimal numbers... But does a great job on negatives ! Also the user must be careful with the slashed zero which seems to be identified by Tesseract as a six.

Credits

For image pre-processing in the tool.py file I used a useful function implemented by @Nitish9711 for his Automatic-Number-plate-detection (https://github.com/Nitish9711/Automatic-Number-plate-detection.git).

Owner
Beginning in the programming world with the help of @29jm, holy builder of the very special SnowflakeOS. Student at the École Centrale de Lille (FR).
Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data.

Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data.

HoloViz 2.9k Jan 06, 2023
Spectral Analysis in Python

SPECTRUM : Spectral Analysis in Python contributions: Please join https://github.com/cokelaer/spectrum contributors: https://github.com/cokelaer/spect

Thomas Cokelaer 280 Dec 16, 2022
Extract data from a wide range of Internet sources into a pandas DataFrame.

pandas-datareader Up to date remote data access for pandas, works for multiple versions of pandas. Installation Install using pip pip install pandas-d

Python for Data 2.5k Jan 09, 2023
Python for Data Analysis, 2nd Edition

Python for Data Analysis, 2nd Edition Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media Buy

Wes McKinney 18.6k Jan 08, 2023
songplays datamart provide details about the musical taste of our customers and can help us to improve our recomendation system

Songplays User activity datamart The following document describes the model used to build the songplays datamart table and the respective ETL process.

Leandro Kellermann de Oliveira 1 Jul 13, 2021
PandaPy has the speed of NumPy and the usability of Pandas 10x to 50x faster (by @firmai)

PandaPy "I came across PandaPy last week and have already used it in my current project. It is a fascinating Python library with a lot of potential to

Derek Snow 527 Jan 02, 2023
Data processing with Pandas.

Processing-data-with-python This is a simple example showing how to use Pandas to create a dataframe and the processing data with python. The jupyter

1 Jan 23, 2022
PyChemia, Python Framework for Materials Discovery and Design

PyChemia, Python Framework for Materials Discovery and Design PyChemia is an open-source Python Library for materials structural search. The purpose o

Materials Discovery Group 61 Oct 02, 2022
a tool that compiles a csv of all h1 program stats

h1stats - h1 Program Stats Scraper This python3 script will call out to HackerOne's graphql API and scrape all currently active programs for informati

Evan 40 Oct 27, 2022
PyNHD is a part of HyRiver software stack that is designed to aid in watershed analysis through web services.

A part of HyRiver software stack that provides access to NHD+ V2 data through NLDI and WaterData web services

Taher Chegini 23 Dec 14, 2022
Predictive Modeling & Analytics on Home Equity Line of Credit

Predictive Modeling & Analytics on Home Equity Line of Credit Data (Python) HMEQ Data Set In this assignment we will use Python to examine a data set

Dhaval Patel 1 Jan 09, 2022
Very useful and necessary functions that simplify working with data

Additional-function-for-pandas Very useful and necessary functions that simplify working with data random_fill_nan(module_name, nan) - Replaces all sp

Alexander Goldian 2 Dec 02, 2021
Churn prediction with PySpark

It is expected to develop a machine learning model that can predict customers who will leave the company.

3 Aug 13, 2021
Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions.

About Unsub is a collection analysis tool that assists libraries in analyzing their journal subscriptions. The tool provides rich data and a summary g

9 Nov 16, 2022
This module is used to create Convolutional AutoEncoders for Variational Data Assimilation

VarDACAE This module is used to create Convolutional AutoEncoders for Variational Data Assimilation. A user can define, create and train an AE for Dat

Julian Mack 23 Dec 16, 2022
A model checker for verifying properties in epistemic models

Epistemic Model Checker This is a model checker for verifying properties in epistemic models. The goal of the model checker is to check for Pluralisti

Thomas Träff 2 Dec 22, 2021
Udacity-api-reporting-pipeline - Udacity api reporting pipeline

udacity-api-reporting-pipeline In this exercise, you'll use portions of each of

Fabio Barbazza 1 Feb 15, 2022
WithPipe is a simple utility for functional piping in Python.

A utility for functional piping in Python that allows you to access any function in any scope as a partial.

Michael Milton 1 Oct 26, 2021
fds is a tool for Data Scientists made by DAGsHub to version control data and code at once.

Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc

DAGsHub 359 Dec 22, 2022
AptaMat is a simple script which aims to measure differences between DNA or RNA secondary structures.

AptaMAT Purpose AptaMat is a simple script which aims to measure differences between DNA or RNA secondary structures. The method is based on the compa

GEC UTC 3 Nov 03, 2022