Tools for parsing messy tabular data.

Last update: Nov 10, 2022

Related tags

Pipelines messytables

Overview

Parsing for messy tables

A library for dealing with messy tabular data in several formats, guessing types and detecting headers.

See the documentation at: https://messytables.readthedocs.io

Find the package at: https://pypi.python.org/pypi/messytables

See CONTRIBUTING.md for how to send patches and run tests.

Contact: Open Knowledge Labs - http://okfnlabs.org/contact/. We especially recommend the forum: http://discuss.okfn.org/category/open-knowledge-labs/

Comments

HTMLTableSet
Hi, here's an HTML Table Set importer for messytables.

It's not fantastic yet, but it's a pretty good start.

Supports rowspan/colspan - currently by inserting blank cells.

Supports multiple TABLE elements - but may have unexpected behaviour where there are nested tables.

Doesn't attempt to handle tables that aren't using TABLE, TR, TD, TH.

Not enormously well tested, but seems to work on the tables I've fed it so far.

Requires lxml.
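The colspan behaviour described above (padding spanned columns with blank cells) can be illustrated with a minimal sketch. This is not the importer from the pull request: it uses the standard library's html.parser instead of lxml so that it runs without extra dependencies, it ignores rowspan and nested tables, and the names TableRows and html_rows are hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect TR rows, padding each colspan with blank cells (rowspan is ignored)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a TR
        self._cell = None   # text fragments of the cell being read
        self._span = 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            span = dict(attrs).get("colspan") or "1"
            self._span = int(span) if span.isdigit() else 1

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            # A cell spanning N columns is followed by N - 1 blank cells.
            self._row.extend([""] * (self._span - 1))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_rows(markup):
    """Return a list of rows (lists of cell strings) found in the markup."""
    parser = TableRows()
    parser.feed(markup)
    parser.close()
    return parser.rows
```

Padding with blanks keeps every row the same width, which is what downstream header detection and type guessing generally expect.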

It's the first time I've ever made a pull request; let us know if there's anything we can do to improve it for you.
opened by scraperdragon 12

All releases BROKEN due to json-table-schema name change

json-table-schema is a broken dependency as of yesterday. This affects current and previous releases on PyPI.

To fix this at our end we've changed the dependency (https://github.com/okfn/messytables/pull/143), and messytables now installs from source again, but it needs a release to PyPI. I don't have permission for this.

(test) /tmp$ pip install messytables
Downloading/unpacking messytables
  Downloading messytables-0.15.0.tar.gz
  Running setup.py egg_info for package messytables

Downloading/unpacking xlrd>=0.8.0 (from messytables)
  Downloading xlrd-0.9.4.tar.gz (322Kb): 322Kb downloaded
  Running setup.py egg_info for package xlrd

Downloading/unpacking python-magic>=0.4.6 (from messytables)
  Downloading python-magic-0.4.10.tar.gz
  Running setup.py egg_info for package python-magic

    no previously-included directories found matching 'test'
Downloading/unpacking chardet>=2.3.0 (from messytables)
  Downloading chardet-2.3.0.tar.gz (164Kb): 164Kb downloaded
  Running setup.py egg_info for package chardet

    warning: no files found matching 'COPYING'
    warning: no files found matching '*.html' under directory 'docs'
    warning: no files found matching '*.css' under directory 'docs'
    warning: no files found matching '*.png' under directory 'docs'
    warning: no files found matching '*.gif' under directory 'docs'
Downloading/unpacking python-dateutil>=2.4.2 (from messytables)
  Downloading python-dateutil-2.4.2.tar.gz (209Kb): 209Kb downloaded
  Running setup.py egg_info for package python-dateutil

Downloading/unpacking lxml>=3.2 (from messytables)
  Downloading lxml-3.5.0b1.tar.gz (3.8Mb): 3.8Mb downloaded
  Running setup.py egg_info for package lxml
    Building lxml version 3.5.0b1.
    Building without Cython.
    Using build configuration of libxslt 1.1.26
    Building against libxml2/libxslt in the following directory: /usr/lib/x86_64-linux-gnu

    warning: no previously-included files found matching '*.py'
Downloading/unpacking requests (from messytables)
  Downloading requests-2.8.1.tar.gz (480Kb): 480Kb downloaded
  Running setup.py egg_info for package requests

Downloading/unpacking html5lib (from messytables)
  Downloading html5lib-1.0b8.tar.gz (889Kb): 889Kb downloaded
  Running setup.py egg_info for package html5lib

Downloading/unpacking json-table-schema>=0.2 (from messytables)
  Downloading json-table-schema-0.5.0.tar.gz
  Running setup.py egg_info for package json-table-schema
    json-table-schema has been replaced by jsontableschema. See https://github.com/okfn/json-table-schema-py-old for details.
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/tmp/test/build/json-table-schema/setup.py", line 16, in <module>
        with io.open(README_PATH, mode='r+t', encoding='utf-8') as stream:
    IOError: [Errno 2] No such file or directory: '/tmp/test/build/json-table-schema/README.md'
    Complete output from command python setup.py egg_info:
    json-table-schema has been replaced by jsontableschema. See https://github.com/okfn/json-table-schema-py-old for details.
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/tmp/test/build/json-table-schema/setup.py", line 16, in <module>
        with io.open(README_PATH, mode='r+t', encoding='utf-8') as stream:
    IOError: [Errno 2] No such file or directory: '/tmp/test/build/json-table-schema/README.md'

----------------------------------------
Command python setup.py egg_info failed with error code 1 in /tmp/test/build/json-table-schema
Storing complete log in /home/co/.pip/pip.log

opened by davidread 11

Getting messytables to run on Python 3

Does anyone know, informally or otherwise, what it will take to get messytables running on Python 3?

I'm keen to use various functions and modules from messytables, but I'm trying to maintain 2.7/3.3/3.4 support in my own libraries.

opened by pwalsh 11
Application for maintainership

Hey all. This repository seems to be semi-inactive, and it is unclear to me what the path to merging a PR like #171 is (who would have to approve?). I use messytables in production code day to day, and this lack of clarity on process makes the library a liability. My understanding is that okfn's resources and interest are focused on goodtables and the frictionlessdata toolchain.

I would therefore like to apply to become the maintainer for messytables, merge #171 & co., and generally make sure that changes in this thing are handled and bugs are actively tracked.

Thoughts, @pwalsh, @davidread, @rufuspollock? Please let me know.

opened by pudo 10
TypeError("object of type 'float' has no len()",) when calling type_guess

I could trace this back to #141 where len() is being used in the test() method of DateUtilType.

I think there should be a try/except block around that, that catches this TypeError. But I'm not too familiar with the code, so I'm basically asking if you agree, or if I'm missing something.

I'm happy to provide the PR.

BTW: I'm getting this error via datapusher on an Excel sheet that is being parsed with the default parameters. The Excel sheet does indeed contain a lot of float values.
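The guard proposed above might look roughly like the following. This is only a sketch of the idea, not the actual DateUtilType.test() code: the function name looks_like_date_candidate and the length bounds are assumptions made for illustration.

```python
def looks_like_date_candidate(value, min_len=4, max_len=40):
    """Return True only for strings whose length is plausible for a date.

    The bounds are illustrative assumptions, not values taken from messytables.
    """
    try:
        length = len(value)
    except TypeError:
        # Floats and ints (common in Excel sheets) have no len(),
        # so they cannot be date strings.
        return False
    return isinstance(value, str) and min_len <= length <= max_len
```

Catching TypeError at the point where len() is called keeps the fix local, so a single numeric cell no longer aborts type guessing for the whole sheet.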

opened by metaodi 10
[discussion] messytables should *only* work with local files

Messytables doesn't work well in a lot of situations when the provided fileobj is a socket.

The BufferedFile object attempts to resolve this, but in a lot of cases it will force a read(-1) and cause a complete download of the file (into RAM) anyway. This is particularly true of anything that wants to seek within the file (such as zip and xls) or of the buffer passed to magic.from_buffer (which is inadequate in some cases; from_file would be more accurate).

Downloading the content to temporary storage isn't an onerous task, and if the interface were modified to use filenames instead of file objects, it could even transparently download the content when a URL is provided (which it is destined to do anyway at some point).
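As a rough illustration of the "download to temporary storage first" idea, the sketch below copies any readable stream to a named temporary file and returns its path. The helper name spool_to_tempfile is hypothetical and is not part of messytables; the caller is responsible for deleting the file.

```python
import io
import os
import shutil
import tempfile


def spool_to_tempfile(fileobj, suffix=""):
    """Copy a possibly non-seekable stream to a temp file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        # copyfileobj reads in chunks, so the whole file is never held in RAM.
        shutil.copyfileobj(fileobj, tmp)
        return tmp.name


# Usage: any object with read() works, for example a socket-backed response.
path = spool_to_tempfile(io.BytesIO(b"a,b\n1,2\n"), suffix=".csv")
try:
    with open(path, "rb") as local:
        local.seek(4)           # seeking is now cheap and reliable
        tail = local.read()
finally:
    os.remove(path)
```

Once the data is on disk, readers that need to seek, and detection by filename, no longer have to buffer the stream themselves.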
question

opened by rossjones 10
Support for PDF format

We've been exploring different options for parsing PDFs. Currently we're using an (alpha) in-house library called pdftables (we blogged about it here).

This pull request integrates pdftables into messytables. It is an optional requirement - if pdftables is not installed, messytables will work as usual and the PDF tests will be skipped.
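One common way to make a dependency optional is to probe for it at import time and enable the feature only when it is present. The sketch below shows that pattern under stated assumptions: has_module and available_formats are hypothetical helpers, the format names are illustrative, and no pdftables function is called because its API is not described here.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


def available_formats():
    """List the formats this (hypothetical) build can read."""
    formats = ["csv", "xls", "html"]
    if has_module("pdftables"):
        # PDF support is offered only when the optional library is installed.
        formats.append("pdf")
    return formats
```

A test suite can use the same check to skip its PDF tests when the library is missing.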

We're looking into other ways of extracting tables from PDFs, but either way we'll need the messytables integration.

opened by fawkesley 9
[WIP] Support for ODS files.

A reworked reader for ODS files that doesn't use any broken third-party libraries. Reads the .xml directly from the zipfile and performs much better on larger spreadsheets.
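To give an idea of what "reading the .xml directly from the zipfile" involves: an ODS file is a zip archive whose content.xml holds the sheets. The sketch below is not the reader from this pull request. It parses the whole document into memory with ElementTree, ignores repeated-column and repeated-row attributes, and the function name ods_rows is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument namespaces used by spreadsheet content.
NS = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}


def ods_rows(fileobj):
    """Yield (sheet_name, [cell_text, ...]) for each row of each sheet."""
    with zipfile.ZipFile(fileobj) as archive:
        with archive.open("content.xml") as content:
            root = ET.parse(content).getroot()
    name_attr = "{%s}name" % NS["table"]
    for sheet in root.iter("{%s}table" % NS["table"]):
        for row in sheet.iterfind("table:table-row", NS):
            cells = []
            for cell in row.iterfind("table:table-cell", NS):
                # A cell may hold several paragraphs; join them with newlines.
                paragraphs = cell.findall("text:p", NS)
                cells.append("\n".join("".join(p.itertext()) for p in paragraphs))
            yield sheet.get(name_attr), cells
```

A reader aimed at large spreadsheets would more likely parse content.xml incrementally rather than building the full tree as this sketch does.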

opened by rossjones 9

libmagic error following messytables overview

I'm working from http://messytables.readthedocs.org/en/latest/ but have also looked at the GitHub README, etc. I couldn't find any actual install instructions anywhere, so here's what I did.

Environment: Mac OS X latest, up to date homebrew

pip install messytables
brew install libmagic

The following Python:

% python                
Python 2.7.6 (default, Nov 14 2013, 09:55:56) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import messytables
>>> messytables.any_tableset(open('README.txt', 'rb'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/messytables/any.py", line 138, in any_tableset
    magic_mime = get_mime(fileobj)
  File "/usr/local/lib/python2.7/site-packages/messytables/any.py", line 38, in get_mime
    mimetype = magic.from_buffer(header, mime=True)
  File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 103, in from_buffer
    def __init__(self, ms):
  File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 94, in _get_magic_type
    _list = _libraries['magic'].magic_list
  File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 83, in _get_magic_mime
    _load.restype = c_int
  File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 51, in __init__
    magic_set._fields_ = []
  File "build/bdist.macosx-10.9-x86_64/egg/magic.py", line 138, in errorcheck
    except:
magic.MagicException: no magic files loaded

README.txt:

============
README
============

A single-line README file.

opened by dhalperi 8

Remove openpyxl, use XLSTableSet for XLSX files

Phase 1 of 2 for completely removing openpyxl and using XLSTableSet instead. (Phase 2 will actually remove the dependency and excelx.py; after that you won't be able to reference XLSXTableSet.)

If you always use any_tableset it'll just work correctly - you'll now get back an XLSTableSet instead of an XLSXTableSet.

I've left the latter in with a DeprecationWarning (and test) in order to remain compatible with code that explicitly uses XLSXTableSet.
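Keeping an old class name alive with a DeprecationWarning usually looks something like the sketch below. The two classes here are stand-ins written for illustration; they are not the messytables implementations, and the warning text is an assumption.

```python
import warnings


class XLSTableSet(object):
    """Stand-in for the real reader; only the constructor is sketched."""

    def __init__(self, fileobj=None):
        self.fileobj = fileobj


class XLSXTableSet(XLSTableSet):
    """Deprecated name, kept so that existing imports continue to work."""

    def __init__(self, *args, **kwargs):
        warnings.warn(
            "XLSXTableSet is deprecated; use any_tableset or XLSTableSet",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this line
        )
        super().__init__(*args, **kwargs)
```

Because the old name subclasses the new one, isinstance checks against XLSTableSet keep working for callers who still construct XLSXTableSet.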

I feel we should encourage people towards only using any_tableset (perhaps with an argument to override the type detection). It's quite awkward that our users are currently coupling needlessly to our class naming convention. Unless I've missed a use case, are there any compelling reasons to allow that?

Not ready to merge yet I suspect. Closes #83

opened by fawkesley 8
65 rework of detection in any.py
We were having problems with any.py, so I rewrote it.

Features:

new extension detection function (you can pass a whole filename/URL)

nice lists of mimetypes/extensions parsed

special pleading for XLS/XLSX files :(

tests for autodetection

various fixes
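The extension detection mentioned in the feature list (accepting a whole filename or URL) could be sketched as follows. This is an illustration only, not the code from the pull request; the function name guess_extension is hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def guess_extension(name):
    """Return the lower-cased extension of a filename or URL, or None."""
    # urlparse strips any query string or fragment; a bare filename
    # simply ends up in .path unchanged.
    path = urlparse(name).path
    ext = posixpath.splitext(path)[1]
    return ext.lstrip(".").lower() or None
```

For example, guess_extension("http://example.com/data/file.XLSX?x=1") gives "xlsx", while a name with no extension gives None so that the caller can fall back to MIME detection.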
opened by scraperdragon 8

Failure to load with Python 3.10

Attempting to use messytables with Python 3.10 results in the following error:

  File "/layers/google.python.pip/pip/lib/python3.10/site-packages/messytables/core.py", line 2, in <module>
    from collections import Mapping
ImportError: cannot import name 'Mapping' from 'collections' (/opt/python3.10/lib/python3.10/collections/__init__.py)

This is because the deprecated alias collections.Mapping was removed in Python 3.10; Mapping has lived in collections.abc since Python 3.3.

core.py should be updated to take account of this.
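A compatible import, as a minimal sketch of the kind of change core.py would need (the helper is_mapping is hypothetical and only there to show the import in use):

```python
try:
    # Python 3.3 and later
    from collections.abc import Mapping
except ImportError:
    # Python 2 fallback
    from collections import Mapping


def is_mapping(obj):
    """Return True for dict-like objects."""
    return isinstance(obj, Mapping)
```

If Python 2 support is no longer needed, the try/except can be dropped in favour of the collections.abc import alone.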

opened by davidharcombe 0

Bump lxml from 4.3.4 to 4.9.1
Bumps lxml from 4.3.4 to 4.9.1.

Changelog

Sourced from lxml's changelog.

4.9.1 (2022-07-01)

Bugs fixed

A crash was resolved when using iterwalk() (or canonicalize()) after parsing certain incorrect input. Note that iterwalk() can crash on valid input parsed with the same parser after failing to parse the incorrect input.

4.9.0 (2022-06-01)

Bugs fixed

GH#341: The mixin inheritance order in lxml.html was corrected. Patch by xmo-odoo.

Other changes

Built with Cython 0.29.30 to adapt to changes in Python 3.11 and 3.12.

Wheels include zlib 1.2.12, libxml2 2.9.14 and libxslt 1.1.35 (libxml2 2.9.12+ and libxslt 1.1.34 on Windows).

GH#343: Windows-AArch64 build support in Visual Studio. Patch by Steve Dower.

4.8.0 (2022-02-17)

Features added

GH#337: Path-like objects are now supported throughout the API instead of just strings. Patch by Henning Janssen.

The ElementMaker now supports QName values as tags, which always override the default namespace of the factory.

Bugs fixed

GH#338: In lxml.objectify, the XSI float annotation "nan" and "inf" were spelled in lower case, whereas XML Schema datatypes define them as "NaN" and "INF" respectively.

... (truncated)

Commits

d01872c Prevent parse failure in new test from leaking into later test runs.

d65e632 Prepare release of lxml 4.9.1.

86368e9 Fix a crash when incorrect parser input occurs together with usages of iterwa...

50c2764 Delete unused Travis CI config and reference in docs (GH-345)

8f0bf2d Try to speed up the musllinux AArch64 build by splitting the different CPytho...

b9f7074 Remove debug print from test.

b224e0f Try to install 'xz' in wheel builds, if available, since it's now needed to e...

897ebfa Update macOS deployment target version from 10.14 to 10.15 since 10.14 starts...

853c9e9 Prepare release of 4.9.0.

d3f77e6 Add a test for https://bugs.launchpad.net/lxml/+bug/1965070 leaving out the a...

Additional commits viewable in compare view

dependencies
opened by dependabot[bot] 0
messytables guesses wrong type for decimal number
Describe the bug: messytables should guess decimals correctly, respecting the locale configuration. For example, in Germany the comma is used as the decimal separator, but the value 1,200 is guessed as type "text".

This issue was initially reported as ckan issue https://github.com/ckan/ckan/issues/5769 where I recognized it.

The type guessing seems to happen here: https://github.com/okfn/messytables/blob/51b736892a48e420ab313675f54901c77b446dec/messytables/types.py and appears to be locale-specific. (I think the magic happens in line 100: value = locale.atof(value))

Unfortunately, Python seems to recognize only a dot as the decimal point even when a German locale is set, which I could reproduce in my local environment:

>>> locale.getlocale()
('de_DE', 'cp1252')
>>> locale.atof('1,200')
Traceback (most recent call last):
  File "<pyshell#35>", line 1, in <module>
    locale.atof('1,200')
  File "C:\Program Files\Python27\lib\locale.py", line 318, in atof
    return func(string)
ValueError: invalid literal for float(): 1,200
>>> locale.localeconv()
{'mon_decimal_point': '', 'int_frac_digits': 127, 'p_sep_by_space': 127, 'frac_digits': 127, 'thousands_sep': '', 'n_sign_posn': 127, 'decimal_point': '.', 'int_curr_symbol': '', 'n_cs_precedes': 127, 'p_sign_posn': 127, 'mon_thousands_sep': '', 'negative_sign': '', 'currency_symbol': '', 'n_sep_by_space': 127, 'mon_grouping': [], 'p_cs_precedes': 127, 'positive_sign': '', 'grouping': []}
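A note on the session above: locale.getlocale() called without arguments reports the LC_CTYPE category, while locale.atof() relies on the numeric conventions returned by localeconv(). The pasted localeconv() output shows 'decimal_point': '.', which suggests the numeric category was still the default "C" locale in that session. One way to avoid depending on process-wide locale state is to pass the separators explicitly, as in the sketch below. The function parse_decimal is hypothetical and is not part of messytables.

```python
from decimal import Decimal, InvalidOperation


def parse_decimal(text, decimal_point=",", thousands_sep="."):
    """Parse a number written with explicit separators; return None on failure.

    The defaults follow German conventions and are an assumption for this sketch.
    """
    cleaned = text.strip()
    if thousands_sep:
        cleaned = cleaned.replace(thousands_sep, "")
    if decimal_point != ".":
        cleaned = cleaned.replace(decimal_point, ".")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

With the German defaults, "1,200" parses as the decimal 1.200 and "1.234,5" as 1234.5; passing decimal_point="." and thousands_sep="," reads "1,200" as 1200 instead.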
opened by wrinklenose 1
test_attempt_read_encrypted_no_password_xls failure in Python 3.7+
This line specifies an error message. In the test, the text of the exception caused by the code under test is expected to match exactly.

errmsg = "Can't read Excel file: XLRDError('Workbook is encrypted',)"

When running tests on Python 3.7 and 3.8 this fails, because their outputs do not contain the comma (probably due to this change in Python 3.7, I'm guessing).
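Python 3.7 changed the repr of exceptions so that a single-argument exception no longer shows a trailing comma, which is why an exact string comparison breaks. A test that checks for the meaningful fragments instead is independent of that detail. The sketch below uses a stand-in XLRDError class so that it runs without xlrd; read_error_message is a hypothetical helper that mimics the message format quoted above.

```python
class XLRDError(Exception):
    """Stand-in for xlrd's exception class, so the sketch needs no xlrd install."""


def read_error_message(exc):
    """Build a message in the style quoted in the issue (hypothetical helper)."""
    return "Can't read Excel file: %r" % (exc,)


message = read_error_message(XLRDError("Workbook is encrypted"))

# Compare on fragments rather than on the exact repr, which differs
# between Python versions before and after 3.7.
expected_fragments = ("Can't read Excel file", "XLRDError", "Workbook is encrypted")
assert all(fragment in message for fragment in expected_fragments)
```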
opened by StevenMaude 0
requirements-test.txt should have xlrd==1.2.0 (or >=) for Python 3.8+ tests

This version of xlrd is currently pinned for testing on Travis in requirements-test.txt.

Prior to v1.2.0, xlrd used the time.clock() function inside book.py and this was removed in Python 3.8.
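For context, code that still needs a timer on both old and new interpreters can pick whichever function exists. This is a sketch of that general pattern, not the change made in xlrd:

```python
import time

# time.perf_counter exists on Python 3.3+; time.clock is only looked up
# when perf_counter is missing, so this line is safe on Python 3.8+.
timer = getattr(time, "perf_counter", None) or getattr(time, "clock")

start = timer()
elapsed = timer() - start  # elapsed seconds as a float
```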

opened by StevenMaude 0

Releases(0.15.1)

0.15.1(Sep 29, 2016)

Source code(tar.gz)
Source code(zip)

Owner

Open Knowledge Foundation

Also find us at: @frictionlessdata @opentrials @openspending @openknowledge-archive

GitHub Repository http://messytables.readthedocs.io/

Easy pipelines for pandas DataFrames.

pdpipe Easy pipelines for pandas DataFrames (learn how!). Website: https://pdpipe.github.io/pdpipe/ Documentation: https://pdpipe.github.io/pdpipe/d

694 Jan 05, 2023

Build, test, deploy, iterate - Dev and prod tool for data science pipelines

Prodmodel is a build system for data science pipelines. Users, testers, contributors are welcome! Motivation · Concepts · Installation · Usage · Contr

53 Nov 29, 2022

functional data manipulation for pandas

pandas-ply: functional data manipulation for pandas pandas-ply is a thin layer which makes it easier to manipulate data with pandas. In particular, it

188 Nov 24, 2022

Pandas integration with sklearn

Sklearn-pandas This module provides a bridge between Scikit-Learn's machine learning methods and pandas-style Data Frames. In particular, it provides

2.7k Dec 27, 2022

BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflows even for datasets that do not fit into memory.

BatchFlow BatchFlow helps you conveniently work with random or sequential batches of your data and define data processing and machine learning workflo

185 Dec 20, 2022

Directions overlay for working with pandas in an analysis environment

Tools for parsing messy tabular data.

Clean APIs for data cleaning. Python implementation of R package Janitor

dplyr for python

Microsoft Azure provides a wide number of services for managing and storing data

A Python toolkit for processing tabular data