Tools for test driven data-wrangling and data validation.

Last update: Dec 16, 2022

Overview

datatest: Test driven data-wrangling and data validation

Datatest helps to speed up and formalize data-wrangling and data validation tasks. It implements a system of validation methods, difference classes, and acceptance managers. Datatest can help you:

Clean and wrangle data faster and more accurately.
Maintain a record of checks and decisions regarding important data sets.
Distinguish between ideal criteria and acceptable deviation.
Validate the input and output of data pipeline components.
Measure progress of data preparation tasks.
On-board new team members with an explicit and structured process.

Datatest can be used directly in your own projects or as part of a testing framework like pytest or unittest. It has no hard dependencies; it's tested on Python 2.6, 2.7, 3.2 through 3.10, PyPy, and PyPy3; and is freely available under the Apache License, version 2.

Documentation:	https://datatest.readthedocs.io/ (stable) https://datatest.readthedocs.io/en/latest/ (latest)
Official:	https://pypi.org/project/datatest/

Code Examples

Validating a Dictionary of Lists

from datatest import validate, accepted, Invalid


data = {
    'A': [1, 2, 3, 4],
    'B': ['x', 'y', 'x', 'x'],
    'C': ['foo', 'bar', 'baz', 'EMPTY']
}

validate(data.keys(), {'A', 'B', 'C'})

validate(data['A'], int)

validate(data['B'], {'x', 'y'})

with accepted(Invalid('EMPTY')):
    validate(data['C'], str.islower)

Validating a Pandas DataFrame

import pandas as pd
from datatest import register_accessors, accepted, Invalid


register_accessors()
df = pd.read_csv('data.csv')

df.columns.validate({'A', 'B', 'C'})

df['A'].validate(int)

df['B'].validate({'x', 'y'})

with accepted(Invalid('EMPTY')):
    df['C'].validate(str.islower)

Installation

The easiest way to install datatest is to use pip:

pip install datatest

If you are upgrading from version 0.11.0 or newer, use the --upgrade option:

pip install --upgrade datatest

Upgrading From Version 0.9.6

If you have an existing codebase of older datatest scripts, you should upgrade using the following steps:

Install datatest 0.10.0 first:

pip install --force-reinstall datatest==0.10.0

Run your existing code and check for DeprecationWarnings.
Update the parts of your code that use deprecated features.
Once your code is running without DeprecationWarnings, install the latest version of datatest:
```
pip install --upgrade datatest
```
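One way to carry out the "check for DeprecationWarnings" step is to record warnings while the old code runs. The sketch below uses only the standard `warnings` module; `old_api()` and `collect_deprecations()` are hypothetical names introduced for illustration and are not part of datatest.

```python
import warnings


def old_api():
    """Hypothetical stand-in for a call that uses a deprecated feature."""
    warnings.warn("old_api() is deprecated", DeprecationWarning, stacklevel=2)
    return 42


def collect_deprecations(func):
    """Call func() and return the DeprecationWarning messages it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # Record every warning, even repeats.
        func()
    return [str(w.message) for w in caught
            if issubclass(w.category, DeprecationWarning)]


messages = collect_deprecations(old_api)
print(messages)
```

Alternatively, running a script with `python -W error::DeprecationWarning` turns each such warning into an exception, which makes the offending call easy to locate from the traceback.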

Stuntman Mike

If you need bug-fixes or features that are not available in the current stable release, you can "pip install" the development version directly from GitHub:

pip install --upgrade https://github.com/shawnbrown/datatest/archive/master.zip

All of the usual caveats for a development install should apply---only use this version if you can risk some instability or if you know exactly what you're doing. While care is taken to never break the build, it can happen.

Safety-first Clyde

If you need to review and test packages before installing, you can install datatest manually.

Download the latest source distribution from the Python Package Index (PyPI):

https://pypi.org/project/datatest/#files

Unpack the file (replacing X.Y.Z with the appropriate version number) and review the source code:

tar xvfz datatest-X.Y.Z.tar.gz

Change to the unpacked directory and run the tests:

cd datatest-X.Y.Z
python setup.py test

Don't worry if some of the tests are skipped. Tests for optional data sources (like pandas DataFrames or NumPy arrays) are skipped when the related third-party packages are not installed.

If the source code and test results are satisfactory, install the package:

python setup.py install

Supported Versions

Tested on Python 2.6, 2.7, 3.2 through 3.10, PyPy, and PyPy3. Datatest is pure Python and may also run on other implementations (check using "setup.py test" before installing).

Backward Compatibility

If you have existing tests that use API features which have changed since 0.9.0, you can still run your old code by adding the following import to the beginning of each file:

from datatest.__past__ import api09

To maintain existing test code, this project makes a best-effort attempt to provide backward compatibility support for older features. The API will be improved in the future but only in measured and sustainable ways.

All of the data used at the National Committee for an Effective Congress has been checked with datatest for several years so there is, already, a large and growing codebase that relies on current features and must be maintained into the future.

Soft Dependencies

Datatest has no hard, third-party dependencies. But if you want to interface with pandas DataFrames, NumPy arrays, or other optional data sources, you will need to install the relevant packages (pandas, numpy, etc.).
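Before relying on an optional integration, it can help to check which of the optional packages are importable. The following is a minimal sketch using the standard library; the helper name `available()` is an assumption made for this example and is not a datatest function.

```python
import importlib.util


def available(package_name):
    """Return True if the named top-level package can be imported."""
    return importlib.util.find_spec(package_name) is not None


# Optional packages mentioned above.
optional = ["pandas", "numpy"]
status = {name: available(name) for name in optional}
print(status)
```

A package reported as unavailable can then be installed with pip before running checks that need it.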

Development Repository

The development repository for datatest is hosted on GitHub.

Freely licensed under the Apache License, Version 2.0

Comments

validation errors Extra(nan) or Invalid(nan)
Shawn, I am trying your package to see if I can validate a csv file by reading it in pandas. I am getting Extra(nan) from dt.validate.superset() or Invalid(nan) from dt.validate(). Is there a way I can include those NaN values in my validation sets?

Error looks like

E ValidationError: may contain only elements of given superset (10000 differences): [ Extra(nan), Extra(nan), Extra(nan),

Note: I am reading this particular column as str

E ValidationError: does not satisfy 'str' (10000 differences): [ Invalid(nan), Invalid(nan), Invalid(nan), Invalid(nan),

Let me know if you find a solution or can help me debug
opened by upretip 5
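Part of what makes NaN awkward in checks like this is that a NaN value never compares equal to anything, including itself, so equality-based membership tests tend to miss it. The sketch below is plain Python and is not datatest's own NaN handling; the helper names `is_nan()` and `is_str_or_nan()` are hypothetical.

```python
import math

nan = float('nan')
nan_equals_itself = (nan == nan)  # NaN is not equal to itself.


def is_nan(value):
    """Return True for float NaN values and False for everything else."""
    return isinstance(value, float) and math.isnan(value)


def is_str_or_nan(value):
    """Predicate that passes strings and also tolerates NaN."""
    return isinstance(value, str) or is_nan(value)


column = ['foo', float('nan'), 'bar']
failures = [v for v in column if not is_str_or_nan(v)]
print(nan_equals_itself, failures)
```

A predicate function like this could presumably be passed as a requirement in the same way `str.islower` is used in the code examples above. The 0.10.0 release notes also mention improved NaN handling and a how-to document on the topic.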

Crashes pytest-xdist processes (NOTE: See comments for fix.)

Hi, all! I ran into a problem when running my tests with pytest-xdist.

macOS (also checked on Debian), Python 3.8.2

pytest==5.4.3 pytest-xdist==1.33.0 datatest==0.9.6

from datatest import accepted, Extra, validate as __validate


def test_should_passed():
    with accepted(Extra):
        __validate({"qwe": 1}, {"qwe": 1}, "")


def test_should_failed():
    with accepted(Extra):
        __validate({"qwe": 1}, {"qwe": 2}, "")


if __name__ == '__main__':
    import sys, pytest
    sys.exit(pytest.main(['/Users/qa/PycharmProjects/qa/test123.py', '-vvv', '-n', '1', '-s']))

Output:

test123.py::test_should_passed 
[gw0] PASSED test123.py::test_should_passed 
test123.py::test_should_failed !!!!!!!!!!!!!!!!!!!! <ExceptionInfo RuntimeError('\'----------------------------------------------------------------------------------------------------\'.../issues\'\n\'----------------------------------------------------------------------------------------------------\'\n') tblen=14>

INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/xdist/workermanage.py", line 334, in process_from_remote
INTERNALERROR>     rep = self.config.hook.pytest_report_from_serializable(
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/pluggy/hooks.py", line 286, in __call__
INTERNALERROR>     return self._hookexec(self, self.get_hookimpls(), kwargs)
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/pluggy/manager.py", line 93, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook, methods, kwargs)
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/pluggy/manager.py", line 84, in <lambda>
INTERNALERROR>     self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/pluggy/callers.py", line 208, in _multicall
INTERNALERROR>     return outcome.get_result()
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/pluggy/callers.py", line 80, in get_result
INTERNALERROR>     raise ex[1].with_traceback(ex[2])
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/pluggy/callers.py", line 187, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/_pytest/reports.py", line 355, in pytest_report_from_serializable
INTERNALERROR>     return TestReport._from_json(data)
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/_pytest/reports.py", line 193, in _from_json
INTERNALERROR>     kwargs = _report_kwargs_from_json(reportdict)
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/_pytest/reports.py", line 485, in _report_kwargs_from_json
INTERNALERROR>     reprtraceback = deserialize_repr_traceback(
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/_pytest/reports.py", line 468, in deserialize_repr_traceback
INTERNALERROR>     repr_traceback_dict["reprentries"] = [
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/_pytest/reports.py", line 469, in <listcomp>
INTERNALERROR>     deserialize_repr_entry(x) for x in repr_traceback_dict["reprentries"]
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/_pytest/reports.py", line 464, in deserialize_repr_entry
INTERNALERROR>     _report_unserialization_failure(entry_type, TestReport, reportdict)
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/_pytest/reports.py", line 206, in _report_unserialization_failure
INTERNALERROR>     raise RuntimeError(stream.getvalue())
INTERNALERROR> RuntimeError: '----------------------------------------------------------------------------------------------------'
INTERNALERROR> 'INTERNALERROR: Unknown entry type returned: DatatestReprEntry'
INTERNALERROR> "report_name: <class '_pytest.reports.TestReport'>"
INTERNALERROR> {'$report_type': 'TestReport',
INTERNALERROR>  'duration': 0.002020120620727539,
INTERNALERROR>  'item_index': 1,
INTERNALERROR>  'keywords': {'qa': 1, 'test123.py': 1, 'test_should_failed': 1},
INTERNALERROR>  'location': ('test123.py', 8, 'test_should_failed'),
INTERNALERROR>  'longrepr': {'chain': [({'extraline': None,
INTERNALERROR>                           'reprentries': [{'data': {'lines': ['    def '
INTERNALERROR>                                                               'test_should_failed():',
INTERNALERROR>                                                               '        with '
INTERNALERROR>                                                               'accepted(Extra):',
INTERNALERROR>                                                               '>           '
INTERNALERROR>                                                               '__validate({"qwe": '
INTERNALERROR>                                                               '1}, {"qwe": 2}, '
INTERNALERROR>                                                               '"")',
INTERNALERROR>                                                               'E           '
INTERNALERROR>                                                               'datatest.ValidationError: '
INTERNALERROR>                                                               'does not '
INTERNALERROR>                                                               'satisfy 2 (1 '
INTERNALERROR>                                                               'difference): {',
INTERNALERROR>                                                               'E               '
INTERNALERROR>                                                               "'qwe': "
INTERNALERROR>                                                               'Deviation(-1, '
INTERNALERROR>                                                               '2),',
INTERNALERROR>                                                               'E           }'],
INTERNALERROR>                                                     'reprfileloc': {'lineno': 11,
INTERNALERROR>                                                                     'message': 'ValidationError',
INTERNALERROR>                                                                     'path': 'test123.py'},
INTERNALERROR>                                                     'reprfuncargs': {'args': []},
INTERNALERROR>                                                     'reprlocals': None,
INTERNALERROR>                                                     'style': 'long'},
INTERNALERROR>                                            'type': 'DatatestReprEntry'}],
INTERNALERROR>                           'style': 'long'},
INTERNALERROR>                          {'lineno': 11,
INTERNALERROR>                           'message': 'datatest.ValidationError: does not '
INTERNALERROR>                                      'satisfy 2 (1 difference): {\n'
INTERNALERROR>                                      "    'qwe': Deviation(-1, 2),\n"
INTERNALERROR>                                      '}',
INTERNALERROR>                           'path': '/Users/qa/PycharmProjects/qa/test123.py'},
INTERNALERROR>                          None)],
INTERNALERROR>               'reprcrash': {'lineno': 11,
INTERNALERROR>                             'message': 'datatest.ValidationError: does not '
INTERNALERROR>                                        'satisfy 2 (1 difference): {\n'
INTERNALERROR>                                        "    'qwe': Deviation(-1, 2),\n"
INTERNALERROR>                                        '}',
INTERNALERROR>                             'path': '/Users/qa/PycharmProjects/qa/test123.py'},
INTERNALERROR>               'reprtraceback': {'extraline': None,
INTERNALERROR>                                 'reprentries': [{'data': {'lines': ['    def '
INTERNALERROR>                                                                     'test_should_failed():',
INTERNALERROR>                                                                     '        '
INTERNALERROR>                                                                     'with '
INTERNALERROR>                                                                     'accepted(Extra):',
INTERNALERROR>                                                                     '>           '
INTERNALERROR>                                                                     '__validate({"qwe": '
INTERNALERROR>                                                                     '1}, '
INTERNALERROR>                                                                     '{"qwe": '
INTERNALERROR>                                                                     '2}, "")',
INTERNALERROR>                                                                     'E           '
INTERNALERROR>                                                                     'datatest.ValidationError: '
INTERNALERROR>                                                                     'does not '
INTERNALERROR>                                                                     'satisfy 2 '
INTERNALERROR>                                                                     '(1 '
INTERNALERROR>                                                                     'difference): '
INTERNALERROR>                                                                     '{',
INTERNALERROR>                                                                     'E               '
INTERNALERROR>                                                                     "'qwe': "
INTERNALERROR>                                                                     'Deviation(-1, '
INTERNALERROR>                                                                     '2),',
INTERNALERROR>                                                                     'E           '
INTERNALERROR>                                                                     '}'],
INTERNALERROR>                                                           'reprfileloc': {'lineno': 11,
INTERNALERROR>                                                                           'message': 'ValidationError',
INTERNALERROR>                                                                           'path': 'test123.py'},
INTERNALERROR>                                                           'reprfuncargs': {'args': []},
INTERNALERROR>                                                           'reprlocals': None,
INTERNALERROR>                                                           'style': 'long'},
INTERNALERROR>                                                  'type': 'DatatestReprEntry'}],
INTERNALERROR>                                 'style': 'long'},
INTERNALERROR>               'sections': []},
INTERNALERROR>  'nodeid': 'test123.py::test_should_failed',
INTERNALERROR>  'outcome': 'failed',
INTERNALERROR>  'sections': [],
INTERNALERROR>  'testrun_uid': 'c913bf205a874a50a237dcf40d482d06',
INTERNALERROR>  'user_properties': [],
INTERNALERROR>  'when': 'call',
INTERNALERROR>  'worker_id': 'gw0'}
INTERNALERROR> 'Please report this bug at https://github.com/pytest-dev/pytest/issues'
INTERNALERROR> '----------------------------------------------------------------------------------------------------'
[gw0] node down: <ExceptionInfo RuntimeError('\'----------------------------------------------------------------------------------------------------\'.../issues\'\n\'----------------------------------------------------------------------------------------------------\'\n') tblen=14>
[gw0] FAILED test123.py::test_should_failed 

replacing crashed worker gw0
[gw1] darwin Python 3.8.3 cwd: /Users/qa/PycharmProjects/qa
INTERNALERROR> Traceback (most recent call last):
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/_pytest/main.py", line 191, in wrap_session
INTERNALERROR>     session.exitstatus = doit(config, session) or 0
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/_pytest/main.py", line 247, in _main
INTERNALERROR>     config.hook.pytest_runtestloop(session=session)
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/pluggy/hooks.py", line 286, in __call__
INTERNALERROR>     return self._hookexec(self, self.get_hookimpls(), kwargs)
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/pluggy/manager.py", line 93, in _hookexec
INTERNALERROR>     return self._inner_hookexec(hook, methods, kwargs)
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/pluggy/manager.py", line 84, in <lambda>
INTERNALERROR>     self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/pluggy/callers.py", line 208, in _multicall
INTERNALERROR>     return outcome.get_result()
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/pluggy/callers.py", line 80, in get_result
INTERNALERROR>     raise ex[1].with_traceback(ex[2])
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/pluggy/callers.py", line 187, in _multicall
INTERNALERROR>     res = hook_impl.function(*args)
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/xdist/dsession.py", line 112, in pytest_runtestloop
INTERNALERROR>     self.loop_once()
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/xdist/dsession.py", line 135, in loop_once
INTERNALERROR>     call(**kwargs)
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/xdist/dsession.py", line 263, in worker_runtest_protocol_complete
INTERNALERROR>     self.sched.mark_test_complete(node, item_index, duration)
INTERNALERROR>   File "/Users/qa/PycharmProjects/qa/venv/lib/python3.8/site-packages/xdist/scheduler/load.py", line 151, in mark_test_complete
INTERNALERROR>     self.node2pending[node].remove(item_index)
INTERNALERROR> KeyError: <WorkerController gw0>

But if I change the second test like this, everything works fine:

def test_should_failed():
    try:
        with accepted(Extra):
            __validate({"qwe": 1}, {"qwe": 2}, "")
    except:
        raise ValueError

I don't know exactly where I should file a bug/issue about this :)

bug

opened by VasilyevAA 3

AcceptedExtra not working as expected with dicts

I expected with AcceptedExtra(): to ignore missing keys in dicts, but instead it raises a Deviation from None.

Here is an example:

actual = {'a': 1, 'b': 2}
expected = {'b': 2}
with AcceptedExtra():
    validate(actual, requirement=expected)

The output is:

E           ValidationError: does not satisfy mapping requirements (1 difference): {
                'a': Deviation(+1, None),
            }

Thanks for the cool package, by the way!

opened by TheOtherDude 3

Add pytest framework trove classifier

Adding the trove classifier will signal that datatest also acts as a pytest plugin. This will also help https://plugincompat.herokuapp.com to find it and list it as a plugin and do regular installation checks.

For further details see this recently merged PR from hypothesis: https://github.com/HypothesisWorks/hypothesis/pull/1306

opened by obestwalter 3
Magic Reduction
Issue #7 exposes the degree of magic that is currently present in the DataTestCase methods. Removing (or at least reducing) magic where possible would make the behavior easier to understand and explain.

In cases where small amounts of magic are useful, methods should be renamed to better reflect what's happening.

Illustrating the Problem

This "magic" version:

def test_active(self):
    self.assertDataSet('active', {'Y', 'N'})

...is roughly equivalent to:

def test_active(self):
    subject = self.subject.set('active')
    self.assertEqual(subject, {'Y', 'N'})

The magic version requires detailed knowledge about the method before a newcomer can guess what's happening. The latter example is more explicit and easier to reason about.

Having said this, the magic versions of DataTestCase's methods can save a lot of typing. So what I plan to do is:

Fully implement assertEqual() integration (see issue #7) as well as other standard unittest methods (assertGreater(), etc.).

Rename the existing methods to clearly denote that they run on the subject data (e.g., assertDataSum() → assertSubjectSum(), etc.).

enhancement
opened by shawnbrown 3
Unique Method

Hey Shawn, one of the problems you spoke about at PyCon 2016 was how to guarantee, efficiently for large sets of data, that all integers in a list are unique.
enhancement

opened by RyPeck 3
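A common single-pass approach to the uniqueness problem raised above is to track seen values in a set and report repeats as they occur. This is only a sketch of the general technique, not the implementation behind the `unique()` method listed in the 0.9.4 release notes; `find_duplicates()` is a hypothetical name.

```python
def find_duplicates(values):
    """Yield each value that was already seen earlier in the iterable."""
    seen = set()
    for value in values:
        if value in seen:
            yield value
        else:
            seen.add(value)


data = [1, 2, 3, 2, 4, 1]
duplicates = list(find_duplicates(data))
print(duplicates)  # [2, 1]
```

Because it is a generator, the function reads the input once and can stop early if the caller only needs to know whether any duplicate exists. Memory use still grows with the number of distinct values.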
Fix syntax of `python_requires`

`>=2.6.*` isn't valid syntax for python_requires (see PEP 440).

This was causing an alpha release of Poetry to fail to install this package. I think they're going to fix it in future releases, but regardless it'd be helpful if this syntax was fixed.

opened by ajhynes7 2
pytest_runtest_makereport crashes on test exceptions

If an exception is thrown within a test that uses the test_db_engine fixture, the pytest_runtest_makereport function crashes. The reason is that it uses Node's deprecated get_marker function, instead of the new get_closest_marker function. See details about this change in pytest here: https://docs.pytest.org/en/latest/mark.html#updating-code

opened by avshalomt2 2
Explore ways to optimize validation and allowance flow.
Once major pieces are in place, explore ways of optimizing the validation/allowance process. Look to implement the following possible improvements:

Use lazy evaluation in validate and assertion functions by returning generators instead of fully calculated containers.

Create optimized _validate...() functions for faster testing (short-circuit evaluation and Boolean return values) rather than using _compare...() functions in all cases.
opened by shawnbrown 2
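The two ideas in this issue, lazy evaluation and short-circuit checks, can be illustrated with plain Python. The function names below (`iter_invalid()` and `is_valid()`) are hypothetical and do not correspond to datatest's internal `_compare...()` or `_validate...()` functions.

```python
def iter_invalid(data, predicate):
    """Lazily yield the elements that fail the predicate."""
    return (x for x in data if not predicate(x))


def is_valid(data, predicate):
    """Return a Boolean, stopping at the first failing element."""
    return all(predicate(x) for x in data)


calls = []  # Records which elements the predicate was asked about.


def is_int(x):
    calls.append(x)
    return isinstance(x, int)


data = [1, 'a', 3, 'b']
ok = is_valid(data, is_int)   # Stops as soon as 'a' fails.
checked = len(calls)          # Only 1 and 'a' were examined.
invalid = list(iter_invalid(data, is_int))
print(ok, checked, invalid)   # False 2 ['a', 'b']
```

The Boolean check avoids building any container of differences, while the generator defers that work until the differences are needed for a report.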

Squint objects not handled properly when used as requirements.

Squint objects are not being evaluated properly by datatest.validate() function:

import datatest
import squint

# Create a Select object.
select = squint.Select([['A', 'B'], ['x', 1], ['y', 2], ['z', 3]])

# Compare data to itself--passes as expected.
datatest.validate(
	select({'A': {'B'}}),
	select({'A': {'B'}}).fetch(),  # <- Shouldn't be necessary.
)

# Compare data to itself--fails, unexpectedly.
datatest.validate(
	select({'A': {'B'}}),
	select({'A': {'B'}}),  # <- Not properly handled!
)

In the code above, the second call to datatest.validate() should pass but, instead, fails with the following message:

Traceback (most recent call last):
  File "<input>", line 3, in <module>
	select({'A': {'B'}}),  # <- Not properly handled!
  File "~/datatest-project/datatest/validation.py", line 291, in __call__
	raise err
datatest.ValidationError: does not satisfy mapping requirements (3 differences): {
	'x': [Invalid(1)],
	'y': [Invalid(2)],
	'z': [Invalid(3)],
}

bug

opened by shawnbrown 1

Selector.load_data() silently fails on missing file.
The following should raise an error:

>>> import datatest
>>> select = datatest.Selector()
>>> select = select.load_data('nonexistent_file.csv')
bug
opened by shawnbrown 1
How to validate Pandas data type "Int64"?

Pandas recently introduced IntegerArrays which allow integer types to also store a NaN-like value pandas.NA.

Is there a way to use datatest to validate that a pandas.DataFrame's column is of type Int64, i.e., that all values are of that type?

I tried df["mycolumn"].validate(pd.arrays.IntegerArray) and df["mycolumn"].validate(pd.Int64Dtype) to no avail.

opened by PanCakeConnaisseur 0
Understanding Pandas validation
Hello, apologies if this is the wrong place to ask this question.

I am stumped on how datatest's validation mechanism is passing the following example:

dt.validate(pd.DataFrame(), pd.DataFrame({"A": [1]}))

The documentation states:

For validation, DataFrame objects using the default index type are treated as sequences.

Shouldn't I be getting the same result as dt.validate([], [1])? What am I missing?
opened by schlich 1
Improve existing or create another Deviation-like difference

Hello @shawnbrown It would be nice to also show the actual value along with the deviation and the expected value. It would also be nice to be able to see the percentage deviation along with the absolute deviation. Thanks!

opened by a-chernov 0
Improve error message for @working_directory decorator
If working_directory() is used as a decorator but the developer forgets to call it with a path, the error message can be confusing because the function is passed in implicitly (via decorator handling):

>>> from datatest import working_directory
>>>
>>> @working_directory
... def foo():
...     return True
...
TypeError: stat: path should be string, bytes, os.PathLike or integer, not function

This misuse is easily detectable in the code and it would be good to improve the error message to help users understand their mistake.
opened by shawnbrown 0
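The misuse described in this issue can be detected by checking whether the "path" argument is actually a callable. The class below is a hypothetical re-implementation written for illustration only. It is not datatest's `working_directory`, and the wording of the error message is an assumption.

```python
import os
from contextlib import ContextDecorator


class working_directory(ContextDecorator):
    """Hypothetical sketch: temporarily change the current directory."""

    def __init__(self, path):
        if callable(path):
            # Used as a bare @working_directory, so the decorated
            # function arrived here in place of a path.
            raise TypeError(
                "expected a path but got a function; "
                "did you mean @working_directory('some/path')?")
        self._path = path
        self._original = None

    def __enter__(self):
        self._original = os.getcwd()
        os.chdir(self._path)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        os.chdir(self._original)  # Always restore the old directory.
        return False


try:
    @working_directory  # <- Misuse: no path given.
    def foo():
        return True
except TypeError as err:
    message = str(err)


@working_directory(os.getcwd())  # <- Correct use.
def bar():
    return os.getcwd()


result = bar()
print(message)
```

With a check like this, the mistake is reported when the decorator is applied rather than surfacing later as an unrelated `stat` error.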
NaT issue

Greetings, @shawnbrown

To be brief, my pd.Series looks like this:

Date
0          NaT
1          NaT
2          NaT
3   2010-12-31
4   2010-12-31
Name: Date, dtype: datetime64[ns]

The type of NaT is:

<class 'pandas._libs.tslibs.nattype.NaTType'>

When I use the following code:

with accepted(Extra(pd.NaT)):
    validate(data, requirement)

I found that the NaTs are not recognized. I tried many kinds of Extra and also tried using a function, but all failed.

I need your help here. Thanks for your work.

opened by Belightar 5
Investigate Support for DataFrame-Protocol

Keep an eye on wesm/dataframe-protocol#1 and see if it makes sense to change datatest's normalization to support a DataFrame-protocol instead of DataFrames specifically.
enhancement

opened by shawnbrown 0

Releases(0.11.1)

0.11.1(Jan 4, 2021)
Fixed validation, predicate, and difference handling of non-comparable objects.

Fixed bug in normalization of Queries from squint package.

Changed failure output to improve error reporting with pandas accessors.

Changed predicate failure message to quote code objects using backticks.

0.11.0(Dec 18, 2020)
Removed deprecated decorators: skip(), skipIf(), skipUnless() (use unittest.skip(), etc. instead).

Removed deprecated aliases Selector and ProxyGroup.

Removed the long-deprecated allowed interface.

Removed deprecated acceptances: "specific", "limit", etc.

Removed deprecated Select, Query, and Result API. Use squint instead:

https://pypi.org/project/squint/

Removed deprecated get_reader() function. Use get-reader instead:

https://pypi.org/project/get-reader/

0.10.0(Dec 17, 2020)
Fixed bug where ValidationErrors were crashing pytest-xdist workers.

Added tighter Pandas integration using Pandas' extension API.

After calling the new register_accessors() function, your existing DataFrame, Series, Index, and MultiIndex objects will have a validate() method that can be used instead of the validate() function:

import pandas as pd
import datatest as dt

dt.register_accessors()  # <- Activate Pandas integration.

df = pd.DataFrame(...)
df[['A', 'B']].validate((str, int))  # <- New accessor method.

Changed Pandas validation behavior:

DataFrame and Series: These objects are treated as sequences when they use a RangeIndex index (this is the default type assigned when no index is specified). And they are treated as dictionaries when they use an index of any other type--the index values become the dictionary keys.

Index and MultiIndex: These objects are treated as sequences.

Changed repr behavior of Deviation to make timedeltas more readable.

Added Predicate matching support for NumPy types np.character, np.integer, np.floating, and np.complexfloating.

Added improved NaN handling:

Added NaN support to accepted.keys(), accepted.args(), and validate.interval().

Improved existing NaN support for difference comparisons.

Added how-to documentation for NaN handling.

Added data handling support for squint.Select objects.

Added deprecation warnings for soon-to-be-removed functions and classes:

Added DeprecationWarning to get_reader function. This function is now available from the get-reader package on PyPI:

https://pypi.org/project/get-reader/

Added DeprecationWarning to Select, Query, and Result classes. These classes will be deprecated in the next release but are now available from the squint package on PyPI:

https://pypi.org/project/squint/

Changed validate.subset() and validate.superset() behavior:

The semantics are now inverted. This behavior was flipped to more closely match user expectations. The previous semantics were used because they reflect the internal structure of datatest more precisely. But these are implementation details, and they are not as important as having a more intuitive API.

Added a temporary warning when using the new subset() and superset() methods to alert users to the new behavior. This warning will be removed from future versions of datatest.

Added Python 3.9 and 3.10 testing and support.

Removed Python 3.1 testing and support. If you were still using this version of Python, please email me--this is a story I need to hear.


0.9.6(Jun 3, 2019)

Changed acceptance API to make it both less verbose and more expressive:

Consolidated specific-instance and class-based acceptances into a single interface.
Added a new accepted.tolerance() method that subsumes the behavior of accepted.deviation() by supporting Missing and Extra quantities in addition to Deviation objects.

Deprecated old methods:

Old Syntax	New Syntax
`accepted.specific(...)`	`accepted(...)`
`accepted.missing()`	`accepted(Missing)`
`accepted.extra()`	`accepted(Extra)`
NO EQUIVALENT	`accepted(CustomDifferenceClass)`
`accepted.deviation(...)`	`accepted.tolerance(...)`
`accepted.limit(...)`	`accepted.count(...)`
NO EQUIVALENT	`accepted.count(..., scope='group')`

Other methods--accepted.args(), accepted.keys(), etc.--remain unchanged.

Changed validation to generate Deviation objects for a broader definition of quantitative values (like datetime objects)--not just for subclasses of numbers.Number.
Changed handling for pandas.Series objects to treat them as sequences instead of mappings.
Added handling for DBAPI2 cursor objects to automatically unwrap single-value rows.
Removed acceptance classes from datatest namespace--these were inadvertently added in a previous version but were never part of the documented API. They can still be referenced via the acceptances module:

from datatest.acceptances import ...


0.9.5(May 1, 2019)
Changed difference objects to make them hashable (can now be used as set members or as dict keys).

Added __slots__ to difference objects to reduce memory consumption.

Changed name of Selector class to Select (Selector now deprecated).

Changed language and class names from allowed and allowance to accepted and acceptance to bring datatest more in line with manufacturing and engineering terminology. The existing allowed API is now deprecated.
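
To illustrate what hashable, slotted difference objects make possible, here is a minimal sketch. The class `SketchMissing` is a hypothetical stand-in written for this example; it is not datatest's implementation and only assumes that equality and hashing are based on the object's type and arguments.

```python
class SketchMissing(object):
    """Hypothetical stand-in for a difference object (not datatest's class)."""

    # Declaring __slots__ means instances carry no per-instance __dict__,
    # which is where the memory saving comes from.
    __slots__ = ('_args',)

    def __init__(self, *args):
        self._args = args

    def __eq__(self, other):
        return type(self) is type(other) and self._args == other._args

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        # Hash on the same values used for equality so that equal
        # objects land in the same set or dict bucket.
        return hash((type(self), self._args))

    def __repr__(self):
        args = ', '.join(repr(a) for a in self._args)
        return '{0}({1})'.format(type(self).__name__, args)


# Equal differences collapse to a single set member.
unique_diffs = {SketchMissing('x'), SketchMissing('x'), SketchMissing('y')}

# Differences can be used as dictionary keys.
notes = {SketchMissing('x'): 'known gap in source file'}

# Slotted instances have no __dict__ attribute.
has_instance_dict = hasattr(SketchMissing('x'), '__dict__')
```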

0.9.4(Apr 21, 2019)
Added Python 3.8 testing and support.

Added new validate methods (moved from how-to recipes into core module):

Added approx() method to require approximate numeric equality.

Added fuzzy() method to require strings by approximate match.

Added interval() method to require elements within a given interval.

Added set(), subset(), and superset() methods for explicit membership checking.

Added unique() method to require unique elements.

Added order() method to require elements by relative order.

Changed default sequence validation to check elements by index position rather than checking by relative order.

Added fuzzy-matching allowance to allow strings by approximate match.

Added Predicate class to formalize behavior--also provides inverse-matching with the inversion operator (~).
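
Inverse matching with the `~` operator can be sketched with Python's `__invert__` hook. The class `SketchPredicate` below is hypothetical and only wraps a callable; datatest's own Predicate class is not used here and may accept other kinds of matchers.

```python
class SketchPredicate(object):
    """Hypothetical callable wrapper that supports inversion with ~."""

    def __init__(self, func, inverted=False):
        self._func = func
        self._inverted = inverted

    def __call__(self, value):
        result = bool(self._func(value))
        # Flip the result only when this predicate is inverted.
        return (not result) if self._inverted else result

    def __invert__(self):
        # ~predicate returns a new predicate with the opposite sense.
        return SketchPredicate(self._func, inverted=not self._inverted)


is_lower = SketchPredicate(str.islower)
not_lower = ~is_lower

values = ['foo', 'BAR', 'baz']
lower_matches = [v for v in values if is_lower(v)]
inverse_matches = [v for v in values if not_lower(v)]
```

Returning a new object from `__invert__` keeps the original predicate unchanged, so both senses can be used side by side.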

Added new methods to Query class:

Added unwrap() to remove single-element containers and return their unwrapped contents.

Added starmap() to unpack grouped arguments when applying a function to elements.
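
The two behaviors described above can be approximated with the standard library. This sketch does not use the Query class; `unwrap_single` is a hypothetical helper, and `itertools.starmap` stands in for the idea of unpacking grouped arguments.

```python
from itertools import starmap


def unwrap_single(element):
    """Return the sole item of a one-element list or tuple.

    Anything else is returned unchanged. (Hypothetical helper,
    not part of datatest.)
    """
    if isinstance(element, (list, tuple)) and len(element) == 1:
        return element[0]
    return element


# Single-element rows are unwrapped; longer rows are left alone.
rows = [(1,), (2,), (3, 4)]
unwrapped = [unwrap_single(row) for row in rows]

# Each pair is unpacked into the function's two arguments.
pairs = [(2, 3), (4, 5)]
products = list(starmap(lambda a, b: a * b, pairs))
```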

Fixed improper use of assert statements by replacing them with appropriate conditional checks and error behavior.

Added requirement class hierarchy (using BaseRequirement). This gives users a cleaner way to implement custom validation behavior and makes the underlying codebase easier to maintain.

Changed name of ProxyGroup to RepeatingContainer.

Changed "How To" examples to use the new validation methods.

0.9.3(Jan 29, 2019)
Changed bundled pytest plugin to version 0.1.3:

This update adds testing and support for the latest versions of Pytest and Python (now tested using Pytest 3.3 to 4.1 and Python 2.7 to 3.7).

Changed handling for 'mandatory' marker to support older and newer Pytest versions.

0.9.2(Aug 8, 2018)
Improved data handling features and support for Python 3.7:

Changed Query class:

Added flatten() method to serialize dictionary results.

Added to_csv() method to quickly save results as a CSV file.

Changed reduce() method to accept initializer_factory as an optional argument.

Changed filter() method to support predicate matching.
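
One plausible reason to pass a factory rather than a fixed initial value to a reduction is that each group then receives its own fresh starting object. The sketch below assumes that motivation (the release note does not state it) and uses a hypothetical `reduce_groups` helper built on `functools.reduce`; it does not call the Query class.

```python
from functools import reduce


def reduce_groups(function, groups, initializer_factory=None):
    """Reduce each group's values separately.

    When initializer_factory is given, it is called once per group so
    that no mutable starting value is shared between groups.
    (Hypothetical helper, not part of datatest.)
    """
    result = {}
    for key, values in groups.items():
        if initializer_factory is None:
            result[key] = reduce(function, values)
        else:
            result[key] = reduce(function, values, initializer_factory())
    return result


def append_item(accumulated, item):
    """Append item to the accumulating list and return the list."""
    accumulated.append(item)
    return accumulated


groups = {'a': [3, 1], 'b': [2]}

# No initializer: the first value of each group seeds the reduction.
sums = reduce_groups(lambda x, y: x + y, groups)

# With a factory: every group starts from its own new list.
collected = reduce_groups(append_item, groups, initializer_factory=list)
```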

Added True and False as predicates to support "truth value testing" on arbitrary objects (to match on truthy or falsy).

Added ProxyGroup class for performing the same operations on groups of objects at the same time (a common need when testing against reference data).

Changed Selector class keyword filtering to support predicate matching.

Added handling to get_reader() to support datatest's Selector and Result objects.

Fixed get_reader() bug that prevented encoding-fallback recovery when reading from StringIO buffers in Python 2.

0.9.1(Jun 22, 2018)
Added improved docstrings and other documentation.

Changed bundled pytest plugin to version 0.1.2:

Added handling for a mandatory marker to support incremental testing (stops session early when a mandatory test fails).

Added --ignore-mandatory option to continue tests even when a mandatory test fails.

0.9.0(Apr 29, 2018)
Added bundled version of pytest plugin to base installation.

Added universal composability for all allowances (using UNION and INTERSECTION via "|" and "&" operators).

Added allowed factory class to simplify allowance imports.

Changed is_valid() to valid().

Changed ValidationError to display differences in sorted order.

Added Python 2 and 3 compatible get_reader() to quickly load csv.reader-like interface for Unicode CSV, MS Excel, pandas.DataFrame, DBF, etc.

Added formal order of operations for allowance resolution.

Added formal predicate object handling.

Added Sphinx-tabs style docs for clear separation of pytest and unittest style examples.

Changed DataSource to Selector, DataQuery to Query, and DataResult to Result.

0.8.3(Nov 26, 2017)
New module-level functions: validate() and is_valid().

DataQuery selections now default to a list type when no outer-container is specified.

New DataQuery.apply() method for group-wise function application.

DataSource.fieldnames attribute is now a tuple (was a list).

The ValidationError repr now prints a trailing comma with the last item (for ease of copy-and-paste workflow).

Revised sequence validation behavior provides more precise differences.

New truncation support for ValidationErrors with long lists of differences.

Excess differences in allowed_specific() definitions no longer trigger test failures.

New support for user-defined functions to narrow DataSource selections.

Better traceback hiding for pytest.

Fix bug in DataQuery.map() method--now converts set types into lists.

0.8.2(Jun 11, 2017)
Implement Boolean composition for allowed_specific() context manager.

Add proper __repr__() support to DataSource and DataQuery.

Make sure DataQuery fails early if bad "select" syntax is used or if unknown columns are selected.

Add __copy__() method to DataQuery.

Change parent class of differences so they no longer inherit from Exception (this confused their intended use).

Restructure documentation for ease of reference.

0.8.1(May 31, 2017)
Updated DataQuery select behavior to fail immediately when invalid syntax is used (rather than later when attempting to execute the query).

Improved error messages to better explain what went wrong.

0.8.0(May 31, 2017)
Replaces old assertion methods with a single, smarter assertValid() method.

DataQuery implements query optimization and uses a simpler and more expressive syntax.

Allowances and errors have been reworked to be more expressive.

Allowances are now composable with bit-wise "&" and "|" operators.

0.7.0.dev2(Aug 3, 2016)
Removes some of the internal magic and renames data assertions to more clearly indicate their intended use.

Restructures data allowances to provide more consistent parameters and more flexible usage.

Adds new method to assert unique values.

Adds full **fmtparams support for CSV handling.

Fixes comparison and allowance behavior for None vs. zero.

0.6.0.dev1(May 29, 2016)

First public release.