Soda SQL: Data testing, monitoring, and profiling for SQL-accessible data.

Overview


What does Soda SQL do?

Soda SQL allows you to:

  • Stop your pipeline when bad data is detected
  • Extract metrics and column profiles through efficient SQL
  • Take full control over metrics and queries through declarative config files

Why Soda SQL?

To protect the consumers of your data against silent data issues, it is best practice to profile and test your data:

  • as it lands in your warehouse,
  • after every important data processing step, and
  • right before consumption.

This way you prevent the delivery of bad data to downstream consumers. You will spend less time firefighting and gain a better reputation.

How does Soda SQL work?

Soda SQL is a Command Line Interface (CLI) and a Python library to measure and test your data using SQL.

As input, Soda SQL uses YAML configuration files that include:

  • SQL connection details
  • What metrics to compute
  • What tests to run on the measurements

Based on those configuration files, Soda SQL performs scans. A scan executes all measurements and runs all tests associated with one table; typically a scan runs after new data has arrived. All Soda SQL configuration files can be checked into your version control system as part of your pipeline code.
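
A scan can also be launched programmatically. Here is a minimal sketch based on the ScanBuilder API that appears in the issues further down this page (file paths are placeholders):

from sodasql.scan.scan_builder import ScanBuilder

# Point the builder at the warehouse and scan YAML files
# (paths here are placeholders).
scan_builder = ScanBuilder()
scan_builder.warehouse_yml_file = 'warehouse.yml'
scan_builder.scan_yml_file = 'tables/my_dataset.yml'

scan = scan_builder.build()
scan_result = scan.execute()

# Each TestResult carries a `passed` flag, so a pipeline step can fail fast.
failed = [tr for tr in scan_result.test_results if not tr.passed]
if failed:
    raise RuntimeError(f'{len(failed)} tests failed')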

Want to try Soda SQL? Head over to our '5 minute tutorial' and get started straight away!

"Show me the money"

Simple metrics and tests can be configured in YAML configuration files called scan.yml. An example of the contents of such a file:

metrics:
    - row_count
    - missing_count
    - missing_percentage
    - values_count
    - values_percentage
    - valid_count
    - valid_percentage
    - invalid_count
    - invalid_percentage
    - min
    - max
    - avg
    - sum
    - min_length
    - max_length
    - avg_length
columns:
    ID:
        metrics:
            - distinct
            - duplicate_count
        valid_format: uuid
        tests:
            - duplicate_count == 0
    CATEGORY:
        missing_values:
            - N/A
            - No category
        tests:
            - missing_percentage < 3
    SIZE:
        tests:
            - max - min < 20

Metrics aren't limited to the ones built into Soda SQL. You can define your own custom SQL metrics in a simple YAML file:

sql: |
    SELECT sum(volume) as total_volume_us
    FROM CUSTOMER_TRANSACTIONS
    WHERE country = 'US'
tests:
    - total_volume_us > 5000

Based on these configuration files, Soda SQL scans your data each time new data arrives, like this:

$ soda scan ./soda/metrics my_warehouse my_dataset
Soda 1.0 scan for dataset my_dataset on prod my_warehouse
  | SELECT column_name, data_type, is_nullable
  | FROM information_schema.columns
  | WHERE lower(table_name) = 'customers'
  |   AND table_catalog = 'datasource.database'
  |   AND table_schema = 'datasource.schema'
  - 0.256 seconds
Found 4 columns: ID, NAME, CREATE_DATE, COUNTRY
  | SELECT
  |  COUNT(*),
  |  COUNT(CASE WHEN ID IS NULL THEN 1 END),
  |  COUNT(CASE WHEN ID IS NOT NULL AND ID regexp '\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b' THEN 1 END),
  |  MIN(LENGTH(ID)),
  |  AVG(LENGTH(ID)),
  |  MAX(LENGTH(ID))
  | FROM customers
  - 0.557 seconds
row_count : 23543
missing   : 23
invalid   : 0
min_length: 9
avg_length: 9
max_length: 9

...more queries...

47 measurements computed
23 tests executed
All is good. No tests failed. Scan took 23.307 seconds

The next step is to add Soda SQL scans to your favorite data pipeline orchestration solution (a minimal Airflow sketch follows the list), such as:

  • Airflow
  • AWS Glue
  • Prefect
  • Dagster
  • Fivetran
  • Matillion
  • Luigi
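
A minimal Airflow sketch (DAG and task names are illustrative; the scan command mirrors the CLI example above):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id='daily_load',
         start_date=datetime(2021, 1, 1),
         schedule_interval='@daily') as dag:
    ingest = BashOperator(task_id='ingest',
                          bash_command='echo "load new data"')
    # soda scan returns a non-zero exit code when tests fail,
    # which fails this task and stops downstream tasks.
    soda_scan = BashOperator(task_id='soda_scan',
                             bash_command='soda scan ./soda/metrics my_warehouse my_dataset')
    publish = BashOperator(task_id='publish',
                           bash_command='echo "publish to consumers"')
    ingest >> soda_scan >> publish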

If you like the goals of this project, encourage us! Star sodadata/soda-sql on GitHub.

Next, head over to our '5 minute tutorial' and get your first project going!

Comments
  • Snowflake dialect implementation issue

    Hi guys,

    I've tried using SodaSQL with Snowflake, but I've had an issue with soda init.

    First, I've run soda create ., which created the default warehouse.yml. I modified it to look like the following:

    name: .
    connection:
      type: snowflake
      username: MYUSERNAME
      password: env_var(SNOWFLAKE_PWD)
      account: MYACCOUNTNAME
      database: MYDB
      warehouse: MYWH
      schema: MYSCHEMA
    

    In ~/.soda/env_vars.yml I have:

    .:
      SNOWFLAKE_PWD: '<my secret password>'
    

    When I run soda init . I get the following error:

    /Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/pandas/compat/__init__.py:120: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
      warnings.warn(msg)
    /Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/snowflake/connector/options.py:78: UserWarning: You have an incompatible version of 'pyarrow' installed (2.0.0), please install a version that adheres to: 'pyarrow<0.18.0,>=0.17.0; extra == "pandas"'
      warn_incompatible_dep('pyarrow', _installed_pyarrow_version.version, _expected_pyarrow_version)
      | Querying warehouse for tables
      | Exception: TODO override and implement this abstract method
    Traceback (most recent call last):
      File "/Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/sodasql/cli/cli.py", line 196, in init
        warehouse.dialect.sql_tables_metadata_query())
      File "/Users/alessandro.lollo/.pyenv/versions/3.8.3/envs/playground/lib/python3.8/site-packages/sodasql/scan/dialect.py", line 98, in sql_tables_metadata_query
        raise RuntimeError('TODO override and implement this abstract method')
    RuntimeError: TODO override and implement this abstract method
    

    Apart from the pyarrow and lzma warnings, it seems that the sql_tables_metadata_query method is not implemented for Snowflake. I have not looked into the SodaSQL code, though.
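
    For what it's worth, a hypothetical override could look roughly like the sketch below (I have not verified the real signature or query against the SodaSQL code; the query body and the schema attribute are assumptions):

    from sodasql.scan.dialect import Dialect

    class SnowflakeDialect(Dialect):
        def sql_tables_metadata_query(self):
            # Hypothetical implementation: list the tables in the configured schema.
            return (f"SELECT table_name "
                    f"FROM information_schema.tables "
                    f"WHERE lower(table_schema) = '{self.schema.lower()}'")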

    opened by AlessandroLollo 11
  • [Telemetry] Add anonymous usage and profiling statistics collection

    Goal

    In order to better understand the usage, adoption, and performance of our tool, we should collect usage and performance statistics. What we want to achieve with those are the following goals:

    • Pro-actively track bugs and find and fix errors to get ahead of issues faster
    • Manage performance to keep our tools fast and secure
    • Troubleshoot issues, speed up resolution, and minimize any downtime for users
    • Understand how users use our tools to improve user flow and experience

    Those statistics must be strictly anonymous, will not pertain to any private information, and will be limited to tracking events that emanate from the tool itself.

    What do we consider private and will never track?

    • Personally identifiable information (names, email addresses, credentials, geolocation, IP, etc.)
    • Organizations' or Users' Proprietary information (e.g. name of checks, datasets, test data, or test results content)

    Target Release: v2.2.0

    Requirements

    • Users should be able to opt-out at any point via an easy to configure switch (e.g. YAML variable in config.yaml)
    • Documentation pages are updated with reference to this GitHub issue along with goal and rationale digest and how to opt-out of the tool.
    • Telemetry events should not significantly hurt the performance of the tool (i.e. we need to make sure we use an async or non-blocking framework/API for that).

    Proposed Solution

    Event Collection Framework

    Two generally accepted frameworks come to mind (a small illustration follows the list):

    • Open Telemetry: built on open standards, easy to configure, suitable for tracing of performance (traces and spans) with high granularity.

    • Snowplow: open source option, used by a lot of organizations and reliable; a little heavier setup, and only made for behavioral events, so tracing around performance would have to be shoehorned into their data models.
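
    As a small illustration of the Open Telemetry option (setup and attribute names here are illustrative only, mirroring the schema ideas below):

    import platform

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # Illustrative setup: a real deployment would export spans to a collector
    # endpoint instead of the console.
    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(
        BatchSpanProcessor(ConsoleSpanExporter()))
    tracer = trace.get_tracer("soda-sql")

    # One span per CLI invocation, carrying only anonymous attributes.
    with tracer.start_as_current_span("scan") as span:
        span.set_attribute("architecture", platform.machine())
        span.set_attribute("operating_system", platform.system())
        span.set_attribute("python_version", platform.python_version())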

    Event Schema & Attributes

    Names and implementations are just ideas. Let's challenge those in this issue and we can update this once we're set.

    • user_cookie_id: we could generate a UUID and store it in ~/.soda/config.yaml or in a different file in that config folder.
    • command(_ran): we could get it as a trace name or just add it as an attribute in the attributes key, value object. We should be able to capture this from click or API method name.
    • command_options: these should be flags used if we're talking about a CLI invocation. Can be captured from click but we might want to find a more internal way of capturing it from method signatures of class attributes so that we are ready for API-driven usage in the future.
    • version: which version of soda_sql is this coming from: https://github.com/sodadata/soda-sql/blob/main/core/sodasql/version.py or whatever else makes sense.
    • datasource_type: other names could be dialect_id but it lends itself less to soda-spark or other data source types so maybe datasource_type is fine.
    • datasource_id: given the importance of knowing how many projects and teams are using the tool, we should add a datasource_id which should be a hash of the connection parameter that best uniquely identifies a group of people working on the same project.
      • Postgres, Redshift, Spark, Presto: host
      • Snowflake: account
      • BigQuery: project
      • etc. to be determined on a dialect/supported data source basis.
      • see a similar rationale from dbt

    ~~- run_status: the alternative name could be invocation_status and would allow us to know if a scan, analyze etc. executed successfully or not (exception types should be logged as well).~~
    ~~- metrics_count: number of metrics collected/derived in a scan.~~
    ~~- metrics_invoked: which metrics are people using. NOTE: we should not log custom metrics derivation code (i.e. the SQL query of a custom metric or any formula, as this is considered proprietary information). When users use custom_sql we should just know that they used a custom metric.~~
    ~~- tests_count: count of tests evaluated.~~
    ~~- tests_invoked: same reasoning as the metrics list; if the tests are built-in and do not reveal any proprietary logic we should have a list, if not let's leave it out.~~

    • sql_metrics_count: see comment about metrics_count. SQL metrics contain proprietary business logic so we absolutely do not want to track the SQL query itself but just the fact that a custom metric was used.❓
    • historic_metrics_count
    • architecture: we can use the platform library or the os library to get the architecture we're running on
    • operating_system: see above.
    • python_version
    • python_implementation
    • invocation_start: span start/end in the OT framework
    • invocation_end: same as above.
    soda-sql 
    opened by bastienboutonnet 11
  • Update setup.py with git based build/release tagging

    Currently, building the local version uses a hardcoded __dev__ as the version; it would be better to use the git SHA/ref as the version (a rough sketch follows the checklist).

    • [ ] update setup.py to pickup the git ref and set SODA_SQL_VERSION
    • [ ] Update documentation on how to build/install locally
    • [ ] Make sure that CI/CD uses the same process
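
    A rough sketch of the idea (treating SODA_SQL_VERSION as an environment variable that CI/CD can set is an assumption):

    import os
    import subprocess

    def get_version() -> str:
        # Prefer an explicitly provided version (e.g. set by CI/CD), then the
        # current git SHA, then fall back to the hardcoded dev marker.
        version = os.environ.get('SODA_SQL_VERSION')
        if version:
            return version
        try:
            return subprocess.check_output(
                ['git', 'rev-parse', '--short', 'HEAD'], text=True).strip()
        except (OSError, subprocess.CalledProcessError):
            return 'dev'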
    opened by vijaykiran 9
  • Split soda-sql in separate modules

    Folder structure to be discussed

    Libraries use dashes in their names, not underscores. Folder names, as indicated below, all use underscores.

    + src
        + sodasql (why was this again?)
            + setup.py
        + sodasql_cloud
            + setup.py
            + sodasql_cloud
            + tests
        + sodasql_scan
            + setup.py
            + sodasql_scan
            + tests
        + sodasql_cli
            + setup.py
            + sodasql_cli
            + tests
        + sodasql_airflow
        + connectors
            + sodasql_postgres
                + setup.py
                + sodasql_postgres
                + tests
            + sodasql_snowflake
            + sodasql_athena
            + sodasql_redshift
            + sodasql_bigquery
            + sodasql_spark
    

    Also, before starting I would like to understand how we'll deal with versioning the individual libs and how this impacts the release process.

    TODO: investigate whether dialects can be extracted. Passing data from the core lib to dialects is no problem, but the dialects should not invoke methods on core internals; that would cause circular dependencies. To be investigated before starting this.

    enhancement soda-sql 
    opened by tombaeyens 9
  • soda-sql doesn't show the test names of the tests executed during a scan.

    Hi team,

    Soda SQL doesn't show the test names of the tests executed during a scan. It does so for measurements, though. Please see the output below:

    (.venv) [email protected] soda-sql-project % soda scan ./snowflake-soda-warehouse ORDERS
    | 2.0.0b5
    | Scanning ORDERS in ./snowflake-soda-warehouse ...
    | sql_metric_path scan.yml does not exist
    | Executing SQL query: (...)
    | missing_count(O_ORDERDATE) = 0
    | values_percentage(O_ORDERDATE) = 100.0
    | 62 measurements computed
    | 1 tests executed
    | All is good. No tests failed.

    It would be handy to show the test names and results instead of just showing x tests executed and x failed. This would allow the user to immediately see what's wrong and act upon it accordingly.

    Thanks!

    good first issue soda-sql 
    opened by mathissedestrooper 8
  • Allow users to ingest dbt test results into Soda Cloud via soda-sql

    Context

    Users who use dbt have spent time crafting tests in their transformations. Because we don't want users to repeat themselves in order to benefit from Soda Cloud's visibility and incident management, we plan to:

    • Ingest dbt test results into Soda Cloud.
    • Provide clear indications of whether the tests are maintained in soda-sql or ingested from dbt
    • Allow collaboration and reporting to be done on those results just as if they were maintained from soda-sql

    User Flow

    Users should be able to call a command such as soda ingest --tool dbt, provide soda-sql with paths to the dbt artifacts, and have their test results show up in the Soda Cloud UI. We will have to be clear in our documentation that ingestion should be orchestrated once dbt has been invoked and the transformations have completed. Soda-sql (at least in this first version) will not integrate with the core dbt API. This means it will not orchestrate a dbt run or be called by dbt. Those scenarios may be revisited in further iterations of our integration.

    We may also consider creating a new soda command such as soda capture which would run soda scan and follow it with soda ingest, or a flag on soda scan such as --ingest dbt. Up for discussion and TBD once the minimal isolated approach is working. This is sugar coating at this point.

    Overall Approach

    In this v1, we want to keep it simple. dbt generates two valuable artifacts: manifest.json and run_results.json. When it comes to test results, we can parse the results from the run_results.json and use the info in the manifest.json to gain knowledge about the table (dataset in Soda's language) that it maps on to.

    Once we have parsed all the info we need (see below) we can compose a TestObject similar (but not exactly) to the soda-sql TestObject that is then exposed to the Soda Cloud backend API. The UI will show the ingested tests on the dataset view, those tests will be clearly identified as external and display a dbt logo or whatever UI choice makes sense.
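
    A rough sketch of that parsing step (key names follow the dbt artifact schemas; the TestObject shape is purely illustrative):

    import json

    # Load the two dbt artifacts produced by a completed run.
    with open('target/run_results.json') as f:
        run_results = json.load(f)
    with open('target/manifest.json') as f:
        manifest = json.load(f)

    for result in run_results['results']:
        node = manifest['nodes'].get(result['unique_id'], {})
        if node.get('resource_type') != 'test':
            continue  # only ingest test results
        # Illustrative stand-in for the TestObject sent to Soda Cloud.
        test_object = {
            'test_result': result['status'],  # e.g. 'pass' or 'fail'
            'tool_name': 'dbt',
            'database': node.get('database'),
            'schema': node.get('schema'),
            'compiled_sql': node.get('compiled_sql'),  # location varies by dbt version
        }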

    Specific Requirements

    We will want to parse and represent the following information from dbt test results:

    • test_run_at: timestamp of when the test was run.
    • test_result: whether the test is passing or failing.
    • severity: if this info is available (I know it is for freshness but maybe not for all).
    • info that allows us to generate a fully qualified table name:
      • database
      • schema
      • table
    • tool_name: in this case it would be dbt (this information will be used by the Soda Cloud UI to display the test provenance and govern available attributes and details)
    • compiled_sql: dbt should give us access to the compiled sql snippet that it ran for the test. We may want to parse this as we can show it in the Soda Cloud UI and it is valuable information that users can use as a starting point for investigation and issue resolution. If users have enabled store_failures we should be able to obtain the select statement where they can visualise those failures. If this is possible we could consider adding those to the requirements. More info on this can be found here: https://docs.getdbt.com/docs/building-a-dbt-project/tests#storing-test-failures

    Conscious Limitations

    • v1 will only support "generic tests": this means dbt builtins as well as the ones implemented via macros and added to models or columns of those models.
    • v1 will not support "bespoke tests" unless they follow the best practice of {{ ref() }} in their declaration, in which case the mapping should be available in a structured way just like any other test (manifest.json and run_results.json)
    • v1 will not attempt to capture results from store_failures (this may be something we offer as a stretch goal, as it fits our failed_rows model; see the last point of Specific Requirements)
    • v1 will be limited to ingesting results from dbt-core (OSS). A future iteration will know how to talk to the dbt Cloud API to leverage similar information.
    soda-sql 
    opened by bastienboutonnet 8
  • Auto-complete soda CLI commands

    Is your feature request related to a problem? Please describe. I'm constantly trying to auto-complete soda commands by pressing "tab".

    Describe the solution you'd like Auto-complete soda commands.

    Additional context This is definitely not a crucial feature request but could be a nice-to-have in the future 😃

    Soda SQL Version: Soda CLI version 2.0.0b21

    enhancement soda-sql 
    opened by mathissedestrooper 7
  • BigQuery documentation is misleading

    Describe the bug Our documentation currently says:

    name: my_bigquery_project
    connection:
        type: bigquery
        account_info: <PATH TO YOUR BIGQUERY ACCOUNT INFO JSON FILE>
        dataset: sodasql
    ...
    

    but our code expects

    1. account_info_json, and
    2. that property should contain the JSON object itself, not the location of it

    OS:
    Python Version:
    Soda SQL Version: main
    Warehouse Type: BigQuery

    bug good first issue soda-sql 
    opened by milanaleksic 7
  • ImportError when running soda from CLI

    Describe the bug When running soda from the CLI I get the following traceback:

    Traceback (most recent call last):
      File "/usr/lib/command-not-found", line 28, in <module>
        from CommandNotFound import CommandNotFound
      File "/usr/lib/python3/dist-packages/CommandNotFound/CommandNotFound.py", line 19, in <module>
        from CommandNotFound.db.db import SqliteDatabase
      File "/usr/lib/python3/dist-packages/CommandNotFound/db/db.py", line 5, in <module>
        import apt_pkg
    ImportError: libapt-pkg.so.6.0: cannot open shared object file: No such file or directory
    

    To Reproduce Steps to reproduce the behavior:

    1. sudo apt-get install g++ unixodbc-dev python3-dev libssl-dev libffi-dev
    2. sudo apt-get install build-essential cargo
    3. pip3 install soda-sql-postgresql
    4. soda

    Context Running Vagrant+Virtualbox with the attached vagrantfile Vagrantfile.zip

    OS: Ubuntu 18.04.5 LTS
    Python Version: 3.8.11
    Soda SQL Version: unable to get info due to error. Should be current as of 07/08/21
    Warehouse Type: PostgreSQL

    bug soda-sql 
    opened by mando222 7
  • Issue with the pyodbc dependency when installing on a Mac with an M1 processor

    Describe the bug There is an issue when installing on a Mac with the M1 processor, and it seems to be related to the pyodbc dependency. It fails with this error message:

    clang: error: linker command failed with exit code 1 (use -v to see invocation)
    error: command '/usr/bin/clang++' failed with exit code 1

    The issue with pyodbc is described here: https://github.com/mkleehammer/pyodbc/issues/846

    I have tested the workarounds mentioned but I am still unable to get past the pyodbc issue.

    To Reproduce Steps to reproduce the behavior:

    1. Just run: pip3 install soda-sql

    Context Include your scan.yml or warehouse.yml when relevant

    OS: macOS Big Sur 11.2.3
    Python Version: 3.9.2
    Soda SQL Version: 2.0.0b25
    Warehouse Type:

    bug soda-sql 
    opened by albinkjellin 6
  • [bug] Analyze does not push down the LIMIT clause and results in a full scan

    Describe the bug In the analyze phase, the DatasetAnalyzer runs some count queries with a LIMIT clause to avoid a full scan. But when I check the query plan, I see that the limit is applied after the results are calculated, which causes a big performance issue for large tables. The LIMIT should be applied before executing the count/sum operators.

    To Reproduce Steps to reproduce the behavior:

    1. run soda analyze
    2. Run explain from a sql client for the query generated by the analyzer.

    This is the current plan: (for demodata dataset)

    Limit  (cost=2.30..2.31 rows=1 width=16)
    --
    ->  Aggregate  (cost=2.30..2.31 rows=1 width=16)
    ->  Seq Scan on demodata  (cost=0.00..1.65 rows=65 width=13)
    

    It should be:

    Aggregate  (cost=0.05..0.06 rows=1 width=16)
    --
    ->  Limit  (cost=0.00..0.03 rows=1 width=1569)
    ->  Seq Scan on demodata  (cost=0.00..1.65 rows=65 width=1569)
    
    

    One way of fixing the problem is changing the FROM demodata limit 1000 to FROM (select * from demodata limit 1) T;
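
    A tiny sketch of that rewrite (the helper name is hypothetical):

    def sample_aggregation_query(table: str, limit: int) -> str:
        # Apply the LIMIT in a subquery so the scan is bounded *before*
        # the aggregation runs, instead of after it.
        return (f"SELECT COUNT(*) "
                f"FROM (SELECT * FROM {table} LIMIT {limit}) T")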

    Context Include your scan.yml or warehouse.yml when relevant

    OS:
    Python Version:
    Soda SQL Version: 2.0.0b15
    Warehouse Type: Redshift + PostgreSQL

    bug soda-sql 
    opened by mmigdiso 6
  • Incompatible with `Airflow` due to `MarkupSafe` conflict

    Describe the bug soda-core:2.2.2 is incompatible with Airflow:2.4.1 due to a MarkupSafe version conflict:

    ... soda-sql-bigquery (>=2.1.6,<3.0.0) requires markupsafe (2.0.1).
          And because apache-airflow (2.4.1) requires MarkupSafe (>=2.1.1) (2), apache-airflow (2.4.1) is incompatible with soda-sql-bigquery (>=2.1.6,<3.0.0)
    

    To Reproduce Steps to reproduce the behavior:

    1. Create pyproject.toml with dependencies on:
    • airflow = "2.4.1"
    • soda-sql-bigquery = "^2.1.6"
    • any other soda-sql-xxx with the same version
    2. Run poetry update soda-sql-bigquery ...
    3. Observe the conflict

    OS: macOS
    Python Version: 3.9
    Soda SQL Version: 2.2.2

    bug soda-sql 
    opened by KulykDmytro 0
  • #208: fix soda analyze error on mysql

    This is for fixing the errors in #208.

    The error happens when there is a column name that is a MySQL keyword in lower case, for example a column named 'key'.

    After this fix, the issue is resolved for me.

    opened by hgbink 1
  • Soda Core | checks.yml for all Tables in Database

    Goal: run checks.yml on all Tables in Database, implicitly / dynamically (not naming 100s of Tables).

    Following Soda's quick start, I've completed sections:

    • Install Soda Core
    • Connect Soda Core to a data source - configuration.yml

    Now I'm following Write a check and run a scan - checks.yml.


    Problem

    However, the documentation only gives examples that check one Table at a time.

    4 Checks

    1. Sum of Tables (in Database)
    2. Sum of Columns (across all Tables, in Database)
    3. Sum of Tables' descriptions exist
    4. Sum of Columns' descriptions exist

    Queries return a COUNT().


    So far, checks.yml:

    # checks for MY_DATABASE:
      sql_metrics:
      name: num_tables, num_columns
        sum_tables query: |
          SELECT COUNT(*)
          FROM information_schema.tables
          WHERE table_schema = '*';
        sum_columns query: |
          SELECT COUNT(*)
          FROM information_schema.columns
          WHERE table_name = '*';
        sum_tables_descriptions query: |
          -- SQL
        sum_columns_descriptions query: |
          -- SQL
    
    bug soda-sql 
    opened by danielbellhv 2
  • Applying a filter to a row_count 'same as' comparison aggregation

    Describe the bug

    Applying a filter to a row_count check only applies to the metric, not the external column we're evaluating against.

    I can see why this is expected behaviour, but I would like a way to optionally apply the filter to the comparison aggregation.

    To Reproduce

    Steps to reproduce the behavior:

    1. Create a new test in scan.yml
    filter foo [hourly]:
      where: from_iso8601_timestamp(time) between DATE_ADD('hour', -1, NOW()) AND NOW()
      
    checks for foo [hourly]:
      - row_count same as bar
    
    2. Run soda scan with the -V verbose flag

    3. Observe that the "otherRowcount" SQL doesn't include the filter and so counts all rows:

    Soda Core 3.0.4
    Reading configuration file "configuration.yml"
    Reading SodaCL file "checks_foo.yml"
    Scan execution starts
    Query datalake_dev.foo[hourly].aggregation[0]:
    SELECT 
      COUNT(*) 
    FROM datalake.foo
    WHERE from_iso8601_timestamp(time) between DATE_ADD('hour', -1, NOW()) AND NOW()
    Query datalake_dev.bar.aggregation[0]:
    SELECT 
      COUNT(*) 
    FROM datalake.bar
    Scan summary:
    2/2 queries OK
      datalake_dev.foo[hourly].aggregation[0] [OK] 0:00:01.819557
      datalake_dev.bar.aggregation[0] [OK] 0:00:01.262278
    1/1 check FAILED: 
        foo [hourly] in datalake_dev
          row_count same as bar [FAILED]
            value: -12047
            rowCount: 9
            otherRowCount: 12056
    Oops! 1 failures. 0 warnings. 0 errors. 0 pass.
    
    
    Context As above

    OS: Mac
    Python Version: 3.8.9
    Soda SQL Version: 3.0.4
    Warehouse Type: AWS Athena
    
    bug soda-sql 
    opened by paulskipprhudson 2
  • test error on scanning mysql table

    Describe the bug I ran a test against a MySQL table and this error occurred:

    [TestResult(test=Test(id='{"expression":"row_count > 0"}', title='test(row_count > 0)', expression='row_count > 0', metrics=['row_count'], column=None, source='soda-sql'), passed=False, skipped=False, values=None, error=NameError("name 'row_count' is not defined"), group_values=None), TestResult(test=Test(id='{"column":"bus_no","expression":"invalid_percentage <= 20"}', title='column(bus_no) test(invalid_percentage <= 20)', expression='invalid_percentage <= 20', metrics=['invalid_percentage'], column='bus_no', source='soda-sql'), passed=False, skipped=False, values=None, error=NameError("name 'invalid_percentage' is not defined"), group_values=None), TestResult(test=Test(id='{"column":"incident_number","expression":"invalid_percentage == 0"}', title='column(incident_number) test(invalid_percentage == 0)', expression='invalid_percentage == 0', metrics=['invalid_percentage'], column='incident_number', source='soda-sql'), passed=False, skipped=False, values=None, error=NameError("name 'invalid_percentage' is not defined"), group_values=None), TestResult(test=Test(id='{"column":"incident_number","expression":"missing_count == 0"}', title='column(incident_number) test(missing_count == 0)', expression='missing_count == 0', metrics=['missing_count'], column='incident_number', source='soda-sql'), passed=False, skipped=False, values=None, error=NameError("name 'missing_count' is not defined"), group_values=None), TestResult(test=Test(id='{"column":"school_year","expression":"invalid_percentage == 0"}', title='column(school_year) test(invalid_percentage == 0)', expression='invalid_percentage == 0', metrics=['invalid_percentage'], column='school_year', source='soda-sql'), passed=False, skipped=False, values=None, error=NameError("name 'invalid_percentage' is not defined"), group_values=None), TestResult(test=Test(id='{"column":"schools_serviced","expression":"invalid_percentage <= 15"}', title='column(schools_serviced) test(invalid_percentage <= 15)', expression='invalid_percentage <= 15', metrics=['invalid_percentage'], column='schools_serviced', source='soda-sql'), passed=False, skipped=False, values=None, error=NameError("name 'invalid_percentage' is not defined"), group_values=None)]
    Test error for "row_count > 0": name 'row_count' is not defined
    Test error for "invalid_percentage <= 20": name 'invalid_percentage' is not defined
    Test error for "invalid_percentage == 0": name 'invalid_percentage' is not defined
    Test error for "missing_count == 0": name 'missing_count' is not defined
    Test error for "invalid_percentage == 0": name 'invalid_percentage' is not defined
    Test error for "invalid_percentage <= 15": name 'invalid_percentage' is not defined
    

    To Reproduce I ran this Python code:

    from sodasql.scan.scan_builder import ScanBuilder
    
    
    scan_builder = ScanBuilder()
    scan_builder.scan_yml_file = 'tables/breakdowns.yml'
    scan_builder.warehouse_yml_file = 'warehouse.yml'
    scan = scan_builder.build()
    scan_result = scan.execute()
    
    print(scan_result.test_results)
    

    And if I run this command:

     soda scan warehouse.yml tables/breakdowns.yml
    

    I get this:

     | 2.2.1
      | Scanning tables/breakdowns.yml ...
      | There is no value specified for valid_values for column incident_number
      | There is no value specified for valid_min for column incident_number
      | There is no value specified for valid_max for column incident_number
      | There is no value specified for valid_values for column school_year
      | There is no value specified for valid_min for column school_year
      | There is no value specified for valid_max for column school_year
      | There is no value specified for valid_values for column bus_no
      | There is no value specified for valid_min for column bus_no
      | There is no value specified for valid_max for column bus_no
      | There is no value specified for valid_values for column schools_serviced
      | There is no value specified for valid_min for column schools_serviced
      | There is no value specified for valid_max for column schools_serviced
      | # _do_auth(): user: mysql
      | # _do_auth(): self._auth_plugin: 
      | # _do_auth(): user: mysql
      | # _do_auth(): password: mysql
      | new_auth_plugin: caching_sha2_password
      | No Soda Cloud account configured
      | Executing SQL query: 
    SELECT column_name, data_type, is_nullable 
    FROM information_schema.columns 
    WHERE lower(table_name) = 'breakdowns' 
      AND table_schema = 'new_york'
      | SQL took 0:00:00.007393
      |   boro (text) 
      |   breakdown_or_running_late (text) 
      |   bus_company_name (text) 
      |   bus_no (text) 
      |   busbreakdown_id (int) 
      |   created_on (text) 
      |   has_contractor_notified_parents (text) 
      |   has_contractor_notified_schools (text) 
      |   have_you_alerted_opt (text) 
      |   how_long_delayed (text) 
      |   incident_number (int) 
      |   informed_on (text) 
      |   last_updated_on (text) 
      |   number_of_students_on_the_bus (int) 
      |   occured_on (text) 
      |   reason (text) 
      |   route_number (text) 
      |   run_type (text) 
      |   school_age_or_prek (text) 
      |   school_year (text) 
      |   schools_serviced (text) 
      | 21 columns:
      | Query measurement: schema = [{'name': 'boro', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'breakdown_or_running_late', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'bus_company_name', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'bus_no', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'number', 'semanticType': 'number'}, {'name': 'busbreakdown_id', 'type': 'int', 'dataType': 'int', 'nullable': True, 'logicalType': 'number', 'semanticType': 'number'}, {'name': 'created_on', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'has_contractor_notified_parents', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'has_contractor_notified_schools', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'have_you_alerted_opt', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'how_long_delayed', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'incident_number', 'type': 'int', 'dataType': 'int', 'nullable': True, 'logicalType': 'number', 'semanticType': 'number'}, {'name': 'informed_on', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'last_updated_on', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'number_of_students_on_the_bus', 'type': 'int', 'dataType': 'int', 'nullable': True, 'logicalType': 'number', 'semanticType': 'number'}, {'name': 'occured_on', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'reason', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'route_number', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'run_type', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'school_age_or_prek', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'text', 'semanticType': 'text'}, {'name': 'school_year', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'time', 'semanticType': 'time'}, {'name': 'schools_serviced', 'type': 'text', 'dataType': 'text', 'nullable': True, 'logicalType': 'number', 'semanticType': 'number'}]
      | Exception during aggregation query
    Traceback (most recent call last):
      File "/Users/mkoloev/Desktop/soda_projects/soda_mysql_2/venv/lib/python3.9/site-packages/sodasql/scan/scan.py", line 254, in _query_aggregations
        fields.append(dialect.sql_expr_count_conditional(scan_column.non_missing_condition, column_name))
    TypeError: sql_expr_count_conditional() takes 2 positional arguments but 3 were given
      | Test error for "row_count > 0": name 'row_count' is not defined
      | Test error for "invalid_percentage <= 20": name 'invalid_percentage' is not defined
      | Test error for "invalid_percentage == 0": name 'invalid_percentage' is not defined
      | Test error for "missing_count == 0": name 'missing_count' is not defined
      | Test error for "invalid_percentage == 0": name 'invalid_percentage' is not defined
      | Test error for "invalid_percentage <= 15": name 'invalid_percentage' is not defined
      | Executed 1 queries in 0:00:00.021183
      | Scan summary ------
      | 1 measurements computed
      | 6 tests executed
      | 6 of 6 tests failed:
      |   Test test(row_count > 0) failed with measurements null
      |   Test column(bus_no) test(invalid_percentage <= 20) failed with measurements null
      |   Test column(incident_number) test(invalid_percentage == 0) failed with measurements null
      |   Test column(incident_number) test(missing_count == 0) failed with measurements null
      |   Test column(school_year) test(invalid_percentage == 0) failed with measurements null
      |   Test column(schools_serviced) test(invalid_percentage <= 15) failed with measurements null
      | Errors occurred!
      |   [error] Exception during aggregation query
      |   [test_execution_error] Test "row_count > 0" failed
      |   [test_execution_error] Test "invalid_percentage <= 20" failed
      |   [test_execution_error] Test "invalid_percentage == 0" failed
      |   [test_execution_error] Test "missing_count == 0" failed
      |   [test_execution_error] Test "invalid_percentage == 0" failed
      |   [test_execution_error] Test "invalid_percentage <= 15" failed
      | Exiting with code 1
      | Starting new HTTPS connection (1): collect.soda.io:443
      | https://collect.soda.io:443 "POST /v1/traces HTTP/1.1" 200 0
    
    

    Context warehouse.yml:

    name: mysql
    connection:
      type: mysql
      host: localhost
      port: 3306
      username: mysql
      password: mysql
      database: new_york
    

    test: breakdowns.yml:

    table_name: breakdowns
    metrics:
      - row_count
      - missing_count
      - missing_percentage
      - values_count
      - values_percentage
      - invalid_count
      - invalid_percentage
      - valid_count
      - valid_percentage
      - avg_length
      - max_length
      - min_length
      - avg
      - sum
      - max
      - min
      - stddev
      - variance
    tests:
      - row_count > 0
    columns:
      incident_number:
        valid_format: number_whole
        tests:
          - invalid_percentage == 0
          - missing_count == 0
      school_year:
        valid_format: date_inverse
        tests:
          - invalid_percentage == 0
      bus_no:
        valid_format: number_whole
        tests:
          - invalid_percentage <= 20
      schools_serviced:
        valid_format: number_whole
        tests:
          - invalid_percentage <= 15
    

    OS: macOS Monterey 12.2.1
    Python Version: 3.9.5
    Soda SQL Version: 2.2.1
    Warehouse Type: MySQL

    However, it works fine with Postgres.

    bug soda-sql 
    opened by mrkoloev 1
  • Arithmetic Overflow Error in SQL Server

    Describe the bug

    Getting an overflow error for count validation.

    I guess it is related to the query that is being run to perform the calculations. In SQL Server, regular counts and sums lead to an overflow error when not applied to bigint (COUNT_BIG, for example, in the case of the count).
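
    A sketch of what a SQL Server-specific fix could look like in a dialect (class and method names are hypothetical, modeled on the sql_expr_* dialect methods mentioned in other issues here):

    class SQLServerDialect(Dialect):
        def sql_expr_count_all(self) -> str:
            # COUNT(*) returns int in SQL Server and can overflow on big
            # tables; COUNT_BIG(*) returns bigint and avoids the error.
            return 'COUNT_BIG(*)'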

    To Reproduce Steps to reproduce the behavior:

    1. Create a new test in tables/something.yml for a big table
    2. Run soda scan tables/something.yml

    Context

    table_name: big_table
    metrics:
      - row_count
      - missing_count
      - values_count
      - sum
    tests:
      - row_count > 0
    

    OS:
    Python Version: Latest
    Soda SQL Version: Latest
    Warehouse Type: SQL Server

    bug soda-sql 
    opened by lucas-lm 0