Overview

Data Contract tests are used to validate an incoming dataset against a reference dataset. The tests are executed when calling the validate method on a data contract that was generated against a reference dataset.

The tests compare statistics from the current dataset to those of the reference dataset (captured when the data contract was created) to determine whether each attribute (column) in the current dataset has passed (healthy) or failed (critical) the test.

The result of the Data Contract tests is a validation report that can be viewed in the Data Health tab in the GUI.

The following tests are currently supported by Orbit's Data Contracts.

Test | Description | Supported Types | Default
Schema | Compares the schema of the incoming dataset and reference dataset to validate that columns and their types are preserved | Integer, Float, Boolean, Datetime, Category, Object | Enabled
Distribution | Compares the distribution of the incoming dataset to the reference dataset using either the L-infinity or PSI distance metrics | Integer, Float, Boolean, Datetime, Category | Enabled
Special Value | Compares the percentage point difference of specified special values between the incoming and reference dataset | Integer, Float, Datetime, Category | Disabled
Min-Max | Compares the min and max values of attributes between the incoming and reference dataset | Integer, Float, Datetime | Disabled

Schema Test

Schema Test compares the schema of your reference and current dataframes. For every column in the dataframe, the following properties are compared:

  • Column name
  • Column type

If either the column name or the column type doesn't match between the reference and current data, Schema Test outputs a Critical result for that column.

If Schema Test results in critical for any column, no other data contract tests will be performed on that column.

By default, Schema Test is performed on all columns of your data. No configuration is needed.
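
The comparison described above can be sketched in a few lines of plain Python. This is only an illustration, not Orbit's implementation: the function schema_test and the {column name: type name} dictionaries are hypothetical stand-ins for the schemas of the reference and current dataframes. How columns that appear only in the current data are treated is not described here, so the sketch ignores them.

```python
def schema_test(reference_schema, current_schema):
    """Compare column names and types; return {column: 'healthy' | 'critical'}.

    Both arguments are hypothetical {column_name: type_name} mappings.
    """
    results = {}
    for column, reference_type in reference_schema.items():
        if column not in current_schema:
            results[column] = "critical"  # column name not found in current data
        elif current_schema[column] != reference_type:
            results[column] = "critical"  # column type changed
        else:
            results[column] = "healthy"
    return results


reference = {"age": "integer", "income": "float", "signup": "datetime"}
current = {"age": "float", "income": "float"}
report = schema_test(reference, current)
print(report)  # {'age': 'critical', 'income': 'healthy', 'signup': 'critical'}
```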

Allowed Column Data Types

  • integer
  • float
  • boolean
  • datetime
  • category
  • object

Distribution Test

Distribution Test detects shifts in the distribution of the data between the reference and current dataframes.

The test is performed as follows:

  1. Bins are created from the Reference dataframe

    • If configured on a nominal (categorical) attribute: reference dataframe values are binned into unique bins for each unique value (unique values that take up less than 1% of the total data in a column get combined into one bin).
    • If configured on an ordinal (numerical) attribute: reference dataframe values are binned into 'max-bins' number of bins, with the same number of elements in each bin.
  2. Current dataframe values are binned into the same bins as the reference.

  3. A distance metric measures the distribution shift by comparing the size of each bin in the reference dataframe with the size of the same bin in the current dataframe.

  4. The test results in critical if the distance metric is greater than a user-specified threshold.
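
The steps above can be sketched for a numerical attribute as follows. This is a minimal illustration under stated assumptions, not Orbit's code: all function names are hypothetical, the L-infinity distance and PSI use their common textbook definitions, and the small eps used to avoid taking the logarithm of an empty bin is a choice made for this sketch. The categorical case (one bin per unique value, with rare values combined) is not covered.

```python
import math
from bisect import bisect_right


def quantile_edges(reference, max_bins):
    """Interior bin edges so each bin holds about the same number of reference values."""
    ordered = sorted(reference)
    n = len(ordered)
    return [ordered[(i * n) // max_bins] for i in range(1, max_bins)]


def bin_proportions(values, edges):
    """Fraction of values that fall into each of the len(edges) + 1 bins."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect_right(edges, value)] += 1
    return [count / len(values) for count in counts]


def l_infinity(ref, cur):
    """Largest absolute difference between matching bin proportions."""
    return max(abs(r - c) for r, c in zip(ref, cur))


def psi(ref, cur, eps=1e-6):
    """Population stability index; eps guards against empty bins (sketch assumption)."""
    total = 0.0
    for r, c in zip(ref, cur):
        r, c = max(r, eps), max(c, eps)
        total += (c - r) * math.log(c / r)
    return total


reference = list(range(100))    # values 0..99
current = list(range(50, 150))  # same shape, shifted upwards
edges = quantile_edges(reference, max_bins=4)   # step 1: bins from the reference
ref_p = bin_proportions(reference, edges)
cur_p = bin_proportions(current, edges)         # step 2: same bins for current data
shift = l_infinity(ref_p, cur_p)                # step 3: distance metric
result = "critical" if shift > 0.2 else "healthy"  # step 4: compare to threshold
print(ref_p, cur_p, shift, result)
```

Building the bin edges from the reference data only, and reusing them for the current data, is what makes the two sets of bin proportions directly comparable.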

Allowed Column Data Types

  • integer
  • float
  • boolean
  • datetime
  • category

Configure

Description

  • By default, Distribution Test is configured to be performed on all columns.

DataContract.distribution_test.configure(attributes, threshold, method)
Arguments

  • attributes (list): A list of attribute names.

  • threshold (number between 0 and 1): the value of the distribution shift metric above which the result of the test is considered Critical.

  • method ('l-infinity' or 'psi'): the distance metric used (between bins / categories) to determine the result of a distribution test. The 'psi' metric is based on KL-divergence.

Returns

  • None

Example: Configuring a distribution test on two columns, with a threshold of 0.2 using the 'psi' distance metric.

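from foundations_orbit import DataContract
# `data` is assumed to be the reference dataframe, defined earlier.
contract = DataContract("contract_name", data)
# Run the distribution test on two columns with the 'psi' metric.
contract.distribution_test.configure(["col1","col2"], threshold=0.2, method='psi')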

Example: Overriding distribution test on 'col1', while also configuring a distribution test on 'col3'.

from foundations_orbit import DataContract
contract = DataContract("contract_name", data)
# First call: configure 'col1' with the 'psi' metric.
contract.distribution_test.configure(["col1"], threshold=0.2, method='psi')
# Second call: overwrite the settings for 'col1' and configure 'col3'.
contract.distribution_test.configure(["col1","col3"], threshold=0.1, method='l-infinity')

Note: Calling configure again on the same data contract overwrites the settings for attributes that have already been configured, and configures any new ones.

Exclude

Description

  • Attributes can be excluded from a Distribution Test. If an attribute has already been configured, either by the user or by default, and the exclude method is called on that attribute, Distribution Test will not be performed on that attribute.

DataContract.distribution_test.exclude(attributes)
Arguments

  • attributes (list): A list of attribute names to exclude from the test.

Returns

  • None

Example

from foundations_orbit import DataContract
contract = DataContract("contract_name", data)
contract.distribution_test.exclude(["col1","col2"])

Note: If the exclude method is called on an attribute that has not been configured either by default or by the user, it has no effect on that attribute.
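
The way configure and exclude interact can be modelled with a plain dictionary of per-attribute settings. The class below is a hypothetical stand-in written only to illustrate the documented behaviour (overwrite on repeated configure, no effect when excluding an unconfigured attribute); it does not reflect how Orbit stores its configuration.

```python
class AttributeSettings:
    """Hypothetical model of one test's per-attribute configuration."""

    def __init__(self):
        self.settings = {}

    def configure(self, attributes, **options):
        for attribute in attributes:
            # Overwrites existing settings or creates new ones.
            self.settings[attribute] = dict(options)

    def exclude(self, attributes):
        for attribute in attributes:
            # Removing an attribute that was never configured does nothing.
            self.settings.pop(attribute, None)


config = AttributeSettings()
config.configure(["col1"], threshold=0.2, method="psi")
config.configure(["col1", "col3"], threshold=0.1, method="l-infinity")
config.exclude(["col1", "col2"])  # 'col2' was never configured: no effect
print(config.settings)  # {'col3': {'threshold': 0.1, 'method': 'l-infinity'}}
```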

Special Value Test

Special Value Test detects shifts in any set of specific values in the data between the reference and current dataframes.

For each configured attribute and special value pair, the test is performed as follows:

  1. Special values in the reference dataframe are counted, and stored as a percentage of total values (per attribute).

  2. Step 1 is performed on the current dataframe.

  3. The change between the percentage stored for the reference dataframe and the percentage stored for the current dataframe is calculated.

  4. The test results in critical if that change is greater than a user-specified threshold.
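
A minimal sketch of these steps for a single attribute and a single special value is shown below. It is an illustration only: the function names are hypothetical, and the shift is taken as the absolute difference between the two shares, following the "percentage point difference" wording in the overview table. NaN needs separate handling because NaN never compares equal to itself.

```python
import math


def special_value_share(values, special_value):
    """Fraction of values equal to special_value (NaN is matched with isnan)."""
    def matches(value):
        if isinstance(special_value, float) and math.isnan(special_value):
            return isinstance(value, float) and math.isnan(value)
        return value == special_value
    return sum(1 for value in values if matches(value)) / len(values)


def special_value_test(reference, current, special_value, threshold):
    ref_share = special_value_share(reference, special_value)  # step 1
    cur_share = special_value_share(current, special_value)    # step 2
    shift = abs(cur_share - ref_share)                         # step 3
    return "critical" if shift > threshold else "healthy"      # step 4


reference = [1, -1, 3, 4, 5, 6, 7, 8, 9, 10]   # 10% of values are -1
current = [-1, -1, -1, -1, 5, 6, 7, 8, 9, 10]  # 40% of values are -1
print(special_value_test(reference, current, special_value=-1, threshold=0.2))
```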

Allowed Column Data Types

  • integer
  • float
  • datetime
  • category

Configure

Description

  • By default, Special Value Test is not configured on any columns.

DataContract.special_value_test.configure(attributes, thresholds)
Arguments

  • attributes (list): A list of attribute names.

  • thresholds (dictionary in the form {object: number between 0 and 1}): A dictionary whose keys are the user-defined 'special values' and whose values are the corresponding thresholds. Each threshold determines the size of shift above which the test results in a critical.

Returns

  • None

Example: Configuring a special value test for the values 'np.nan' and '-1'

import numpy as np
from foundations_orbit import DataContract
contract = DataContract("contract_name", data)
contract.special_value_test.configure(["col1","col2"], thresholds={np.nan: 0.1, -1: 0.2})

Example: Overriding a special value test ('np.nan' and '-1') on 'col1' with a new special value, and configuring a new test for 'col3'.

import numpy as np
from foundations_orbit import DataContract
contract = DataContract("contract_name", data)
# First call: track 'np.nan' and '-1' on 'col1'.
contract.special_value_test.configure(["col1"], thresholds={np.nan: 0.1, -1: 0.2})
# Second call: overwrite the settings for 'col1' and configure 'col3'.
contract.special_value_test.configure(["col1","col3"], thresholds={999: 0.2})

Note: Calling configure again on the same data contract overwrites the settings for attributes that have already been configured, and configures any new ones.

Exclude

Description

  • Attributes can be excluded from a Special Value Test. If an attribute has already been configured, either by the user or by default, and the exclude method is called on that attribute, Special Value Test will not be performed on that attribute.

DataContract.special_value_test.exclude(attributes)
Arguments

  • attributes (list): A list of attribute names to exclude from the test.

Returns

  • None

Example

from foundations_orbit import DataContract
contract = DataContract("contract_name", data)
contract.special_value_test.exclude(["col1","col2"])

Min-Max Test

The Minimum and Maximum Test enforces that all values within an attribute are no less than a user-specified 'minimum', and / or no greater than a user-specified 'maximum'.

For each configured attribute, the test is performed as follows:

  1. The minimum and / or maximum value of the attribute is computed.

  2. The test results in critical if the computed minimum is less than the specified lower bound, or the computed maximum is greater than the specified upper bound.
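
The check can be sketched as below. This is an illustration under assumptions, not Orbit's implementation: the function name min_max_test is hypothetical, and raising ValueError when neither bound is given is a choice made for the sketch.

```python
def min_max_test(values, lower_bound=None, upper_bound=None):
    """Return 'critical' if any value falls outside the given bound(s)."""
    if lower_bound is None and upper_bound is None:
        raise ValueError("Provide lower_bound, upper_bound, or both")
    if lower_bound is not None and min(values) < lower_bound:
        return "critical"  # computed minimum is below the lower bound
    if upper_bound is not None and max(values) > upper_bound:
        return "critical"  # computed maximum is above the upper bound
    return "healthy"


print(min_max_test([0, 5, 999], lower_bound=-10, upper_bound=1000.5))  # healthy
print(min_max_test([-11, 5], lower_bound=-10))                         # critical
```

Values exactly equal to a bound pass in this sketch, since only values strictly below the lower bound or strictly above the upper bound are flagged.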

Allowed Column Data Types

  • integer
  • float
  • datetime

Configure

Description

  • By default, Minimum and Maximum Test is not configured on any columns.

DataContract.min_max_test.configure(attributes, lower_bound, upper_bound)
Arguments

  • attributes (list): A list of attribute names.

  • lower_bound (number): The lower bound for the attribute(s). If any value in the attribute(s) is below this, the test will result in a critical.

  • upper_bound (number): The upper bound for the attribute(s). If any value in the attribute(s) is above this, the test will result in a critical.

Returns

  • None

Example: Configuring a min max test on 'col1' and 'col2'

from foundations_orbit import DataContract
contract = DataContract("contract_name", data)
contract.min_max_test.configure(["col1","col2"], lower_bound=-10, upper_bound=1000.5)

Example: Overriding a min max test on 'col1', while configuring a new one on 'col3'

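from foundations_orbit import DataContract
contract = DataContract("contract_name", data)
# First call: set an upper bound on 'col1'.
contract.min_max_test.configure(["col1"], upper_bound=100)
# Second call: overwrite the settings for 'col1' and configure 'col3'.
contract.min_max_test.configure(["col1", "col3"], lower_bound=50)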

Note: At least one of lower_bound or upper_bound must be provided, or configure will raise an error.

Note: Calling configure again on the same data contract overwrites the settings for attributes that have already been configured, and configures any new ones.

Exclude

Description

  • Attributes can be excluded from a Minimum and Maximum Test. If an attribute has already been configured, either by the user or by default, and the exclude method is called on that attribute, Minimum and Maximum Test will not be performed on that attribute.

DataContract.min_max_test.exclude(attributes)
Arguments

  • attributes (list): A list of attribute names to exclude from the test.

Returns

  • None

Example

from foundations_orbit import DataContract
contract = DataContract("contract_name", data)
contract.min_max_test.exclude(["col1","col2"])