Overview¶
Data Contract tests are used to validate an incoming dataset against a reference dataset.
The tests are executed when calling the validate method on a data contract that was generated against a reference dataset.
The tests compare statistics from the current dataset to those of the reference dataset (captured when creating the data contract) to determine whether each attribute (column) in the current dataset has passed (healthy) or failed (critical) the test.
The result of the Data Contract tests is a validation report that can be viewed in the Data Health tab in the GUI.
The following tests are currently supported by Orbit's Data Contracts.
Test | Description | Supported Types | Default |
---|---|---|---|
Schema | Compares the schema of the incoming dataset and reference dataset to validate that columns and their types are preserved | Integer, Float, Boolean, Datetime, Category, Object | Enabled |
Distribution | Compares the distribution of the incoming dataset to the reference dataset using either the L-infinity or PSI distance metrics | Integer, Float, Boolean, Datetime, Category | Enabled |
Special Value | Compares the percentage point difference of specified special values between the incoming and reference dataset | Integer, Float, Datetime, Category | Disabled |
Min-Max | Compares the min and max values of attributes between the incoming and reference dataset | Integer, Float, Datetime | Disabled |
Schema Test¶
Schema Test compares the schema of your reference and current dataframes. For every column in the dataframe, the following properties are compared:
- Column name
- Column type
If either the column name or the column type doesn't match between the reference and current data, Schema Test outputs a critical result for that column.
If Schema Test results in critical for a column, no other data contract tests are performed on that column.
By default, Schema Test is always performed on all columns of your data. No configuration is needed.
Allowed Column Data Types
- integer
- float
- boolean
- datetime
- category
- object
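The comparison described above can be illustrated with a minimal, library-independent sketch. It does not use Orbit's implementation: the function name `schema_test`, the plain-dict schema representation, and the example column names are all assumptions made for illustration.

```python
def schema_test(reference_schema, current_schema):
    """Return a per-column status by comparing column names and types.

    Both arguments are hypothetical {column name: type name} dicts.
    """
    results = {}
    for column, reference_type in reference_schema.items():
        if column not in current_schema:
            results[column] = "critical"  # column name missing from current data
        elif current_schema[column] != reference_type:
            results[column] = "critical"  # column type changed
        else:
            results[column] = "healthy"
    return results


reference_schema = {"age": "integer", "income": "float", "signup": "datetime"}
current_schema = {"age": "integer", "income": "object"}

schema_results = schema_test(reference_schema, current_schema)

# Columns that fail the schema check are skipped by the remaining tests.
testable_columns = [c for c, status in schema_results.items() if status == "healthy"]
```

Here only 'age' would remain eligible for the other tests. The sketch ignores columns that appear only in the current data, since the text above does not say how those are reported.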
Distribution Test¶
Distribution Test detects shifts in the distribution of the data between the reference and current dataframes.
The test is performed as follows:
- Bins are created from the reference dataframe.
  - If configured on a nominal (categorical) attribute: reference dataframe values are binned into one bin per unique value (unique values that each take up less than 1% of the total data in a column are combined into one bin).
  - If configured on an ordinal (numerical) attribute: reference dataframe values are binned into 'max-bins' bins, with the same number of elements in each bin.
- Current dataframe values are binned into the same bins as the reference.
- A distance metric is used to calculate the distribution shift between the reference and current dataframes by comparing the sizes of the bins.
- The test results in critical if the distance metric is greater than a user-specified threshold.
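For a numerical attribute, the steps above can be sketched in plain Python. This is an illustration under assumptions, not Orbit's code: the helper names are hypothetical, the bin edges are taken at simple index-based quantiles of the reference values, and the L-infinity distance is taken to be the largest absolute difference between corresponding bin shares.

```python
from bisect import bisect_right


def equal_frequency_edges(reference_values, max_bins):
    """Interior bin edges so each bin holds roughly the same number of reference values."""
    ordered = sorted(reference_values)
    n = len(ordered)
    # Take the value at each 1/max_bins quantile; duplicate edges collapse into one.
    edges = [ordered[(i * n) // max_bins] for i in range(1, max_bins)]
    return sorted(set(edges))


def bin_shares(values, edges):
    """Fraction of values that falls into each bin defined by the edges."""
    counts = [0] * (len(edges) + 1)
    for value in values:
        counts[bisect_right(edges, value)] += 1
    return [count / len(values) for count in counts]


def l_infinity(reference_shares, current_shares):
    """Largest absolute difference between corresponding bin shares."""
    return max(abs(r - c) for r, c in zip(reference_shares, current_shares))


reference = list(range(100))           # 0..99, evenly spread
current = [v + 30 for v in reference]  # same shape, shifted upward by 30

edges = equal_frequency_edges(reference, max_bins=4)   # [25, 50, 75]
ref_shares = bin_shares(reference, edges)              # [0.25, 0.25, 0.25, 0.25]
cur_shares = bin_shares(current, edges)                # [0.0, 0.2, 0.25, 0.55]
distance = l_infinity(ref_shares, cur_shares)          # about 0.30
status = "critical" if distance > 0.2 else "healthy"   # threshold of 0.2 chosen for the example
```

Because the bin edges come only from the reference data, current values outside the reference range land in the first or last bin, which is what makes the shift visible in this example.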
Allowed Column Data Types
- integer
- float
- boolean
- datetime
- category
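For nominal attributes, the 1% rule and the 'psi' metric can be sketched as follows. The helper names, the `__other__` label for the combined bin, and the small `eps` floor used to avoid taking the logarithm of zero are assumptions for illustration. The formula is the commonly used population stability index, sum over bins of (current - reference) * ln(current / reference), and may differ in detail from what Orbit computes.

```python
import math
from collections import Counter

RARE_BIN = "__other__"  # hypothetical label for the combined rare-value bin


def nominal_bins(reference_values, min_share=0.01):
    """Values that get their own bin; values below min_share are combined into RARE_BIN."""
    counts = Counter(reference_values)
    total = len(reference_values)
    return {value for value, count in counts.items() if count / total >= min_share}


def nominal_shares(values, kept_values):
    """Share of values per bin, mapping anything outside kept_values to RARE_BIN."""
    counts = Counter(v if v in kept_values else RARE_BIN for v in values)
    bins = list(kept_values) + [RARE_BIN]
    return {b: counts.get(b, 0) / len(values) for b in bins}


def psi(reference_shares, current_shares, eps=1e-6):
    """Population stability index; eps guards against log(0) for empty bins."""
    total = 0.0
    for b, ref in reference_shares.items():
        ref = max(ref, eps)
        cur = max(current_shares.get(b, 0.0), eps)
        total += (cur - ref) * math.log(cur / ref)
    return total


reference = ["a"] * 600 + ["b"] * 395 + ["c"] * 5   # 'c' is 0.5% of the data
current = ["a"] * 300 + ["b"] * 600 + ["d"] * 100   # 'd' was never seen in the reference

kept = nominal_bins(reference)                      # 'c' falls below 1% and is merged
ref_shares = nominal_shares(reference, kept)
cur_shares = nominal_shares(current, kept)
shift = psi(ref_shares, cur_shares)
status = "critical" if shift > 0.2 else "healthy"
```

In this sketch, values that never appeared in the reference data (such as 'd') are counted in the combined bin. That is a design choice of the example, not a documented Orbit behaviour.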
Configure¶
Description
- By default, Distribution Test is configured to be performed on all columns.
DataContract.distribution_test.configure(attributes, threshold, method)
- attributes (list): A list of attribute names.
- threshold (number between 0 and 1): The minimum value of the distribution shift metric for the result of the test to be considered critical.
- method ('l-infinity' or 'psi'; 'psi' is based on KL-divergence): The distance metric used (between bins / categories) to determine the result of a distribution test.
Returns
- None
Example: Configuring a distribution test on two columns with a threshold of 0.2, using the 'psi' distance metric.

    from foundations_orbit import DataContract

    # data: the reference dataframe the contract is generated against
    contract = DataContract("contract_name", data)
    contract.distribution_test.configure(["col1", "col2"], threshold=0.2, method='psi')
Example: Overriding the distribution test on 'col1', while also configuring a distribution test on 'col3'.

    from foundations_orbit import DataContract

    contract = DataContract("contract_name", data)
    contract.distribution_test.configure(["col1"], threshold=0.2, method='psi')
    # The second call overrides 'col1' and adds 'col3'
    contract.distribution_test.configure(["col1", "col3"], threshold=0.1, method='l-infinity')
Note: Calling configure twice on the same data contract will overwrite the setting for attributes that have already been configured, and configure new ones.
Exclude¶
Description
- Attributes can be excluded from a distribution test. If an attribute has already been configured, either by the user or by default, and the exclude method is called on that attribute, the Distribution Test will not be performed on that attribute.
DataContract.distribution_test.exclude(attributes)
- attributes (list): A list of attribute names to exclude from the test.
Returns
- None
Example

    from foundations_orbit import DataContract

    contract = DataContract("contract_name", data)
    contract.distribution_test.exclude(["col1", "col2"])
Note: If the exclude method is called on an attribute that has not been configured either by default or by the user, it has no effect on that attribute.
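The interaction between configure and exclude described in this section can be modelled with a small stand-in class. This is only a toy model of the documented behaviour: `ToyDistributionTest` and its `settings` dict are hypothetical and say nothing about how Orbit stores its configuration.

```python
class ToyDistributionTest:
    """Toy stand-in that tracks per-attribute settings (not the Orbit class)."""

    def __init__(self):
        self.settings = {}

    def configure(self, attributes, threshold, method):
        for attribute in attributes:
            # A later call replaces whatever was stored for this attribute.
            self.settings[attribute] = {"threshold": threshold, "method": method}

    def exclude(self, attributes):
        for attribute in attributes:
            # Excluding an attribute that was never configured has no effect.
            self.settings.pop(attribute, None)


test = ToyDistributionTest()
test.configure(["col1"], threshold=0.2, method="psi")
test.configure(["col1", "col3"], threshold=0.1, method="l-infinity")
test.exclude(["col3", "col9"])  # 'col9' was never configured

# test.settings == {"col1": {"threshold": 0.1, "method": "l-infinity"}}
```

The model starts with no attributes configured, so it does not reproduce the default of testing all columns.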
Special Value Test¶
Special Value Test detects shifts in any set of specific values in the data between the reference and current dataframes.
For each configured attribute and special value pair, the test is performed as follows:
- Special values in the reference dataframe are counted and stored as a percentage of total values (per attribute).
- The same count is performed on the current dataframe.
- The percentage point difference between the stored reference and current values is calculated.
- The test results in critical if this difference is greater than a user-specified threshold.
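A minimal sketch of these steps for a single attribute follows. The function names are hypothetical, and the shift is assumed to be the absolute difference between the two shares. NaN needs explicit handling because NaN never compares equal to itself, so a plain equality check would count zero NaN values.

```python
import math


def is_special(value, special):
    """Equality check that also matches NaN, which never equals itself."""
    if isinstance(special, float) and math.isnan(special):
        return isinstance(value, float) and math.isnan(value)
    return value == special


def special_value_share(values, special):
    """Fraction of values equal to the special value."""
    return sum(is_special(v, special) for v in values) / len(values)


def special_value_test(reference, current, thresholds):
    """Return a list of (special value, difference, status) for one attribute."""
    results = []
    for special, threshold in thresholds.items():
        difference = abs(
            special_value_share(current, special)
            - special_value_share(reference, special)
        )
        status = "critical" if difference > threshold else "healthy"
        results.append((special, difference, status))
    return results


nan = float("nan")
reference = [1, 2, -1, 3, nan, 4, 5, 6, 7, 8]     # 10% are -1, 10% are NaN
current = [-1, -1, -1, -1, 2, 3, nan, 4, 5, 6]    # 40% are -1, 10% are NaN

results = special_value_test(reference, current, thresholds={nan: 0.1, -1: 0.2})
statuses = [status for _, _, status in results]   # ["healthy", "critical"]
```

The share of -1 moves from 10% to 40%, which exceeds its threshold of 0.2, while the NaN share is unchanged.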
Allowed Column Data Types
- integer
- float
- datetime
- category
Configure¶
Description
- By default, Special Value Test is not configured on any columns.
DataContract.special_value_test.configure(attributes, thresholds)
- attributes (list): A list of attribute names.
- thresholds (dictionary in the form {object: number between 0 and 1}): A dictionary whose keys are user-defined 'special values' and whose values are the corresponding thresholds, which determine the size of shift required for a test to result in critical.
Returns
- None
Example: Configuring a special value test for the values 'np.nan' and '-1'

    import numpy as np
    from foundations_orbit import DataContract

    contract = DataContract("contract_name", data)
    # Keys are the special values, values are their thresholds
    contract.special_value_test.configure(["col1", "col2"], thresholds={np.nan: 0.1, -1: 0.2})
Example: Overriding a special value test ('np.nan' and '-1') on 'col1' with a new special value, and configuring a new test for 'col3'.

    import numpy as np
    from foundations_orbit import DataContract

    contract = DataContract("contract_name", data)
    contract.special_value_test.configure(["col1"], thresholds={np.nan: 0.1, -1: 0.2})
    # The second call overrides 'col1' and adds 'col3'
    contract.special_value_test.configure(["col1", "col3"], thresholds={999: 0.2})
Note: Calling configure twice on the same data contract will overwrite the setting for attributes that have already been configured, and configure new ones.
Exclude¶
Description
- Attributes can be excluded from a Special Value test. If an attribute has already been configured, either by the user or by default, and the exclude method is called on that attribute, the Special Value test will not be performed on that attribute.
DataContract.special_value_test.exclude(attributes)
- attributes (list): A list of attribute names to exclude from the test.
Returns
- None
Example

    from foundations_orbit import DataContract

    contract = DataContract("contract_name", data)
    contract.special_value_test.exclude(["col1", "col2"])
Min-Max Test¶
The Minimum and Maximum test enforces that all values within an attribute are greater than a user-specified 'minimum' and / or less than a user-specified 'maximum'.
For each configured attribute, the test is performed as follows:
- The minimum and / or maximum value is computed for the attribute.
- The test results in critical if the computed minimum is less than the specified lower bound, or the computed maximum is greater than the specified upper bound.
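The check itself is small enough to sketch directly. The function `min_max_test` below is a hypothetical illustration rather than Orbit's implementation. It treats each bound as optional and, as an assumption, raises a `ValueError` when neither is given, mirroring the documented rule that at least one bound must be provided.

```python
from datetime import date


def min_max_test(values, lower_bound=None, upper_bound=None):
    """Return 'critical' if any value falls outside the given bound(s)."""
    if lower_bound is None and upper_bound is None:
        raise ValueError("provide lower_bound, upper_bound, or both")
    observed_min, observed_max = min(values), max(values)
    below = lower_bound is not None and observed_min < lower_bound
    above = upper_bound is not None and observed_max > upper_bound
    return "critical" if below or above else "healthy"


within = min_max_test([1, 5, 20], lower_bound=-10, upper_bound=1000.5)   # "healthy"
too_high = min_max_test([1, 5, 2000], upper_bound=1000.5)                # "critical"

# The same comparison works for datetime-like values.
dates = [date(2020, 1, 1), date(2020, 6, 30)]
too_early = min_max_test(dates, lower_bound=date(2020, 3, 1))            # "critical"
```

Values exactly equal to a bound pass in this sketch, since the text describes values below the lower bound or above the upper bound as failures.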
Allowed Column Data Types
- integer
- float
- datetime
Configure¶
Description
- By default, Minimum and Maximum Test is not configured on any columns.
DataContract.min_max_test.configure(attributes, lower_bound, upper_bound)
- attributes (list): A list of attribute names.
- lower_bound (number): The lower bound for the attribute(s). If any value in the attribute(s) is below this, the test will result in critical.
- upper_bound (number): The upper bound for the attribute(s). If any value in the attribute(s) is above this, the test will result in critical.
Returns
- None
Example: Configuring a min-max test on 'col1' and 'col2'

    from foundations_orbit import DataContract

    contract = DataContract("contract_name", data)
    # Values below -10 or above 1000.5 result in critical
    contract.min_max_test.configure(["col1", "col2"], lower_bound=-10, upper_bound=1000.5)
Example: Overriding a min-max test on 'col1', while configuring a new one on 'col3'

    from foundations_orbit import DataContract

    contract = DataContract("contract_name", data)
    contract.min_max_test.configure(["col1"], upper_bound=100)
    # The second call overrides 'col1' and adds 'col3'
    contract.min_max_test.configure(["col1", "col3"], lower_bound=50)
Note: At least one of lower_bound or upper_bound must be provided, or configure will raise an error.
Note: Calling configure twice on the same data contract will overwrite the setting for attributes that have already been configured, and configure new ones.
Exclude¶
Description
- Attributes can be excluded from a Minimum and Maximum test. If an attribute has already been configured, either by the user or by default, and the exclude method is called on that attribute, the Minimum and Maximum test will not be performed on that attribute.
DataContract.min_max_test.exclude(attributes)
- attributes (list): A list of attribute names to exclude from the test.
Returns
- None
Example

    from foundations_orbit import DataContract

    contract = DataContract("contract_name", data)
    contract.min_max_test.exclude(["col1", "col2"])