Data Contract Fundamentals

Introduction

The DataContract class is the core feature of Orbit. A Data Contract lets you baseline your reference data and validate future data against it, so that errors in your training and/or inference data are caught early.

Functionality available when creating a Data Contract.

Functionality available when validating with a Data Contract.


Initializing

Creates a DataContract object that can be configured, used for validation, saved, and loaded. The contract compares incoming data against the reference dataframe and generates informative reports.

DataContract(data_contract_name, reference_dataframe)

Positional Arguments

  • data_contract_name (str): The unique name of your data contract.
  • reference_dataframe (pandas DataFrame): The dataframe used as the source of truth for all data statistics and information calculated.

Returns

  • A DataContract object.

Example

import pandas as pd
from foundations_orbit import DataContract

# Load the reference data that the contract treats as its source of truth
dataframe = pd.read_csv('my_dataframe.csv')
data_contract = DataContract('reference_dc', dataframe)

Note: Once the Data Contract is initialized with a reference dataframe, it cannot be reconfigured with another dataframe during its lifetime.


Customizing

A DataContract comprises multiple data validation tests, each of which can be configured individually through its own set of attributes. For more information about tests, please refer to the Data Contract Tests documentation. Columns from the dataset can also be excluded globally from all tests. This removes them from the DataContract whether they were configured by default or by the user, as follows:

Excluding

Excludes columns from all tests.

DataContract.exclude(attributes)

Positional Arguments

  • attributes (list of str): List of column names to exclude.

Returns

  • None
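
Example

This section has no example of its own, so the following is a minimal sketch of excluding columns right after a contract is created. The helper name build_contract_without_ids and the column name 'customer_id' are hypothetical; only the DataContract constructor and exclude() come from this documentation. The import sits inside the function so that the helper can be defined even where Orbit is not installed.

```python
def build_contract_without_ids(reference_dataframe, id_columns):
    """Create a contract, then remove identifier columns from every test."""
    # Lazy import: foundations_orbit is only needed when the helper is called.
    from foundations_orbit import DataContract

    data_contract = DataContract('reference_dc', reference_dataframe)
    # exclude() is documented to return None, so keep using data_contract
    # itself rather than the return value of the call.
    data_contract.exclude(id_columns)
    return data_contract
```

A call would look like build_contract_without_ids(dataframe, ['customer_id']), where dataframe is the reference dataframe loaded as in the Initializing example.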

Validating

Runs all tests that were configured on the data contract. Validate generates a JSON report containing all information about the test run, such as metadata, the columns used for tests, the pass/fail status of each test, and any relevant error messages.

DataContract.validate(dataframe)

Positional Arguments

  • dataframe: The dataframe that will be used to run the evaluation against the reference dataframe.

Returns

  • A string of the validation results in JSON format.

Example

import pandas as pd
from foundations_orbit import DataContract

# Reference data defines the contract; the second dataframe is checked against it
dataframe = pd.read_csv('ref_dataframe.csv')
dataframe_to_validate = pd.read_csv('validation_dataframe.csv')
data_contract = DataContract('reference_dc', dataframe)
# validate() returns the report as a JSON string
validation_results = data_contract.validate(dataframe_to_validate)

Note: If the Schema Test fails on any columns, those columns are excluded from all other tests.
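
Because validate() returns a JSON string, it usually needs to be parsed before it can be inspected in code. The sketch below shows one way to do that with the standard library. The function name parse_validation_report is hypothetical, and the placeholder string only stands in for a real report, whose keys are not documented in this section.

```python
import json

def parse_validation_report(validation_results):
    """Convert the JSON string returned by validate() into Python objects."""
    return json.loads(validation_results)

# Placeholder input: a real string would come from data_contract.validate(...).
# The key names below are made up and do not reflect the actual report layout.
placeholder_results = '{"placeholder_section": {"placeholder_field": true}}'
report = parse_validation_report(placeholder_results)

# List the top-level sections without assuming what they are called.
print(sorted(report))
```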


Saving

Once the tests have been configured, a data contract can be saved for later use.

from foundations_orbit import DataContract

data_contract = DataContract(contract_name, reference_dataframe)

DataContract.save(dir_path_to_save)

Positional Arguments

  • dir_path_to_save: Path to the directory where the data contract should be saved. The file is named contract_name.pkl.

Returns

  • None

Example

import pandas as pd
from foundations_orbit import DataContract

dataframe = pd.read_csv('ref_dataframe.csv')
data_contract = DataContract('reference_dc', dataframe)
# Writes reference_dc.pkl into the current directory
data_contract.save('.')
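
Since the saved file name follows the contract_name.pkl format, the location of a saved contract can be worked out ahead of time, for instance to check whether it already exists. This is a small sketch using only the standard library; expected_contract_file is a hypothetical helper, not part of Orbit.

```python
from pathlib import Path

def expected_contract_file(dir_path_to_save, contract_name):
    """Return the path save() is documented to write: <dir>/<contract_name>.pkl."""
    return Path(dir_path_to_save) / f"{contract_name}.pkl"

contract_file = expected_contract_file('data_contracts', 'reference_dc')
# exists() is True only once data_contract.save('data_contracts') has run
print(contract_file, contract_file.exists())
```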


Loading

A data contract that was saved earlier can be loaded from disk to run validation.

DataContract.load(dir_path, contract_name)

Positional Arguments

  • dir_path: Path to the directory where the data contract was saved.
  • contract_name: Name of the contract to be loaded.

Returns

  • A DataContract object.

Example

import pandas as pd
from foundations_orbit import DataContract

dataframe = pd.read_csv('ref_dataframe.csv')
data_contract = DataContract('reference_dc', dataframe)
# Saved as data_contracts/reference_dc.pkl
data_contract.save('data_contracts')
...
# Later, restore the contract by directory and contract name
data_contract = DataContract.load('data_contracts', 'reference_dc')
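
A loaded contract is typically used straight away for validation. The sketch below combines the two documented calls, load() and validate(), in one helper. The name validate_with_saved_contract is hypothetical, and the import is placed inside the function so the helper can be defined without Orbit installed.

```python
def validate_with_saved_contract(dir_path, contract_name, dataframe):
    """Load a previously saved contract and validate a dataframe with it."""
    # Lazy import: foundations_orbit is only needed when the helper is called.
    from foundations_orbit import DataContract

    data_contract = DataContract.load(dir_path, contract_name)
    # Returns the validation report as a JSON string
    return data_contract.validate(dataframe)
```

For the contract saved in the example above, a call would be validate_with_saved_contract('data_contracts', 'reference_dc', dataframe_to_validate).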


Viewing Specifications

A DataContract holds information about its tests: which attributes each test is configured on, and the parameters each test was configured with. This information is accessible through the DataContract.info() method.

DataContract.info()

Positional Arguments

  • None

Returns

  • A nested dictionary object in the following format: {test_name1: test_info1, test_name2: test_info2, ...}. The dictionary contains one key (test_name) for each test in the DataContract, and the corresponding value (test_info) is a dictionary describing the specifications of that test.

Example

import pandas as pd
from foundations_orbit import DataContract

dataframe = pd.read_csv('ref_dataframe.csv')
data_contract = DataContract('reference_dc', dataframe)
# info() returns a nested dictionary keyed by test name
reference_dc_info = data_contract.info()
print(reference_dc_info)
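
Printing the nested dictionary directly can be hard to read once several tests are configured. The following sketch walks the documented {test_name: test_info} structure and produces one line per setting. The helper describe_contract_info and the placeholder dictionary are hypothetical; actual test names and settings depend on how the contract was configured.

```python
def describe_contract_info(contract_info):
    """Return readable lines for a DataContract.info() style dictionary."""
    lines = []
    for test_name, test_info in contract_info.items():
        lines.append(test_name)
        # Each test_info is itself a dictionary of that test's specifications
        for setting, value in test_info.items():
            lines.append(f"  {setting}: {value}")
    return lines

# Placeholder shaped like the documented return value, not real Orbit output
placeholder_info = {'test_name1': {'placeholder_setting': 'placeholder_value'}}
for line in describe_contract_info(placeholder_info):
    print(line)
```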