Data Contract Fundamentals
Introduction¶
The DataContract class is the central feature of Orbit. Data Contracts let you baseline your reference data and validate future data against it, so that errors in your training and/or inference data are caught early.
Functionality available when creating a Data Contract:
- Initializing a Data Contract
- Customizing a Data Contract
- Saving a Data Contract
- Viewing the Specifications of a Data Contract
Functionality available when validating with a Data Contract:
- Loading a Data Contract
- Validating with a Data Contract
Initializing¶
Creates a DataContract object that can be configured, used for validation, saved, and loaded. The contract uses the reference dataframe as the baseline against which incoming data is compared when generating reports.
DataContract(data_contract_name, reference_dataframe)
Positional Arguments
- data_contract_name (str): The unique name of your data contract.
- reference_dataframe (pandas DataFrame): The dataframe used as the source of truth for all data statistics and information calculated.
Returns
- A DataContract object.
Example
import pandas as pd
from foundations_orbit import DataContract

# Load the reference data and use it to initialize the contract
dataframe = pd.read_csv('my_dataframe.csv')
data_contract = DataContract('reference_dc', dataframe)
Note: Once the Data Contract is initialized with a reference dataframe, it cannot be reconfigured with another dataframe during its lifetime.
Customizing¶
A DataContract comprises multiple data validation tests, each of which can be configured individually.
Each test has a set of attributes it accepts for configuration.
For more information about tests, please refer to the Data Contract Tests documentation.
Columns from the dataset can be globally excluded from all tests. Excluding a column removes it from every test in the DataContract, whether it was configured by default or by the user. Columns are excluded as follows:
Excluding¶
Excludes columns from all tests.
DataContract.exclude(attributes)
Positional Arguments
- attributes (list of str): List of column names to exclude from all tests.
Returns
- None
Validating¶
Runs all tests that were configured on the data contract. validate generates a JSON report containing all information about the test run, such as metadata, the columns used for each test, the pass/fail status of each test, and any error messages.
DataContract.validate(dataframe)
Positional Arguments
- dataframe: The dataframe to be validated against the reference dataframe.
Returns
- A string of the validation results in JSON format.
Example
import pandas as pd
from foundations_orbit import DataContract

dataframe = pd.read_csv('ref_dataframe.csv')
dataframe_to_validate = pd.read_csv('validation_dataframe.csv')

# Build the contract from the reference data, then validate the new data against it
data_contract = DataContract('reference_dc', dataframe)
validation_results = data_contract.validate(dataframe_to_validate)
Note: If the Schema Test fails on any columns, those columns are excluded from all other tests.
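Because validate returns the report as a JSON string, it usually needs to be decoded before it can be inspected in code. The sketch below shows one way to do that with the standard json module. The helper name validate_to_dict and the _StandInContract class are hypothetical, and the placeholder report is not the real report layout, which is not described in this section.

```python
import json


def validate_to_dict(data_contract, dataframe):
    """Run validation and decode the JSON report into a Python dictionary."""
    report_json = data_contract.validate(dataframe)  # documented to return a JSON string
    return json.loads(report_json)


class _StandInContract:
    """Hypothetical stand-in so the sketch runs without Orbit installed."""

    def validate(self, dataframe):
        # Placeholder report; the actual keys produced by Orbit may differ.
        return json.dumps({"example_test": {"passed": True}})


report = validate_to_dict(_StandInContract(), dataframe=None)
print(sorted(report))  # top-level keys of the decoded report
```

With a real DataContract, the decoded dictionary can then be searched for failing tests or written to a log.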
Saving¶
Once the tests have been configured, a data contract can be saved to be used later.
from foundations_orbit import DataContract

data_contract = DataContract(contract_name, reference_dataframe)
data_contract.save(dir_path_to_save)
Positional Arguments
- dir_path_to_save: Path to the directory where the data contract should be saved. The file name will be of the format contract_name.pkl.
Returns
- None
Example
import pandas as pd
from foundations_orbit import DataContract

dataframe = pd.read_csv('ref_dataframe.csv')
data_contract = DataContract('reference_dc', dataframe)

# Save the contract into the current directory
data_contract.save('.')
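Since the saved file is named after the contract, the expected location of a saved contract can be worked out ahead of time, for instance to check that it exists. The sketch below assumes only the contract_name.pkl naming described above; the helper name saved_contract_path is hypothetical and not part of Orbit.

```python
from pathlib import Path


def saved_contract_path(dir_path_to_save, contract_name):
    """Return the path where a contract is expected to be written."""
    # The file name format contract_name.pkl is taken from the description above.
    return Path(dir_path_to_save) / f"{contract_name}.pkl"


path = saved_contract_path("data_contracts", "reference_dc")
print(path.name)      # file name derived from the contract name
print(path.exists())  # whether a saved contract is already present
```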
Loading¶
A data contract that was saved earlier can be loaded from disk to run validation.
DataContract.load(dir_path, contract_name)
Positional Arguments
- dir_path: Path to the directory where the data contract was saved.
- contract_name: Name of the contract to be loaded.
Returns
- A DataContract object.
Example
import pandas as pd
from foundations_orbit import DataContract

dataframe = pd.read_csv('ref_dataframe.csv')
data_contract = DataContract('reference_dc', dataframe)
data_contract.save('data_contracts')
...
# Later, load the saved contract by directory and contract name
data_contract = DataContract.load('data_contracts', 'reference_dc')
Viewing Specifications¶
A DataContract contains information about its tests, the attributes each test has been configured on, and the parameters each test was configured with.
This information is accessible through the DataContract.info() method.
DataContract.info()
Positional Arguments
- None
Returns
- A nested dictionary object in the following format: {test_name1: test_info1, test_name2: test_info2...}. The dictionary contains one key (test_name) for each test in the DataContract, with the corresponding value (test_info) being a dictionary describing the specifications of that test.
Example
import pandas as pd
from foundations_orbit import DataContract

dataframe = pd.read_csv('ref_dataframe.csv')
data_contract = DataContract('reference_dc', dataframe)

# Retrieve and print the specifications of every configured test
reference_dc_info = data_contract.info()
print(reference_dc_info)
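Printing the nested dictionary directly can be hard to read when many tests are configured. As a sketch, the dictionary can be walked test by test, relying only on the {test_name: test_info} shape described above. The helper name describe_tests is hypothetical, and the sample dictionary, including its "columns" and "threshold" keys, is invented for illustration and does not reflect Orbit's actual specification keys.

```python
import json


def describe_tests(contract_info):
    """Return one readable line per test in a DataContract.info() dictionary."""
    lines = []
    for test_name, test_info in contract_info.items():
        # test_info is itself a dictionary of specifications for that test;
        # default=str keeps the call from failing on non-JSON value types.
        lines.append(f"{test_name}: {json.dumps(test_info, sort_keys=True, default=str)}")
    return lines


# Hypothetical sample; real test names and specification keys come from Orbit.
sample_info = {
    "test_name1": {"columns": ["age", "income"]},
    "test_name2": {"columns": ["age"], "threshold": 0.1},
}

for line in describe_tests(sample_info):
    print(line)
```

With a real contract, pass the result of data_contract.info() in place of sample_info.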