Overview of Data Validation Using Data Contract
Concepts
Orbit's data validation functionalities center around the following concepts:
- Data Contract. A Python object containing the schema information, expected statistics, and validation tests for a dataset (the reference dataset).
- Monitor. An Orbit monitor can be created from any Python file and is scheduled to run automatically and periodically, as we saw in the hello-orbit example. To create a monitor for data validation, include in the Python file code that calls the .validate method of a DataContract object.
- Monitor Job. A monitor job is one run of a Monitor. Once a monitor is created and scheduled, a monitor job is launched at each scheduled time and executes the Python file contained in the monitor.
- Validation Report. A validation report is the outcome of the .validate method of a DataContract object. If the .validate call is included in a Monitor, a new validation report is produced every time a monitor job runs.
Now, let's go through, end to end, how to use a Data Contract to validate data for your machine learning models. We will cover the following:
- Creating a Data Contract
- Validating data using a Data Contract
- Creating and managing a monitor
- Viewing the validation report
Creating a Data Contract
The first step in using Orbit to validate data in your machine learning pipeline is to create one or more DataContract objects as baselines of your data, which can be applied to validate that future data have not deviated from expectations.
A DataContract is a Python object containing the following information about a dataset (the reference dataset):
- Schema of the reference dataset
- Statistics of the reference dataset (e.g., distribution, minimum and maximum values, and special values in each column)
- Tests that can be applied in the future to validate whether a future dataset (e.g., future data from the same data pipeline) has deviated from the reference dataset
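To make the idea of a reference profile concrete, here is a minimal sketch in plain Python. It is not how Orbit computes or stores this information; the function `profile_dataset`, the dictionary layout, and the column-oriented input format are all assumptions made for illustration.

```python
import math


def _is_nan(value):
    """Return True only for float NaN values."""
    return isinstance(value, float) and math.isnan(value)


def profile_dataset(columns):
    """Summarize a column-oriented table given as {column name: list of values}.

    Hypothetical helper: records column order, a coarse type name,
    and a few per-column statistics.
    """
    profile = {"schema": [], "stats": {}}
    for name, values in columns.items():
        present = [v for v in values if not _is_nan(v)]
        # Schema: keep column order and the type of the first non-NaN value.
        type_name = type(present[0]).__name__ if present else "unknown"
        profile["schema"].append((name, type_name))
        numeric = [v for v in present if isinstance(v, (int, float))]
        profile["stats"][name] = {
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
            "nan_fraction": (len(values) - len(present)) / len(values) if values else 0.0,
        }
    return profile


reference = {
    "age": [23, 35, 58, 41],
    "income": [52000.0, float("nan"), 61000.0, 48000.0],
}
reference_profile = profile_dataset(reference)
```

A profile like this is the kind of baseline that later checks can compare new data against.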
Data Contracts can be created using the DataContract class in Orbit. For example, at model training time, you can add the following snippet to your code:
    import foundations
    from foundations_orbit import DataContract

    ...

    # Assume your baseline dataset (e.g., your training dataset) is in a pandas DataFrame called reference_df
    # Create a Data Contract for reference_df
    dc = DataContract("my_contract", reference_df)

    # Save the Data Contract to a directory called "project_path"
    dc.save("project_path/")
That's it. The few lines of code in this example automatically compute statistics of your reference_df, configure two default tests that can be used to validate future data, and save all of this information in a Data Contract called my_contract in the project_path folder.
By default, when you create a DataContract object, two tests are automatically enabled and configured:
- Schema test: checks the column names, data types, and ordering of all columns in a dataset.
- Distribution test: checks the distribution of each column in a dataset. By default, it uses 'L-infinity' to quantify whether a column's distribution has drifted. If the value is over 0.2, the column is flagged as critical.
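As a rough illustration of an L-infinity drift measure, the sketch below compares the category proportions of two columns and takes the largest absolute difference. It assumes the values are categorical (or already binned); how Orbit bins numeric columns is not described here, and the helper names are hypothetical.

```python
def category_proportions(values, categories):
    """Return the share of `values` falling in each category, in order."""
    if not values:
        raise ValueError("cannot compute proportions of an empty column")
    total = len(values)
    return [values.count(category) / total for category in categories]


def l_infinity_distance(reference_values, current_values):
    """Largest absolute difference between the two category distributions."""
    categories = sorted(set(reference_values) | set(current_values))
    reference = category_proportions(reference_values, categories)
    current = category_proportions(current_values, categories)
    return max(abs(r - c) for r, c in zip(reference, current))


THRESHOLD = 0.2  # the default threshold mentioned above

reference_col = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
current_col = ["a"] * 20 + ["b"] * 35 + ["c"] * 45

distance = l_infinity_distance(reference_col, current_col)
is_critical = distance > THRESHOLD
```

Here category "a" drops from 50% to 20% of the rows, so the largest shift exceeds the threshold and the column would be flagged in this sketch.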
Before you save the Data Contract, you can further customize it using the configure method of each test. For example:
    import numpy as np

    # Assume your DataContract object is dc and the reference data is in reference_df
    dc.special_value_test.configure(attributes=reference_df.columns, thresholds={np.nan: 0.1})
    dc.min_max_test.configure(['age'], lower_bound=0, upper_bound=120)
This example configures the special_value_test for all columns in our reference DataFrame to check for the occurrence of NaN values, and the min_max_test for the age column to ensure its values are between 0 and 120.
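The two checks can be pictured with a small plain-Python sketch. This is not Orbit's implementation: the function names are hypothetical, the threshold is assumed to bound the change in NaN fraction relative to the reference data, and the bounds are assumed to be inclusive.

```python
import math


def _is_nan(value):
    return isinstance(value, float) and math.isnan(value)


def nan_fraction(values):
    """Share of entries that are NaN (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(1 for v in values if _is_nan(v)) / len(values)


def special_value_check(reference_values, current_values, threshold):
    """Assumed semantics: pass if the NaN fraction moved by at most `threshold`."""
    difference = abs(nan_fraction(current_values) - nan_fraction(reference_values))
    return {"difference": difference, "passed": difference <= threshold}


def min_max_check(values, lower_bound, upper_bound):
    """Assumed semantics: pass if every non-NaN value lies within the inclusive bounds."""
    present = [v for v in values if not _is_nan(v)]
    out_of_range = [v for v in present if v < lower_bound or v > upper_bound]
    return {"out_of_range": out_of_range, "passed": not out_of_range}


nan = float("nan")
reference_age = [23.0, 35.0, nan, 41.0]
current_age = [29.0, 130.0, nan, nan]

special_result = special_value_check(reference_age, current_age, threshold=0.1)
range_result = min_max_check(current_age, lower_bound=0, upper_bound=120)
```

In this toy data the NaN share rises from 25% to 50% and one age is above 120, so both checks fail.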
More information about DataContract and the different tests supported can be found in the SDK reference.
Validating data using Data Contract
After you have created and saved a Data Contract from your reference dataset, you can apply it to validate data. In our example in the previous section, we created, configured, and saved a Data Contract called my_contract to the location project_path. In a different Python script, which we'll call validate.py, add the following:
    import foundations
    from foundations_orbit import DataContract

    # some code to load the latest data from your data pipeline
    # assume the data to be validated is in a pandas DataFrame called validate_df
    ...

    dc = DataContract.load("project_path", "my_contract")
    report = dc.validate(validate_df)
    print(report)
In this example, with two lines of code, we validate our dataset validate_df using the my_contract Data Contract we created in the previous section. The .validate method applies the tests in the Data Contract to the dataset in question. For example, the distribution_test computes the distribution breakdown of validate_df and compares it with the distribution of reference_df, which is stored in the my_contract object.
If you run validate.py as a regular Python script, e.g. by running python validate.py, it will print out the report object, which is a JSON object that summarizes the outcome of our data validation.
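When inspecting a report by hand, an indented dump can be somewhat easier to scan than the raw print. The sketch below assumes the report is a plain dictionary of JSON-serializable values; the keys in `example_report` are placeholders made up for illustration and do not reflect the actual structure of an Orbit report.

```python
import json


def format_report(report):
    """Return an indented, key-sorted JSON string for a JSON-serializable report."""
    return json.dumps(report, indent=2, sort_keys=True)


# Placeholder report: these keys and values are invented for illustration only.
example_report = {
    "example_test": {"passed": False, "details": {"age": "out of range"}},
    "another_test": {"passed": True},
}

formatted = format_report(example_report)
print(formatted)
```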
However, it would be very cumbersome to manually run python validate.py whenever we want to validate new data in our data pipeline. In addition, the printed JSON is hard to read. Next, let's see how we can create a monitor to automatically run validate.py for us.
Creating a monitor for data validation
Now that our Data Contract and validation script are defined, we can create an Orbit monitor by running the following command in the terminal:
    foundations monitor create --name monitor1 --project_name example-project . validate.py
We'd expect to see the following feedback:
    Foundations INFO: Creating monitor ...
    Foundations INFO: Job bundle submitted.
    Foundations INFO: Monitor scheduled.
    Successfully created monitor monitor1 in project example-project
More information about the command line interface can be found in the CLI reference.
Next, let's head to the Orbit GUI and do the following:
- Click on the example-project project. It should lead you to the Monitor Schedules tab.
- In the Monitor Schedules tab, you should see a monitor called monitor1. Clicking on the monitor shows its details.
- Under Schedule Details, set the schedule to run the monitor (e.g., every minute at the 10th second).
- Once you click "Save", the monitor's status becomes Active and the times in Next Run are updated.
- At the scheduled times (shown under Next Run), you can see the list of jobs under the Monitor Jobs section (you will need to click the refresh button).
- When a job is being executed, its status shows a blinking green circle.
- When a job is done, its status shows a solid green circle.
- To pause a monitor, click the pause button located below the name of the monitor.
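To make the example schedule ("every minute at the 10th second") concrete, the sketch below computes the next few run times from a given moment. It only illustrates the arithmetic behind such a schedule; `next_runs` is a hypothetical helper and says nothing about how Orbit's scheduler works internally.

```python
from datetime import datetime, timedelta


def next_runs(now, second_of_minute, count):
    """Return the next `count` times strictly after `now` that fall on
    the given second of each minute."""
    candidate = now.replace(second=second_of_minute, microsecond=0)
    if candidate <= now:
        # This minute's slot has already passed, so start from the next minute.
        candidate += timedelta(minutes=1)
    return [candidate + timedelta(minutes=i) for i in range(count)]


runs = next_runs(datetime(2020, 1, 1, 9, 30, 42), second_of_minute=10, count=3)
for run in runs:
    print(run.isoformat())
```

Starting at 09:30:42, the 10th second of the current minute has already passed, so the first run in this sketch falls in the following minute.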