Overview of Data Validation Using Data Contract

Concepts

Orbit's data validation functionality centers on the following concepts:

- Data Contract. A Python object containing the schema information, expected statistics, and validation tests for a dataset (the reference dataset).
- Monitor. An Orbit monitor can be created from any Python file and is scheduled to run automatically and periodically, as we saw in the hello-orbit example. To create a monitor for data validation, include in the Python file code that calls the .validate method of a DataContract object.
- Monitor Job. A monitor job is a single run of a Monitor. Once a monitor is created and scheduled, a monitor job is launched at each scheduled time and executes the Python file contained in the monitor.
- Validation Report. A validation report is the outcome of the .validate method of a DataContract object. If the .validate method is called in a Monitor, a new validation report is produced every time a monitor job runs.

Now, let's walk through, end to end, how to use a Data Contract to validate data for your machine learning models. We will cover the following:

- Creating a Data Contract
- Validating data using a Data Contract
- Creating and managing a monitor
- Viewing the validation report

Creating a Data Contract

The first step in using Orbit to validate data in your machine learning pipeline is to create one or more DataContract objects as baselines of your data, which can be applied to validate that future data have not deviated from expectations.

A DataContract is a Python object containing the following information about a dataset (the reference dataset):

- Schema of the reference dataset
- Statistics of the reference dataset (e.g., distribution, minimum and maximum values, and special values in each column)
- Tests that can be applied in the future to validate whether a future dataset (e.g., future data from the same data pipeline) has deviated from the reference dataset

Data Contracts can be created using the DataContract class in Orbit. For example, during model training, you can add the following snippet to your code:

import foundations
from foundations_orbit import DataContract

...
# Assume your baseline dataset (e.g., your training dataset) is in a pandas dataframe called reference_df

# To create a Data Contract for reference_df
dc = DataContract("my_contract", reference_df)

# Save the Data Contract to a directory called "project_path"
dc.save("project_path/")

That's it. The couple of lines of code in this example automatically compute statistics of your reference_df, configure two default tests that can be used to validate future data, and save all of this information in a Data Contract called my_contract in the project_path folder.
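To give a feel for the kind of statistics such a baseline might hold, here is a minimal sketch in plain Python. It is not Orbit's implementation: the helper summarize_column, the number of bins, and the dictionary layout are all assumptions made for illustration, and a plain dict of lists stands in for the pandas dataframe so the sketch has no dependencies.

```python
import math
from collections import Counter


def summarize_column(values, bins=10):
    """Return simple reference statistics for one numeric column (hypothetical helper)."""
    # Drop NaN entries before computing the range and the histogram.
    present = [v for v in values if not (isinstance(v, float) and math.isnan(v))]
    nan_fraction = 1 - len(present) / len(values)
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant columns
    # Assign each value to a bin index; the maximum value goes into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in present)
    distribution = [counts.get(i, 0) / len(present) for i in range(bins)]
    return {
        "min": lo,
        "max": hi,
        "nan_fraction": nan_fraction,
        "distribution": distribution,
    }


# A dict of lists stands in for reference_df here.
reference = {"age": [22, 35, 41, 58, float("nan"), 63, 29, 47]}
stats = {name: summarize_column(col, bins=4) for name, col in reference.items()}
print(stats["age"])
```

Statistics like these are what later validation runs can compare new data against.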

By default, when you create a DataContract object, two tests are automatically enabled and configured:

- Schema test: checks the column names, data types, and ordering of all columns in a dataset.
- Distribution test: checks the distribution of each column in a dataset. By default, it uses 'L-infinity' to quantify whether a column's distribution has drifted. If the value is over 0.2, the column is flagged as critical.
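The two default tests can be illustrated with a small, dependency-free sketch. It assumes that 'L-infinity' means the largest absolute difference between the binned distributions of the reference and current data; the function names and the way bins and schemas are represented are hypothetical and may not match what Orbit does internally.

```python
def schema_mismatches(reference_schema, current_schema):
    """Compare ordered (column name, type name) pairs and list the differences."""
    problems = []
    ref_names = [name for name, _ in reference_schema]
    cur_names = [name for name, _ in current_schema]
    if ref_names != cur_names:
        problems.append("column names or ordering differ")
    ref_types = dict(reference_schema)
    for name, type_name in current_schema:
        if name in ref_types and ref_types[name] != type_name:
            problems.append(
                f"column {name!r}: expected {ref_types[name]}, got {type_name}"
            )
    return problems


def l_infinity(p, q):
    """Largest absolute difference between two binned distributions."""
    return max(abs(a - b) for a, b in zip(p, q))


THRESHOLD = 0.2  # default critical threshold quoted in the text

reference_dist = [0.25, 0.25, 0.25, 0.25]  # made-up bin proportions
shifted_dist = [0.05, 0.15, 0.30, 0.50]

drift = l_infinity(reference_dist, shifted_dist)
status = "critical" if drift > THRESHOLD else "ok"
print(drift, status)

mismatches = schema_mismatches(
    [("age", "int64"), ("income", "float64")],
    [("age", "float64"), ("income", "float64")],
)
print(mismatches)
```

In this made-up example the last bin grows from 0.25 to 0.50, so the drift is 0.25 and the column would be flagged as critical under a 0.2 threshold.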

Before you save the Data Contract, you can further customize it for your usage using the configure method. For example:

# Assume your DataContract object is dc and the reference data is in reference_df
import numpy as np

# Check every column for NaN values, with a threshold of 0.1
dc.special_value_test.configure(attributes=reference_df.columns, thresholds={np.nan: 0.1})
# Check that values in the age column fall between 0 and 120
dc.min_max_test.configure(['age'], lower_bound=0, upper_bound=120)

In this example, we enabled the special_value_test for all columns in our reference dataframe to check for the occurrence of NaN values, and the min_max_test for the age column to ensure its values are within 0 and 120.
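As a rough illustration of what these two checks look for, the sketch below reimplements them in plain Python. It rests on two assumptions that the text does not confirm: that the 0.1 threshold is the largest acceptable fraction of NaN values in a column (Orbit may instead compare against the reference dataset's fraction), and that the min-max bounds are inclusive. The function names are hypothetical.

```python
import math


def is_nan(value):
    """True for float NaN, False for everything else."""
    return isinstance(value, float) and math.isnan(value)


def passes_special_value(values, threshold):
    """Assumed reading: pass when the NaN fraction does not exceed the threshold."""
    nan_fraction = sum(1 for v in values if is_nan(v)) / len(values)
    return nan_fraction <= threshold


def passes_min_max(values, lower_bound, upper_bound):
    """Pass when every non-NaN value lies within the (assumed inclusive) bounds."""
    return all(lower_bound <= v <= upper_bound for v in values if not is_nan(v))


# Made-up age columns: one clean, one with two NaN values and an outlier.
clean_ages = [34, 51, 27, 45, 62, 19, 40, 38, 73, 56]
dirty_ages = [34, 51, float("nan"), 27, 130, 45, float("nan"), 19, 40, 38]

for label, ages in [("clean", clean_ages), ("dirty", dirty_ages)]:
    print(
        label,
        passes_special_value(ages, threshold=0.1),
        passes_min_max(ages, lower_bound=0, upper_bound=120),
    )
```

The dirty column fails both checks here: 2 of its 10 values are NaN, and 130 lies above the upper bound.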

More information about DataContract and the different tests supported can be found in the SDK reference.

Validating data using Data Contract

After you have created and saved a Data Contract from your reference dataset, you can apply it to validate data. In the previous section, we created, configured, and saved a Data Contract called my_contract to the location project_path.

In a different Python script, let's call it validate.py, add the following:

import foundations
from foundations_orbit import DataContract

# some code to load the latest data from your data pipeline
# assume the data to be validated is in a pandas dataframe called validate_df
...
dc = DataContract.load("project_path", "my_contract")
report = dc.validate(validate_df)
print(report)

In this example, with two lines of code, we validate our dataset validate_df using the my_contract Data Contract we created in the previous section. The .validate method applies the tests in the Data Contract to the dataset in question. For example, the distribution_test computes the distribution breakdown of validate_df and compares it with the distribution of reference_df, which is stored in the my_contract object.

If you run validate.py as a regular Python script, e.g. by running python validate.py, it will print the report object, which is a JSON object that summarizes the outcome of the data validation.
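When running the script by hand, indenting the output can make it easier to scan. The sketch below assumes the report behaves like a JSON-serializable dictionary. The layout of a real report is not shown in this guide, so the keys and values used here are a made-up stand-in.

```python
import json

# Hypothetical stand-in for the object returned by dc.validate(...);
# the keys and nesting of a real report may differ.
report = {
    "schema": {"passed": True},
    "distribution": {"age": {"l_infinity": 0.25, "status": "critical"}},
}

# Indent and sort keys so nested results line up when printed.
pretty = json.dumps(report, indent=2, sort_keys=True)
print(pretty)
```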

However, it would be very cumbersome to manually run python validate.py whenever we want to validate new data in our data pipeline. In addition, the printed JSON is hard to read. Next, let's see how we can create a monitor to automatically run validate.py for us.

Creating a monitor for data validation

Now that we have our Data Contract and validation script defined, we can create an Orbit monitor by running the following command in the terminal:

foundations monitor create --name monitor1 --project_name example-project . validate.py

We'd expect to see the following feedback:

Foundations INFO: Creating monitor ...
Foundations INFO: Job bundle submitted.
Foundations INFO: Monitor scheduled.
Successfully created monitor monitor1 in project example-project

More information about the command line interface can be found in the CLI reference.

Next, let's head to the Orbit GUI and do the following:

- On the GUI, click on the example-project project. It should lead you to the Monitor Schedules tab.
- In the Monitor Schedules tab, you should see a monitor called monitor1. Clicking on the monitor will show you details about it.
- Under Schedule Details, set the schedule to run the monitor (e.g., let's run it every minute at the 10th second).
- Once you click "Save", the monitor's status will become Active and the times in Next Run will be updated.
- At the scheduled times (shown under Next Run), you can see a list of jobs under the Monitor Jobs section (you will need to click the refresh button).
- When a job is being executed, its status shows a blinking green circle.
- When a job is done, its status shows a solid green circle.
- To pause a monitor, click the pause button located below the name of the monitor.
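To make the example schedule concrete, the sketch below computes when a job would be due under "every minute at the 10th second". It only illustrates the schedule arithmetic with the standard library; the function next_runs is hypothetical and is not part of Orbit, which handles scheduling for you.

```python
from datetime import datetime, timedelta


def next_runs(now, second=10, count=3):
    """Return the next `count` times that fall on `second` past each minute."""
    candidate = now.replace(second=second, microsecond=0)
    if candidate <= now:
        # This minute's slot has already passed, so start with the next minute.
        candidate += timedelta(minutes=1)
    return [candidate + timedelta(minutes=i) for i in range(count)]


# Made-up "current time" of 09:30:42.
runs = next_runs(datetime(2024, 1, 1, 9, 30, 42))
for run in runs:
    print(run.isoformat())
```

Starting from 09:30:42, the first slot is 09:31:10, followed by 09:32:10 and 09:33:10.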

Viewing validation report