Data Validation with Python

Feature Validation with the Hopsworks Feature Store

In this notebook we introduce feature validation operations with the Hopsworks Feature Store and its client API, hsfs.

Background

Motivation

Data ingested into the Feature Store forms the basis for the data fed as input to algorithms that develop machine learning models. The Feature Store is a place where curated feature data is stored, so it is important that this data is validated against rules that ensure it adheres to business requirements.

For example, ingested features might be expected to never be empty or to lie within a certain range: a feature age should always be a non-negative number.

The Hopsworks Feature Store provides users with an API to create Expectations on ingested feature data by utilizing the open source Deequ library (https://github.com/awslabs/deequ). Feature validation is part of the HSFS Java/Scala and Python API for working with Feature Groups. Users work with the following abstractions:

  • Rules: A set of validation rules applied on a Spark/PySpark dataframe that is inserted into a Feature Group.
  • Expectations: A set of rules that is applied on a set of features as provided by the user. Expectations are created at the feature store level and can be attached to multiple feature groups.
  • Validations: The results of expectations against the ingested dataframe are assigned a ValidationTime and are persisted within the Feature Store. Users can then retrieve validation results by validation time and by commit time for time-travel enabled feature groups.

Feature validation is disabled by default: the validation_type feature group attribute is set to NONE. The allowed validation types are:

  • STRICT: Data validation is performed and the feature group is updated only if the validation status is “Success”.
  • WARNING: Data validation is performed and the feature group is updated only if the validation status is “Warning” or lower.
  • ALL: Data validation is performed and the feature group is updated if the validation status is “Failure” or lower, that is, regardless of the outcome.
  • NONE: Data validation is not performed on the feature group.

Examples

Create time travel enabled feature group and Bulk Insert Sample Dataset

For this demo we will use a small sample of the Agrawal generator, a widely used dataset. It contains hypothetical data of people applying for a loan. Rakesh Agrawal, Tomasz Imielinski, and Arun Swami, "Database Mining: A Performance Perspective", IEEE Transactions on Knowledge and Data Engineering, 5(6), December 1993.

For simplicity we split the Agrawal dataset into 3 feature groups and demonstrate feature validation on the economy_fg feature group:
  • economy_fg with customer id, salary, loan, value of house, age of house, commission and type of car features;

Importing necessary libraries

import hsfs
from hsfs.rule import Rule
import datetime
from pyspark.sql import DataFrame, Row
from pyspark.sql.types import *
from pyspark.sql.functions import unix_timestamp, from_unixtime

connection = hsfs.connection()
# Get a reference to the feature store; shared feature stores can also be accessed by providing the feature store name
fs = connection.get_feature_store()
Starting Spark application
ID  YARN Application ID             Kind     State  Spark UI  Driver log
6   application_1612535100309_0042  pyspark  idle   Link      Link
SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.
economy_fg_schema = StructType([
  StructField("id", IntegerType(), True),
  StructField("salary", FloatType(), True),
  StructField("commission", FloatType(), True),
  StructField("car", StringType(), True), 
  StructField("hvalue", FloatType(), True),      
  StructField("hyears", IntegerType(), True),     
  StructField("loan", FloatType(), True),
  StructField("year", IntegerType(), True)    
])

Create a Spark dataframe for the feature group

economy_bulk_insert_data = [
    Row(1, 110499.73, 0.0,  "car15",  235000.0, 30, 354724.18, 2020),
    Row(2, 140893.77, 0.0,  "car20",  135000.0, 2, 395015.33, 2020),
    Row(3, 119159.65, 0.0,  "car1", 145000.0, 22, 122025.08, 2020),
    Row(4, 20000.0, 52593.63, "car9", 185000.0, 30, 99629.62, 2020)    
]

economy_bulk_insert_df = spark.createDataFrame(economy_bulk_insert_data, economy_fg_schema)
economy_bulk_insert_df.show()
+---+---------+----------+-----+--------+------+---------+----+
| id|   salary|commission|  car|  hvalue|hyears|     loan|year|
+---+---------+----------+-----+--------+------+---------+----+
|  1|110499.73|       0.0|car15|235000.0|    30| 354724.2|2020|
|  2|140893.77|       0.0|car20|135000.0|     2|395015.34|2020|
|  3|119159.65|       0.0| car1|145000.0|    22|122025.08|2020|
|  4|  20000.0|  52593.63| car9|185000.0|    30| 99629.62|2020|
+---+---------+----------+-----+--------+------+---------+----+
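Several columns in the schema use FloatType, which is single precision, so a value such as 354724.18 cannot be stored exactly; this is why the table prints 354724.2. The sketch below reproduces the rounding with the standard library only. The helper name to_float32 is hypothetical and is not part of hsfs or PySpark.

```python
import struct


def to_float32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    # Pack as a 4-byte IEEE 754 float, then unpack back into a Python float.
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Loan and salary values taken from the bulk insert rows.
print(to_float32(354724.18))  # single-precision value behind the 354724.2 shown in the table
print(to_float32(140893.77))
```

Declaring the columns as DoubleType instead would keep more digits, at the cost of twice the storage per value.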

Data Validation

The next sections show you how to create feature store expectations, attach them to feature groups, and apply them to dataframes being appended to the feature group.

Discover data validation rules supported in Hopsworks

Hopsworks comes shipped with a set of data validation rules. These rules are immutable, uniquely identified by name and are available across all feature stores. These rules are used to create feature store expectations which can then be attached to feature groups.

# Get all rule definitions available in Hopsworks
rules = connection.get_rules()
[print(rule.to_dict()) for rule in rules]
{'name': 'HAS_SIZE', 'predicate': 'VALUE', 'valueType': 'Integral', 'description': 'A rule that asserts the number of rows of the dataframe'}
{'name': 'HAS_MEAN', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': 'A rule that asserts on the mean of the feature'}
{'name': 'HAS_DATATYPE', 'predicate': 'ACCEPTED_TYPE', 'valueType': 'String', 'description': ''}
{'name': 'HAS_SUM', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': 'A rule that asserts on the sum of the feature'}
{'name': 'IS_LESS_THAN', 'predicate': 'LEGAL_VALUES', 'valueType': 'Fractional', 'description': ''}
{'name': 'IS_GREATER_THAN_OR_EQUAL_TO', 'predicate': 'LEGAL_VALUES', 'valueType': 'Fractional', 'description': ''}
{'name': 'HAS_STANDARD_DEVIATION', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': ''}
{'name': 'HAS_PATTERN', 'predicate': 'PATTERN', 'valueType': 'String', 'description': ''}
{'name': 'HAS_NUMBER_OF_DISTINCT_VALUES', 'predicate': 'VALUE', 'valueType': 'Integral', 'description': ''}
{'name': 'HAS_MAX', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': 'A rule that asserts on the max of the feature'}
{'name': 'IS_CONTAINED_IN', 'predicate': 'LEGAL_VALUES', 'valueType': 'String', 'description': ''}
{'name': 'IS_NON_NEGATIVE', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': ''}
{'name': 'IS_POSITIVE', 'predicate': 'VALUE', 'valueType': 'Boolean', 'description': ''}
{'name': 'HAS_MUTUAL_INFORMATION', 'predicate': 'LEGAL_VALUES', 'valueType': 'Fractional', 'description': ''}
{'name': 'HAS_UNIQUE_VALUE_RATIO', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': ''}
{'name': 'IS_GREATER_THAN', 'predicate': 'LEGAL_VALUES', 'valueType': 'Fractional', 'description': ''}
{'name': 'HAS_COMPLETENESS', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': ''}
{'name': 'HAS_ENTROPY', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': ''}
{'name': 'HAS_MIN', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': 'A rule that asserts on the min of the feature'}
{'name': 'HAS_UNIQUENESS', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': ''}
{'name': 'HAS_DISTINCTNESS', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': ''}
{'name': 'HAS_CORRELATION', 'predicate': 'LEGAL_VALUES', 'valueType': 'Fractional', 'description': ''}
{'name': 'HAS_APPROX_QUANTILE', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': ''}
{'name': 'HAS_APPROX_COUNT_DISTINCT', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': ''}
{'name': 'IS_LESS_THAN_OR_EQUAL_TO', 'predicate': 'LEGAL_VALUES', 'valueType': 'Fractional', 'description': ''}
# Get a rule definition by name
rule_max = connection.get_rule("HAS_MAX")
print(rule_max.to_dict())
{'name': 'HAS_MAX', 'predicate': 'VALUE', 'valueType': 'Fractional', 'description': 'A rule that asserts on the max of the feature'}
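Because rule definitions are uniquely identified by name, the dictionaries returned by to_dict() are easy to index locally. The following is a minimal sketch in plain Python that works on a hand-copied subset of the output above rather than on a live connection; the variable names are illustrative only.

```python
from collections import defaultdict

# A subset of the rule definitions printed above, copied by hand.
rule_definitions = [
    {"name": "HAS_SIZE", "predicate": "VALUE", "valueType": "Integral"},
    {"name": "HAS_MIN", "predicate": "VALUE", "valueType": "Fractional"},
    {"name": "HAS_MAX", "predicate": "VALUE", "valueType": "Fractional"},
    {"name": "HAS_PATTERN", "predicate": "PATTERN", "valueType": "String"},
    {"name": "IS_CONTAINED_IN", "predicate": "LEGAL_VALUES", "valueType": "String"},
]

# Look up a definition by its unique name.
by_name = {definition["name"]: definition for definition in rule_definitions}

# Group rule names by the kind of argument (predicate) they expect.
by_predicate = defaultdict(list)
for definition in rule_definitions:
    by_predicate[definition["predicate"]].append(definition["name"])

print(by_name["HAS_MAX"])
print(dict(by_predicate))
```

Grouping by predicate gives a quick view of which rules take a single value, a pattern, or a set of legal values.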

Create Expectations based on Hopsworks rules

Expectations are created at the feature store level. Multiple expectations can be created per feature store.

An expectation is composed of one or more rules and can refer to one or more features. An expectation is used by attaching it to a feature group, as shown in the next sections.

expectation_sales = fs.create_expectation("sales",
                                          description="min and max sales limits",
                                          features=["salary", "commission"], 
                                          rules=[Rule(name="HAS_MIN", level="WARNING", min=0), Rule(name="HAS_MAX", level="ERROR", max=1000000)])
expectation_sales.save()

expectation_year = fs.create_expectation("year",
                                         features=["year"], 
                                         description="validate year correctness",
                                         rules=[Rule(name="HAS_MIN", level="ERROR", min=2018), Rule(name="HAS_MAX", level="WARNING", max=2021)])

expectation_year.save()
expectation.rules[0].to_dict(){'name': 'HAS_MIN', 'level': 'WARNING', 'min': 0, 'max': None, 'pattern': None, 'acceptedType': None, 'legalValues': None}
ExpectationsApi.expectation.to_dict(){'name': 'sales', 'description': 'min and max sales limits', 'features': ['salary', 'commission'], 'rules': [<hsfs.rule.Rule object at 0x7fa9efe54850>, <hsfs.rule.Rule object at 0x7fa9efe54690>]}
ExpectationsApi.expectation.rules[0].to_dict(){'name': 'HAS_MIN', 'level': 'WARNING', 'min': 0, 'max': None, 'pattern': None, 'acceptedType': None, 'legalValues': None}
ExpectationsApi.expectation.payload{"name": "sales", "description": "min and max sales limits", "features": ["salary", "commission"], "rules": [{"name": "HAS_MIN", "level": "WARNING", "min": 0, "max": null, "pattern": null, "acceptedType": null, "legalValues": null}, {"name": "HAS_MAX", "level": "ERROR", "min": null, "max": 1000000, "pattern": null, "acceptedType": null, "legalValues": null}]}
expectation.rules[0].to_dict(){'name': 'HAS_MIN', 'level': 'ERROR', 'min': 2018, 'max': None, 'pattern': None, 'acceptedType': None, 'legalValues': None}
ExpectationsApi.expectation.to_dict(){'name': 'year', 'description': 'validate year correctness', 'features': ['year'], 'rules': [<hsfs.rule.Rule object at 0x7fa9e0f779d0>, <hsfs.rule.Rule object at 0x7fa9efe7ad50>]}
ExpectationsApi.expectation.rules[0].to_dict(){'name': 'HAS_MIN', 'level': 'ERROR', 'min': 2018, 'max': None, 'pattern': None, 'acceptedType': None, 'legalValues': None}
ExpectationsApi.expectation.payload{"name": "year", "description": "validate year correctness", "features": ["year"], "rules": [{"name": "HAS_MIN", "level": "ERROR", "min": 2018, "max": null, "pattern": null, "acceptedType": null, "legalValues": null}, {"name": "HAS_MAX", "level": "WARNING", "min": null, "max": 2021, "pattern": null, "acceptedType": null, "legalValues": null}]}
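To make the relationship between features and rules concrete, here is a minimal sketch in plain Python of how an expectation fans out into one check per feature and rule. It is not how Hopsworks or Deequ implement validation. It assumes that HAS_MIN passes when the observed minimum is at least the configured min and that HAS_MAX passes when the observed maximum is at most the configured max, which is consistent with the validation results shown later in this notebook. RuleSketch and evaluate are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleSketch:
    """Hypothetical stand-in for an hsfs Rule; only HAS_MIN and HAS_MAX are modelled."""
    name: str
    level: str
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def evaluate(rows, features, rules):
    """Apply every rule to every feature and return one result per combination."""
    results = []
    for feature in features:
        values = [row[feature] for row in rows]
        for rule in rules:
            if rule.name == "HAS_MIN":
                observed = min(values)
                passed = observed >= rule.min_value  # assumed semantics
            elif rule.name == "HAS_MAX":
                observed = max(values)
                passed = observed <= rule.max_value  # assumed semantics
            else:
                raise ValueError(f"rule not covered by this sketch: {rule.name}")
            results.append({
                "feature": feature,
                "rule": rule.name,
                "level": rule.level,
                "value": observed,
                "status": "SUCCESS" if passed else "FAILURE",
            })
    return results


# Salary and commission values from the bulk insert rows.
rows = [
    {"salary": 110499.73, "commission": 0.0},
    {"salary": 140893.77, "commission": 0.0},
    {"salary": 119159.65, "commission": 0.0},
    {"salary": 20000.0, "commission": 52593.63},
]
sales_rules = [
    RuleSketch("HAS_MIN", "WARNING", min_value=0),
    RuleSketch("HAS_MAX", "ERROR", max_value=1000000),
]
sales_results = evaluate(rows, ["salary", "commission"], sales_rules)
for result in sales_results:
    print(result)
```

Two features and two rules produce four results, which matches the four entries under the sales expectation in the validation output.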

Discover Feature Store Expectations

Using the Python API you can easily find out which expectations are available in this feature store.

# Get all Feature Store expectations
fs_expectations = fs.get_expectations()
[print(expectation.to_dict()) for expectation in fs_expectations]
{'name': 'year', 'description': 'validate year correctness', 'features': ['year'], 'rules': [{'level': 'ERROR', 'min': 2018.0, 'name': 'HAS_MIN'}, {'level': 'WARNING', 'max': 2021.0, 'name': 'HAS_MAX'}]}
{'name': 'sales', 'description': 'min and max sales limits', 'features': ['salary', 'commission'], 'rules': [{'level': 'WARNING', 'min': 0.0, 'name': 'HAS_MIN'}, {'level': 'ERROR', 'max': 1000000.0, 'name': 'HAS_MAX'}]}
# Get an expectation by its unique name
fs_expectation = fs.get_expectation("year")
print(fs_expectation.to_dict())
{'name': 'year', 'description': 'validate year correctness', 'features': ['year'], 'rules': [{'level': 'ERROR', 'min': 2018.0, 'name': 'HAS_MIN'}, {'level': 'WARNING', 'max': 2021.0, 'name': 'HAS_MAX'}]}

Create feature group with expectations and validation type

Feature store expectations can be attached to and detached from feature groups. This enables ingestion pipelines to validate incoming data against expectations. Expectations can be set when creating a feature group. Later in the notebook we describe the possible validation type values and what they mean for feature group ingestion. For the moment, we initialize the validation type to STRICT.

economy_fg = fs.create_feature_group(
    name="economy_fg_p43",
    description="Hudi Household Economy Feature Group",
    version=1,
    primary_key=["id"],
    partition_key=["year"],
    time_travel_format="HUDI",
    validation_type="STRICT",
    expectations=[expectation_sales, expectation_year],
)

Bulk insert data into the feature group

Since we have not yet saved any data into the newly created feature group, we will use Apache Hudi terminology and Bulk Insert the data. In HSFS this is done by calling the save method.

Data will be validated prior to being persisted into the Feature Store.

economy_fg.save(economy_bulk_insert_df)
<hsfs.feature_group.FeatureGroup object at 0x7fa9efe79f50>

Attach expectations to Feature Groups

Expectations can be attached and detached from feature groups even after the latter are created. If an expectation is attached to a feature group, it will be used when inserted data is validated. An expectation can be attached to multiple feature groups, as long as the expectation’s features exist in that feature group.

# Detach expectation by using its name or the metadata object, example shows the latter
economy_fg.detach_expectation(expectation_year)
# Attach expectation by using its name or the metadata object, example again passes the metadata object
economy_fg.attach_expectation(expectation_year)
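The condition that an expectation's features must exist in the feature group can be checked locally before attaching. The sketch below is plain Python and does not call hsfs; missing_features is a hypothetical helper, and the feature list is copied from economy_fg_schema.

```python
def missing_features(expectation_features, feature_group_features):
    """Return the expectation features that the feature group does not contain."""
    available = set(feature_group_features)
    return [feature for feature in expectation_features if feature not in available]


# Column names from economy_fg_schema.
economy_features = ["id", "salary", "commission", "car", "hvalue", "hyears", "loan", "year"]

print(missing_features(["salary", "commission"], economy_features))  # nothing missing
print(missing_features(["year", "age"], economy_features))           # "age" is not in the feature group
```

An empty list means every feature referenced by the expectation is present in the feature group.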

Validations

You can also validate the dataframe without having to insert the data into a feature group.

economy_fg.validate(economy_bulk_insert_df)
<hsfs.feature_group_validation.FeatureGroupValidation object at 0x7fa9f0067290>

You can retrieve all the validations of a feature group

economy_fg_validations = economy_fg.get_validations()
[print(validation.to_dict()) for validation in economy_fg_validations]
{'validationId': 15, 'validationTime': 1612890460069, 'expectationResults': [{'expectation': {'features': ['salary', 'commission'], 'rules': [{'level': 'WARNING', 'min': 0.0, 'name': 'HAS_MIN'}, {'level': 'ERROR', 'max': 1000000.0, 'name': 'HAS_MAX'}], 'description': 'min and max sales limits', 'name': 'sales'}, 'results': [{'feature': 'salary', 'message': 'Success', 'rule': {'level': 'ERROR', 'max': 1000000.0, 'name': 'HAS_MAX'}, 'status': 'SUCCESS', 'value': '140893.765625'}, {'feature': 'commission', 'message': 'Success', 'rule': {'level': 'ERROR', 'max': 1000000.0, 'name': 'HAS_MAX'}, 'status': 'SUCCESS', 'value': '52593.62890625'}, {'feature': 'salary', 'message': 'Success', 'rule': {'level': 'WARNING', 'min': 0.0, 'name': 'HAS_MIN'}, 'status': 'SUCCESS', 'value': '20000.0'}, {'feature': 'commission', 'message': 'Success', 'rule': {'level': 'WARNING', 'min': 0.0, 'name': 'HAS_MIN'}, 'status': 'SUCCESS', 'value': '0.0'}], 'status': 'SUCCESS'}, {'expectation': {'features': ['year'], 'rules': [{'level': 'ERROR', 'min': 2018.0, 'name': 'HAS_MIN'}, {'level': 'WARNING', 'max': 2021.0, 'name': 'HAS_MAX'}], 'description': 'validate year correctness', 'name': 'year'}, 'results': [{'feature': 'year', 'message': 'Success', 'rule': {'level': 'ERROR', 'min': 2018.0, 'name': 'HAS_MIN'}, 'status': 'SUCCESS', 'value': '2020.0'}, {'feature': 'year', 'message': 'Success', 'rule': {'level': 'WARNING', 'max': 2021.0, 'name': 'HAS_MAX'}, 'status': 'SUCCESS', 'value': '2020.0'}], 'status': 'SUCCESS'}]}
{'validationId': 16, 'validationTime': 1612890570580, 'expectationResults': [{'expectation': {'features': ['salary', 'commission'], 'rules': [{'level': 'WARNING', 'min': 0.0, 'name': 'HAS_MIN'}, {'level': 'ERROR', 'max': 1000000.0, 'name': 'HAS_MAX'}], 'description': 'min and max sales limits', 'name': 'sales'}, 'results': [{'feature': 'salary', 'message': 'Success', 'rule': {'level': 'ERROR', 'max': 1000000.0, 'name': 'HAS_MAX'}, 'status': 'SUCCESS', 'value': '140893.765625'}, {'feature': 'commission', 'message': 'Success', 'rule': {'level': 'ERROR', 'max': 1000000.0, 'name': 'HAS_MAX'}, 'status': 'SUCCESS', 'value': '52593.62890625'}, {'feature': 'salary', 'message': 'Success', 'rule': {'level': 'WARNING', 'min': 0.0, 'name': 'HAS_MIN'}, 'status': 'SUCCESS', 'value': '20000.0'}, {'feature': 'commission', 'message': 'Success', 'rule': {'level': 'WARNING', 'min': 0.0, 'name': 'HAS_MIN'}, 'status': 'SUCCESS', 'value': '0.0'}], 'status': 'SUCCESS'}, {'expectation': {'features': ['year'], 'rules': [{'level': 'ERROR', 'min': 2018.0, 'name': 'HAS_MIN'}, {'level': 'WARNING', 'max': 2021.0, 'name': 'HAS_MAX'}], 'description': 'validate year correctness', 'name': 'year'}, 'results': [{'feature': 'year', 'message': 'Success', 'rule': {'level': 'ERROR', 'min': 2018.0, 'name': 'HAS_MIN'}, 'status': 'SUCCESS', 'value': '2020.0'}, {'feature': 'year', 'message': 'Success', 'rule': {'level': 'WARNING', 'max': 2021.0, 'name': 'HAS_MAX'}, 'status': 'SUCCESS', 'value': '2020.0'}], 'status': 'SUCCESS'}]}

… or retrieve a validation by validation or commit time.

Validation time is the timestamp when the validation started.

Commit time is the time the data was persisted in the time travel enabled feature group.

commit_time = economy_fg.get_validations()[0].commit_time
validation_time = economy_fg.get_validations()[0].validation_time
# Get validation by validation time
validation = economy_fg.get_validations(validation_time=validation_time)[0]
print(validation.to_dict())
{'validationId': 15, 'validationTime': 1612890460069, 'expectationResults': [{'expectation': {'features': ['salary', 'commission'], 'rules': [{'level': 'WARNING', 'min': 0.0, 'name': 'HAS_MIN'}, {'level': 'ERROR', 'max': 1000000.0, 'name': 'HAS_MAX'}], 'description': 'min and max sales limits', 'name': 'sales'}, 'results': [{'feature': 'salary', 'message': 'Success', 'rule': {'level': 'ERROR', 'max': 1000000.0, 'name': 'HAS_MAX'}, 'status': 'SUCCESS', 'value': '140893.765625'}, {'feature': 'commission', 'message': 'Success', 'rule': {'level': 'ERROR', 'max': 1000000.0, 'name': 'HAS_MAX'}, 'status': 'SUCCESS', 'value': '52593.62890625'}, {'feature': 'salary', 'message': 'Success', 'rule': {'level': 'WARNING', 'min': 0.0, 'name': 'HAS_MIN'}, 'status': 'SUCCESS', 'value': '20000.0'}, {'feature': 'commission', 'message': 'Success', 'rule': {'level': 'WARNING', 'min': 0.0, 'name': 'HAS_MIN'}, 'status': 'SUCCESS', 'value': '0.0'}], 'status': 'SUCCESS'}, {'expectation': {'features': ['year'], 'rules': [{'level': 'ERROR', 'min': 2018.0, 'name': 'HAS_MIN'}, {'level': 'WARNING', 'max': 2021.0, 'name': 'HAS_MAX'}], 'description': 'validate year correctness', 'name': 'year'}, 'results': [{'feature': 'year', 'message': 'Success', 'rule': {'level': 'ERROR', 'min': 2018.0, 'name': 'HAS_MIN'}, 'status': 'SUCCESS', 'value': '2020.0'}, {'feature': 'year', 'message': 'Success', 'rule': {'level': 'WARNING', 'max': 2021.0, 'name': 'HAS_MAX'}, 'status': 'SUCCESS', 'value': '2020.0'}], 'status': 'SUCCESS'}]}
# Get validation by commit time
validation = economy_fg.get_validations(commit_time=commit_time)[0]
print(validation.to_dict())
{'validationId': 15, 'validationTime': 1612890460069, 'expectationResults': [{'expectation': {'features': ['salary', 'commission'], 'rules': [{'level': 'WARNING', 'min': 0.0, 'name': 'HAS_MIN'}, {'level': 'ERROR', 'max': 1000000.0, 'name': 'HAS_MAX'}], 'description': 'min and max sales limits', 'name': 'sales'}, 'results': [{'feature': 'salary', 'message': 'Success', 'rule': {'level': 'ERROR', 'max': 1000000.0, 'name': 'HAS_MAX'}, 'status': 'SUCCESS', 'value': '140893.765625'}, {'feature': 'commission', 'message': 'Success', 'rule': {'level': 'ERROR', 'max': 1000000.0, 'name': 'HAS_MAX'}, 'status': 'SUCCESS', 'value': '52593.62890625'}, {'feature': 'salary', 'message': 'Success', 'rule': {'level': 'WARNING', 'min': 0.0, 'name': 'HAS_MIN'}, 'status': 'SUCCESS', 'value': '20000.0'}, {'feature': 'commission', 'message': 'Success', 'rule': {'level': 'WARNING', 'min': 0.0, 'name': 'HAS_MIN'}, 'status': 'SUCCESS', 'value': '0.0'}], 'status': 'SUCCESS'}, {'expectation': {'features': ['year'], 'rules': [{'level': 'ERROR', 'min': 2018.0, 'name': 'HAS_MIN'}, {'level': 'WARNING', 'max': 2021.0, 'name': 'HAS_MAX'}], 'description': 'validate year correctness', 'name': 'year'}, 'results': [{'feature': 'year', 'message': 'Success', 'rule': {'level': 'ERROR', 'min': 2018.0, 'name': 'HAS_MIN'}, 'status': 'SUCCESS', 'value': '2020.0'}, {'feature': 'year', 'message': 'Success', 'rule': {'level': 'WARNING', 'max': 2021.0, 'name': 'HAS_MAX'}, 'status': 'SUCCESS', 'value': '2020.0'}], 'status': 'SUCCESS'}]}
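The validationTime values in the output look like Unix timestamps in milliseconds. Under that assumption, which the notebook itself does not state, they can be converted to readable UTC datetimes with the standard library. millis_to_datetime is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def millis_to_datetime(millis):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    # Adding a timedelta avoids the float rounding that dividing by 1000 can introduce.
    return EPOCH + timedelta(milliseconds=millis)


first_validation = millis_to_datetime(1612890460069)   # validationId 15
second_validation = millis_to_datetime(1612890570580)  # validationId 16

print(first_validation.isoformat())
print(second_validation - first_validation)
```

The difference between the two timestamps shows how far apart the save and the standalone validate call were run.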

Get the status of a validation

print("Validation status: {}".format(validation.status))
Validation status: SUCCESS

Upsert new invalid data into a Feature Group

Now we will try to upsert some invalid data, in which the year feature exceeds the maximum set in the expectation. An error is returned to the client along with the failed expectation.

economy_upsert_data = [
    Row(1, 120499.73, 0.0, "car17", 205000.0, 30, 564724.18, 2022),    #update
    Row(2, 160893.77, 0.0, "car10", 179000.0, 2, 455015.33, 2020),     #update
    Row(5, 93956.32, 0.0, "car15",  135000.0, 1, 458679.82, 2020),     #insert
    Row(6, 41365.43, 52809.15, "car7", 135000.0, 19, 216839.71, 2020), #insert
    Row(7, 94805.61, 0.0, "car17", 135000.0, 23, 233216.07, 2022)      #insert    
]

economy_upsert_df = spark.createDataFrame(economy_upsert_data, economy_fg_schema)

economy_upsert_df.show(5)
+---+---------+----------+-----+--------+------+---------+----+
| id|   salary|commission|  car|  hvalue|hyears|     loan|year|
+---+---------+----------+-----+--------+------+---------+----+
|  1|120499.73|       0.0|car17|205000.0|    30| 564724.2|2022|
|  2|160893.77|       0.0|car10|179000.0|     2|455015.34|2020|
|  5| 93956.32|       0.0|car15|135000.0|     1| 458679.8|2020|
|  6| 41365.43|  52809.15| car7|135000.0|    19| 216839.7|2020|
|  7| 94805.61|       0.0|car17|135000.0|    23|233216.06|2022|
+---+---------+----------+-----+--------+------+---------+----+
# Insert call will fail as invalid data (year feature) is about to be ingested. Error shows the expectation that was not met
economy_fg.insert(economy_upsert_df)
An error was encountered:
Metadata operation error: (url: https://hopsworks.glassfish.service.consul:8182/hopsworks-api/api/project/150/featurestores/98/featuregroups/64/validations). Server response: 
HTTP code: 417, HTTP reason: Expectation Failed, error code: 270149, error msg: Feature group validation checks did not pass, will not persist validation results., user msg: Results: [ExpectationResult{status=Failure, results=[ValidationResult{status=Success, message='Success', value='2020.0', feature='year', rule=Rule{name=HAS_MIN, level=ERROR, min=2018.0, max=null, pattern='null', acceptedType=null, legalValues=null}}, ValidationResult{status=Failure, message='Value: 2022.0 does not meet the constraint requirement! HAS_MAX', value='2022.0', feature='year', rule=Rule{name=HAS_MAX, level=WARNING, min=null, max=2021.0, pattern='null', acceptedType=null, legalValues=null}}], expectation=Expectation{name='year', features=[year], rules=[Rule{name=HAS_MIN, level=ERROR, min=2018.0, max=null, pattern='null', acceptedType=null, legalValues=null}, Rule{name=HAS_MAX, level=WARNING, min=null, max=2021.0, pattern='null', acceptedType=null, legalValues=null}]}}]
Traceback (most recent call last):
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/feature_group.py", line 690, in insert
    write_options,
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/core/feature_group_engine.py", line 93, in insert
    validation = feature_group.validate(feature_dataframe)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/feature_group.py", line 857, in validate
    return self._data_validation_engine.validate(self, dataframe)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/core/data_validation_engine.py", line 115, in validate
    return self._feature_group_validation_api.put(feature_group, validation_python)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/core/validations_api.py", line 49, in put
    data=feature_group_validation.json(),
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/decorators.py", line 35, in if_connected
    return fn(inst, *args, **kwargs)
  File "/srv/hops/anaconda/envs/theenv/lib/python3.7/site-packages/hsfs/client/base.py", line 147, in _send_request
    raise exceptions.RestAPIError(url, response)
hsfs.client.exceptions.RestAPIError: Metadata operation error: (url: https://hopsworks.glassfish.service.consul:8182/hopsworks-api/api/project/150/featurestores/98/featuregroups/64/validations). Server response: 
HTTP code: 417, HTTP reason: Expectation Failed, error code: 270149, error msg: Feature group validation checks did not pass, will not persist validation results., user msg: Results: [ExpectationResult{status=Failure, results=[ValidationResult{status=Success, message='Success', value='2020.0', feature='year', rule=Rule{name=HAS_MIN, level=ERROR, min=2018.0, max=null, pattern='null', acceptedType=null, legalValues=null}}, ValidationResult{status=Failure, message='Value: 2022.0 does not meet the constraint requirement! HAS_MAX', value='2022.0', feature='year', rule=Rule{name=HAS_MAX, level=WARNING, min=null, max=2021.0, pattern='null', acceptedType=null, legalValues=null}}], expectation=Expectation{name='year', features=[year], rules=[Rule{name=HAS_MIN, level=ERROR, min=2018.0, max=null, pattern='null', acceptedType=null, legalValues=null}, Rule{name=HAS_MAX, level=WARNING, min=null, max=2021.0, pattern='null', acceptedType=null, legalValues=null}]}}]
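The error reports the observed maximum (2022.0) but not which rows caused it. One way to find them is to check the rows against the same bounds locally before calling insert. This is a minimal sketch in plain Python over dictionaries, not an hsfs feature; split_by_year is a hypothetical helper, and on a Spark dataframe the same idea would be a filter on the year column.

```python
def split_by_year(rows, min_year=2018, max_year=2021):
    """Separate rows whose year lies inside [min_year, max_year] from those outside it."""
    valid = [row for row in rows if min_year <= row["year"] <= max_year]
    invalid = [row for row in rows if not (min_year <= row["year"] <= max_year)]
    return valid, invalid


# The id and year columns of economy_upsert_data.
upsert_rows = [
    {"id": 1, "year": 2022},
    {"id": 2, "year": 2020},
    {"id": 5, "year": 2020},
    {"id": 6, "year": 2020},
    {"id": 7, "year": 2022},
]

valid_rows, invalid_rows = split_by_year(upsert_rows)
print([row["id"] for row in invalid_rows])  # ids of the rows that break the year bounds
```

Since HAS_MAX is evaluated on the maximum of the whole column, a single row above the limit is enough to fail the expectation.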

Validation type

The validation type determines the validation behavior. Available types are:

  • STRICT: Data validation is performed and data is ingested into the feature group only if the validation status is “SUCCESS”.
  • WARNING: Data validation is performed and data is ingested into the feature group only if the validation status is “WARNING” or “SUCCESS”.
  • ALL: Data validation is performed and data is ingested into the feature group regardless of the validation status.
  • NONE: Data validation is not performed on the feature group.

The validation type can easily be changed for a feature group.
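The gating behavior of the validation types can be summarized as a small decision function. This is a sketch of the rules as described in this notebook, written in plain Python; it is not the Hopsworks implementation, and should_ingest and the two lookup tables are hypothetical names.

```python
# Validation statuses ordered from best to worst.
SEVERITY = {"SUCCESS": 0, "WARNING": 1, "FAILURE": 2}

# Worst status each validation type still accepts.
MAX_ALLOWED = {"STRICT": 0, "WARNING": 1, "ALL": 2}


def should_ingest(validation_type, status):
    """Return True if data with the given validation status would be ingested."""
    if validation_type == "NONE":
        # No validation is performed, so nothing can block ingestion.
        return True
    return SEVERITY[status] <= MAX_ALLOWED[validation_type]


for validation_type in ("STRICT", "WARNING", "ALL", "NONE"):
    accepted = [s for s in SEVERITY if should_ingest(validation_type, s)]
    print(validation_type, accepted)
```

With STRICT, a FAILURE status blocks the insert, as in the failed upsert earlier; switching the feature group to ALL lets the same dataframe through.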

# The previous economy_upsert_df contains invalid data but we still want to persist the data, so we set the validation type from STRICT to ALL
economy_fg.validation_type = "ALL"
# We try to insert the invalid df again
economy_fg.insert(economy_upsert_df)