Feature Store Tags

Tags

The feature store lets users attach tags to artifacts, such as feature groups or training datasets. Tags are additional metadata attached to your artifacts, so they can be used to enhance full-text search. Adding tags gives an artifact richer, more dynamic metadata that serves both as stored information and as a way to make the artifact easier to discover.

Note: By default Hopsworks makes all metadata searchable. Users can opt out for particular feature stores if they want to keep them private.

A tag is a {key: value} association that provides additional information about the data, for example its geographic origin. This is useful in an organization because it adds context to your data, making it easier to share and discover data and artifacts.

Note: Tagging is only available in the enterprise version.

Tag Schemas

The first step is to define the schemas of the tags that can later be attached to artifacts. These schemas follow https://json-schema.org as a reference and can be seen as types for json values. A schema defines which json values are legal; these can be primitives, objects or arrays. The schemas themselves are also written as json.

Allowed primitive types are:

  • string
  • boolean
  • integer
  • number (float)

A tag schema of primitive type string would look like:

{ "type" : "string" }

and this would allow a json value of:

string tag value
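As a rough illustration of how the four primitive types line up with Python values once a json value has been parsed, the sketch below uses only the standard json module. The helper name json_primitive_type is hypothetical and not part of hsfs. It reports the most specific type only (an integer is also a legal number in json schema), and it checks booleans first because Python's bool is a subclass of int.

```python
import json

def json_primitive_type(value):
    """Return the most specific json schema primitive type name for a parsed value.

    Hypothetical helper for illustration only; it is not part of hsfs.
    """
    # bool must be tested before int because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    return None  # objects, arrays and null are not handled in this sketch

samples = ['"string tag value"', "true", "27", "3.14"]
types = [json_primitive_type(json.loads(text)) for text in samples]
print(types)  # ['string', 'boolean', 'integer', 'number']
```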

We can also define arbitrarily complex json schemas, such as:

{
  "type" : "object", 
  "properties" : 
  {
    "first_name" : { "type" : "string" },
    "last_name" : { "type" : "string" },
    "age" : { "type" : "integer" },
    "hobbies" : { 
        "type" : "array",
        "items" : { "type" : "string" }
    }
  },
  "required" : ["first_name", "last_name", "age"],
  "additionalProperties": false
}

and a value that follows this schema would be:

{ 
  "first_name" : "John",
  "last_name" : "Doe",
  "age" : 27,
  "hobbies" : ["tennis", "reading"]
}

The properties section of a schema is a dictionary that defines field names and types.

Json schemas are fairly lenient: all the properties section says is that if a field appears, it must be of the declared type. If the json object contains the field first_name, that field cannot be of type boolean; it has to be of type string. Note that the properties section neither makes the declared fields mandatory nor prevents the json object from containing other fields that are not declared in the schema.

The required section enforces the mandatory fields. In the example above, first_name, last_name and age are declared as mandatory, while hobbies is left optional.

The additionalProperties section controls the strictness of the schema. If we set it to false, json objects of this schema can only use fields that are declared (mandatory or optional) by the schema. No undeclared fields are allowed.

Type object is the default type for schemas, so you can omit it if you want to keep the schema short.
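To make the interplay of properties, required and additionalProperties concrete, here is a minimal hand-written checker for flat objects. It is only a sketch for building intuition: the function check_object and the PY_TYPES table are hypothetical, it is not the validator Hopsworks uses, and it ignores everything else json schema offers (nested objects, array item types, and so on).

```python
# Python types accepted for each json schema type in this sketch
PY_TYPES = {"string": str, "integer": int, "number": (int, float),
            "boolean": bool, "array": list, "object": dict}

def check_object(value, schema):
    """Return a list of problems found when checking a dict against an object schema.

    Illustrative only: it covers "properties", "required" and
    "additionalProperties" for flat objects, nothing else.
    """
    problems = []
    properties = schema.get("properties", {})
    # "required": every listed field must be present
    for name in schema.get("required", []):
        if name not in value:
            problems.append(f"missing required field: {name}")
    for name, field_value in value.items():
        if name not in properties:
            # "additionalProperties": false rejects undeclared fields
            if schema.get("additionalProperties", True) is False:
                problems.append(f"undeclared field: {name}")
            continue
        # "properties": a field that is present must have the declared type
        expected = PY_TYPES[properties[name]["type"]]
        is_bool = isinstance(field_value, bool)  # bool is an int subclass in Python
        if not isinstance(field_value, expected) or (is_bool and expected is not bool):
            problems.append(f"wrong type for field: {name}")
    return problems

person_schema = {
    "type": "object",
    "properties": {
        "first_name": {"type": "string"},
        "last_name": {"type": "string"},
        "age": {"type": "integer"},
        "hobbies": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["first_name", "last_name", "age"],
    "additionalProperties": False,
}

ok = {"first_name": "John", "last_name": "Doe", "age": 27}  # hobbies is optional
bad = {"first_name": True, "last_name": "Doe", "nickname": "JD"}
print(check_object(ok, person_schema))   # []
print(check_object(bad, person_schema))  # three problems, one per rule
```

The second value breaks each rule once: age is missing, first_name has the wrong type, and nickname is not declared.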

Advanced tag usage

We can use additional properties of schemas as defined by https://json-schema.org to enhance our previous person schema:

  • Add a $schema section to allow us to use more advanced features of the json schemas defined in later drafts. The default schema draft is 4 and we will use draft 7 here.
  • Add an id field that is of type string but has to follow a particular regex pattern. We will also make this field mandatory.
  • Set some rules on age, for example age should be an integer between 0 and 150.
  • Add an address field that is itself an object.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type" : "object", 
  "properties" : 
  {
    "id" : {
      "type" : "string",
      "pattern" : "^[A-Z]{2}[0-9]{4}$"
    },
    "first_name" : { "type" : "string" },
    "last_name" : { "type" : "string" },
    "age" : { 
      "type" : "integer",
      "minimum" : 0,
      "maximum" : 150
    },
    "hobbies" : { 
        "type" : "array",
        "items" : { "type" : "string" }
    },
    "address" : {
      "type" : "object",
      "properties" : {
        "street" : { "type" : "string" },
        "city" : { "type" : "string" }
      }
    }
  },
  "required" : ["id", "first_name", "last_name", "age"],
  "additionalProperties": false
}

and a valid value for this new schema would be:

{
  "id" : "AB1234",
  "first_name" : "John",
  "last_name" : "Doe",
  "age" : 27,
  "hobbies" : ["tennis", "reading"],
  "address" : {
    "street" : "Vasagatan nr. 12",
    "city" : "Stockholm"
  }
}
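The two new constraints can be tried out locally with plain Python. This is a sketch under assumptions: json schema patterns follow the ECMA 262 regex dialect, which Python's re module only approximates (the simple pattern used here behaves the same in both), and the helper names id_matches and age_in_range are hypothetical.

```python
import re

ID_PATTERN = "^[A-Z]{2}[0-9]{4}$"  # the pattern used for the id field in the schema

def id_matches(value):
    # A json schema "pattern" is a search, not a full match, hence the explicit ^ and $
    return re.search(ID_PATTERN, value) is not None

def age_in_range(value, minimum=0, maximum=150):
    # "minimum" and "maximum" are inclusive bounds
    return minimum <= value <= maximum

print(id_matches("AB1234"), id_matches("ab1234"), id_matches("AB12345"))  # True False False
print(age_in_range(27), age_in_range(150), age_in_range(151))            # True True False
```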

Notebook required schema setup

In order for this notebook to work properly, a user with admin rights needs to define the following schemas.

  • Primitive string
    • name: location
    • value: {"type":"string"}
  • Complex object
    • name: person
    • value: {"$schema":"http://json-schema.org/draft-07/schema#", "type":"object","properties":{"id":{"type":"string","pattern":"^[A-Z]{2}[0-9]{4}$"},"first_name":{"type":"string"},"last_name":{"type":"string"},"age":{"type":"integer","minimum":0,"maximum":150},"hobbies":{"type":"array","items":{"type":"string"}},"address":{"type":"object","properties":{"street":{"type":"string"},"city":{"type":"string"}}}},"required":["id","first_name","last_name","age"],"additionalProperties":false}
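Since these single-line schema values are easy to break when copying, it can help to check that the text parses as json before handing it to an admin. The sketch below uses only the standard json module; schema_summary is a hypothetical helper, and it checks well-formedness only, not whether the text is a meaningful json schema.

```python
import json

def schema_summary(schema_text):
    """Parse schema text and return (is_valid_json, top_level_type)."""
    try:
        schema = json.loads(schema_text)
    except json.JSONDecodeError:
        return (False, None)
    top_type = schema.get("type") if isinstance(schema, dict) else None
    return (True, top_type)

print(schema_summary('{"type":"string"}'))  # (True, 'string')
print(schema_summary('{"type":"string"'))   # (False, None), closing brace is missing
```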

UI operations

Creating schemas is currently only possible from the UI by a user with the admin role, since the schemas are defined cluster-wide.

From the Hopsworks UI you can also attach and view tags, as well as search for artifacts by their tag contents. For more details on UI operations visit our documentation page: https://docs.hopsworks.ai/latest/generated/tags

References

For more references on the schemas check:

  • Our documentation on https://docs.hopsworks.ai/latest/generated/tags
  • The reference https://json-schema.org in order to figure out the full capabilities of json schemas

Notebook tour

Featurestore name

Change the name of the featurestore according to the project you are running from. The example was written within the project named demo_fs_meb10000, which is the feature store demo tour.

import hsfs
connection = hsfs.connection()
fs = connection.get_feature_store(name="demo_fs_meb10000_featurestore")
Connected. Call `.close()` to terminate connection gracefully.

Creating a feature group and a training dataset

The cells that create the feature group and the training dataset might fail if the artifacts already exist from a previous run of this notebook.
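One possible way to make such creation steps tolerant of reruns is a small get-or-create wrapper. The sketch below is generic and runs without a feature store: get_or_create is a hypothetical helper, not part of hsfs, and the exception type raised by a failed lookup is an assumption that has to be supplied by the caller.

```python
def get_or_create(get, create, not_found=Exception):
    """Return get() if it succeeds, otherwise fall back to create().

    Hypothetical helper, not part of hsfs. `not_found` is the exception
    type that the lookup is assumed to raise when the artifact is missing.
    """
    try:
        return get()
    except not_found:
        return create()

# Stand-in registry so the sketch runs without a feature store
registry = {}

def lookup(name):
    return registry[name]  # raises KeyError when the name is unknown

def make(name):
    registry[name] = f"artifact:{name}"
    return registry[name]

first = get_or_create(lambda: lookup("tag_fg"), lambda: make("tag_fg"), KeyError)
second = get_or_create(lambda: lookup("tag_fg"), lambda: make("tag_fg"), KeyError)
print(first, second, len(registry))  # artifact:tag_fg artifact:tag_fg 1
```

In this notebook, get would wrap the fs.get_feature_group call and create would wrap the create-and-save calls. Which exception the installed hsfs version raises for a missing artifact should be checked before relying on this.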

fg_name = 'tag_fg'
td_name = 'tag_td'

Create the feature group used in this notebook to attach tags to.

fg_data = []
fg_data.append((1, 1, 1))
fg_spark_df = spark.createDataFrame(fg_data, ['id', 'fg1_col1', 'fg1_col2'])
fg_write = fs.create_feature_group(name=fg_name, version=1, description="tags notebook feature group", primary_key=['id'], time_travel_format=None, statistics_config=False)
fg_write.save(fg_spark_df)
<hsfs.feature_group.FeatureGroup object at 0x7f5f34df3a50>
fg_read = fs.get_feature_group(fg_name)
VersionWarning: No version provided for getting feature group `tag_fg`, defaulting to `1`.

Create the training dataset used in this notebook to attach tags to.

td_query = fg_read.select_all()
td = fs.create_training_dataset(name=td_name, description="tags notebook training dataset", data_format="csv", version=1)
td.save(td_query)
<hsfs.training_dataset.TrainingDataset object at 0x7f5f34e35850>
td_read = fs.get_training_dataset(td_name, 1)

Working with tags on feature groups

Attaching tags

Attaching a simple key-value (string) tag to your feature group.

Note: You can only attach one tag value per tag name, so calling the add operation on the same tag multiple times performs an update. If you need to attach multiple values to a tag, for example a sequence, consider changing the tag type to an array of the type you just defined.

tag1_name="location"
tag1_value="Sweden"
fg_read.add_tag(tag1_name, tag1_value)
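For the multi-value case mentioned in the note above, the tag schema would be an array of strings rather than a single string. The sketch below shows what such a schema and value could look like. The tag name locations is hypothetical and is not one of the schemas this notebook requires, so an admin would have to define it first; is_string_array is likewise a hypothetical local check.

```python
import json

# Hypothetical schema for a tag named "locations" (not defined in this notebook's setup)
locations_schema = {"type": "array", "items": {"type": "string"}}

# One tag value that carries several entries
locations_value = ["Sweden", "Norway"]

def is_string_array(value):
    # Minimal check mirroring the schema: a list whose items are all strings
    return isinstance(value, list) and all(isinstance(item, str) for item in value)

print(json.dumps(locations_schema))  # the schema text an admin would register
print(is_string_array(locations_value), is_string_array("Sweden"))  # True False
```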

Listing tags

To read a tag value, use the tag key.

fg_read.get_tag(tag1_name)
'Sweden'

Reading all the tags attached to a feature group.

fg_read.get_tags()
{'location': 'Sweden'}

Deleting tags

fg_read.delete_tag(tag1_name)

The tag is no longer in the list of attached tags, but it can be re-attached at a later time.

fg_read.get_tags()
{}

Using tags with more complex values

Attaching a simple json object tag.

tag2_name="person"
tag2_value={
  "id" : "AB1234",
  "first_name" : "John",
  "last_name" : "Doe",
  "age" : 27,
  "hobbies" : ["tennis", "reading"],
  "address" : {
    "street" : "Vasagatan nr. 12",
    "city" : "Stockholm"
  }
}

fg_read.add_tag(tag2_name, tag2_value)
fg_read.get_tag(tag2_name)
{'id': 'AB1234', 'first_name': 'John', 'last_name': 'Doe', 'age': 27, 'hobbies': ['tennis', 'reading'], 'address': {'street': 'Vasagatan nr. 12', 'city': 'Stockholm'}}

Working with tags on training datasets

The API calls for attaching, reading and deleting tags are exactly the same on training datasets as they are on feature groups.
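Because the calls are the same, a helper written against add_tag, get_tags and delete_tag should work on either kind of artifact. The sketch below illustrates this with a hypothetical sync_tags function and an in-memory stand-in class, so it runs without a feature store; neither name is part of hsfs.

```python
def sync_tags(artifact, desired):
    """Make the tags attached to `artifact` equal to the `desired` dict.

    Hypothetical helper, not part of hsfs. It relies only on the
    add_tag, get_tags and delete_tag calls used in this notebook.
    """
    for name in artifact.get_tags():
        if name not in desired:
            artifact.delete_tag(name)
    for name, value in desired.items():
        artifact.add_tag(name, value)  # adding an existing name updates it

class FakeArtifact:
    """In-memory stand-in so the sketch runs without a feature store."""
    def __init__(self):
        self._tags = {}
    def add_tag(self, name, value):
        self._tags[name] = value
    def get_tags(self):
        return dict(self._tags)  # return a copy, so callers can delete while iterating
    def delete_tag(self, name):
        del self._tags[name]

fake = FakeArtifact()
fake.add_tag("old", "x")
sync_tags(fake, {"location": "Sweden"})
print(fake.get_tags())  # {'location': 'Sweden'}
```

With real artifacts the same function could be called with either fg_read or td_read as the first argument.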

Attaching tags

td_read.add_tag(tag1_name, tag1_value)

Listing tags

td_read.get_tags()
{'location': 'Sweden'}
td_read.get_tag(tag1_name)
'Sweden'

Deleting tags

td_read.delete_tag(tag1_name)
td_read.get_tags()
{}

Cleaning up

If you want to be able to rerun the notebook with no failed paragraphs, you will need to delete the feature group tag_fg and the training dataset tag_td.
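A cleanup step could be scripted along the following lines. This is a sketch under a stated assumption: each artifact object is assumed to expose a no-argument delete() method, which should be verified against the installed hsfs version. The helper delete_all and the stand-in class are hypothetical and only there so the example runs on its own.

```python
def delete_all(artifacts):
    """Call delete() on each artifact, in order, and return how many were deleted.

    Assumes every object has a no-argument delete() method (an assumption
    to verify for the hsfs version in use).
    """
    deleted = 0
    for artifact in artifacts:
        artifact.delete()
        deleted += 1
    return deleted

class FakeArtifact:
    """Stand-in that records whether delete() was called."""
    def __init__(self, name):
        self.name = name
        self.deleted = False
    def delete(self):
        self.deleted = True

fake_td = FakeArtifact("tag_td")
fake_fg = FakeArtifact("tag_fg")
count = delete_all([fake_td, fake_fg])
print(count, fake_td.deleted, fake_fg.deleted)  # 2 True True
```

In this notebook the call would take td_read and fg_read, and it would have to run before the connection is closed.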

connection.close()
Connection closed.