SageMaker Example

AWS SageMaker and the Hopsworks Feature Store

The Hopsworks Feature Store is an open platform that connects to a wide range of data stores and data science platforms, with comprehensive API support for Python and Spark (Python, Java/Scala). It supports AWS SageMaker both for feature engineering and as your data science platform. You can design and ingest features, browse existing features, and create training datasets as either DataFrames or as files on AWS S3.

Prerequisites

In order to follow this tutorial, you need:

  • Hopsworks Feature Store running on https://hopsworks.ai. You can register for free, with no credit card required, and receive USD 4000 of credits to get started. You can deploy a feature store in either your own AWS account or an Azure account.
  • An existing AWS SageMaker environment.
  • The Feature Store Tour project within Hopsworks. The notebook in this guide is designed to be used in combination with the Feature Store Tour.

Step 1: Configure a Hopsworks API Key

Connecting to the Feature Store from AWS SageMaker requires setting up a Feature Store API key for authentication.

In Hopsworks, (1) click on your username in the top-right corner, select Settings to open the user settings, and then select API keys. (2) Give the key a name and select the job, featurestore, dataset.create and project scopes, then (3) create the key. Copy the key to your clipboard for the next step.

Step 2: Connect from an AWS SageMaker Notebook

To access the Feature Store from an AWS SageMaker notebook, proceed with the following steps to install the Hopsworks Feature Store client called HSFS:

!pip install hsfs[hive]==2.1.4

Note that we are installing the latest version at the time of writing (2.1.4); you should always install the latest minor version that corresponds to the version of your Hopsworks Feature Store, so in this case our Hopsworks instance is running version 2.1. Furthermore, for pure Python clients such as AWS SageMaker, it is important to install HSFS with the [hive] optional extra; Spark clients do not need it.

After successfully installing HSFS, you should be able to connect to the Feature Store from your SageMaker notebook (note: you might need to restart the kernel if you had HSFS installed previously):

import hsfs

connection = hsfs.connection(host="[UUID].cloud.hopsworks.ai",
    project="[project-name]",
    engine="hive",
    api_key_value="[api-key]")

fs = connection.get_feature_store()

Make sure to replace [UUID] with the UUID from the DNS name of your Hopsworks instance, [project-name] with the name of the Hopsworks project that contains your feature store, and [api-key] with the key created in Step 1. Please note that it is not good practice to store the API key in your notebook; instead, store the key safely in a permissions-protected file and use the api_key_file argument to pass the filename to the connection method. Once you are connected, you can get a handle to the feature store with connection.get_feature_store(). If a feature store from another project has been shared with the project you are using, you can also get a handle on that shared feature store through the connection object.
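As a minimal sketch of that safer pattern (the filename hopsworks.key is an assumption for illustration; any permissions-protected file containing the key will do):

import hsfs

# The API key lives in a file with restrictive permissions (e.g. chmod 400),
# so it never appears in the notebook itself.
connection = hsfs.connection(host="[UUID].cloud.hopsworks.ai",
    project="[project-name]",
    engine="hive",
    api_key_file="hopsworks.key")

fs = connection.get_feature_store()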

Step 3: Retrieve features from a Feature Group as a Pandas DataFrame

Entities within the Feature Store are organized hierarchically. At the most granular level are the features themselves. Data engineers ingest the feature data of their organization by creating feature groups. Data scientists can then read selected features from these feature groups to create training datasets for model training, run batch inference with deployed models, or perform online inference by scoring single feature vectors.

Fetch the Feature Group metadata

To fetch a feature group from the Feature Store you will first have to fetch its metadata, which is represented by a FeatureGroup object:

teams = fs.get_feature_group("teams_features", version=1)

Note: This operation is lazy and doesn’t fetch any feature data yet.

You can inspect the metadata of the feature group through its properties:

print(teams.name)
print(teams.created)

Read Feature Group into a Pandas DataFrame

To materialize the feature group's data, call .read(), which returns a Pandas DataFrame:

teams_df = teams.read()
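Since the result is a regular Pandas DataFrame, the usual inspection methods apply, for example:

teams_df.head()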

Fetch an Individual Feature

The idea of the Feature Store is to have pre-computed features available for both training and serving models. The key functionality required to generate training datasets from reusable features consists of feature selection, joins, filters and point-in-time queries. To enable this functionality, HSFS introduces an expressive Query abstraction that provides these operations and guarantees reproducible creation of training datasets from features in the Feature Store.

To read an individual feature, call the .select() method on a feature group with the name of the desired feature:

query = teams.select(["team_budget"])

Most operations performed on a feature group object, such as the .select() above, but also .filter() and .join(), return a Query object. These Query objects can then be combined with further joins or filters, used to create training datasets, or read into a dataframe.

team_budget_df = query.read()

# or preview first ten lines
query.show(10)

Fetch a Set of Features

You can also query a set of features from a feature group:

query = teams.select(["team_id", "team_budget"])

Or simply select all:

query = teams.select_all()

Joins and Advanced Examples

We are going to create a query from three different feature groups with twelve features in total to demonstrate joins.

First, we need references to the two remaining feature groups:

players = fs.get_feature_group("players_features", 1)
attendances = fs.get_feature_group("attendances_features", 1)

Now we can select the desired features and join them. Note how we make use of the on argument for only one of the joins; if no join key is specified, the feature store uses the maximum matching subset of the primary keys of both feature groups.

query = teams.select(["team_budged", "team_position"]).join(players.select_all()) \
                                                      .join(attendances.select(["sum_attendance", "average_attendance"]), on=["team_id"])

Again, we can take a look at the resulting query to verify:

query.show(10)

Step 4: Create a training dataset in your favorite file format using the Feature Store

HSFS comes with an expressive Join API and Query Planner that allows users to join, filter and explore feature groups in order to create training datasets.

From AWS SageMaker, you can use these queries to generate training datasets in your desired file format in the Hopsworks Feature Store or on a cloud storage service of your choice, such as S3, by using a Storage Connector.

Note that we explicitly supply the (schema) version of the feature groups (version=1) when retrieving them, so that other developers can safely evolve the feature groups in newer versions.

We want to extend the previous query, because we want to train a model only for teams with a position of 25 or higher. So we add a filter to the teams feature group.

query = teams.select(["team_budged", "team_position"]).filter(teams.team_position >= 25) \
                                                      .join(players.select_all()) \
                                                      .join(attendances.select(["sum_attendance", "average_attendance"]), on=["team_id"])

You can access the features of a feature group through dot notation, but you can also use the .get_feature() method of the feature group. In comparison to .select(), .get_feature() returns a Feature object, similar to the FeatureGroup metadata object, whereas .select() returns a Query.
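For example, the filter from the query above could equivalently be expressed with .get_feature(); a minimal sketch:

# Retrieve a Feature metadata object and use it in a filter,
# equivalent to teams.team_position >= 25
team_position = teams.get_feature("team_position")
query = teams.select(["team_budget", "team_position"]).filter(team_position >= 25)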

Furthermore, as you can see, feature group joins work similarly to Pandas DataFrame joins. In this case we can omit the join key since both feature groups have the same primary key; however, for more advanced joins there is always the possibility to specify the join key of each side as well as the join type (left, inner, right, outer, etc.) manually.
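As a sketch of such an explicit join, assuming team_id is the key on both sides:

# Specify the join keys of each side and the join type explicitly
query = teams.select_all().join(players.select_all(),
    left_on=["team_id"],
    right_on=["team_id"],
    join_type="left")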

Hopsworks Feature Store supports a variety of storage connectors to materialize your training dataset to different cloud storage systems. Since we are working from AWS SageMaker, a previously configured S3 storage connector can serve as the destination for your training dataset.
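A minimal sketch, assuming a connector named "teams_model_bucket" has been configured in the Feature Store UI:

# Retrieve a previously configured storage connector by name
# ("teams_model_bucket" is an assumed name for illustration); it can be
# passed to fs.create_training_dataset(..., storage_connector=sc) below.
sc = fs.get_storage_connector("teams_model_bucket")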

Similar to feature groups, you can now create the training dataset in your favorite file format, matching the machine learning library you are planning to use; for example, choose 'tfrecord' for TensorFlow. The Feature Store will track all metadata related to your training dataset, even if the training dataset is created outside of Hopsworks.

td = fs.create_training_dataset("teams_model",
    version=1,
    data_format="tfrecord",
    splits={"train": 0.8, "test": 0.2},
    seed=12,
    label=["weekly_sales"])

td.save(query)

To retrieve the training dataset in your training environment, simply get a handle to the dataset and its location, which you can then pass to your reader utilities:

td = fs.get_training_dataset("weekly_sales_model", version=1)
td.location
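You can also read the training dataset, or one of its splits, directly into a DataFrame; a minimal sketch:

# Read the "train" split that was defined when the dataset was created
train_df = td.read(split="train")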

Next Steps

Head over to the documentation and learn more about the capabilities of the HSFS client libraries.