Azure ML Feature Store Tour

Azure ML and the Hopsworks Feature Store

The Hopsworks Feature Store is an open platform that connects to the largest number of data stores and data science platforms, with the most comprehensive API support: Python and Spark (Python, Java/Scala). It supports Azure ML Studio Notebooks or Designer for feature engineering and as your data science platform. You can design and ingest features, browse existing features, and create training datasets as either DataFrames or as files on Azure Blob storage.

Prerequisites

In order to follow this tutorial, you need:

  • Hopsworks Feature Store running on https://hopsworks.ai. You can register for free with no credit card and receive $4000 USD of credits to get started. You can deploy a feature store in either your own Azure account or in an AWS account.
  • An existing ML Studio environment with an attached compute cluster. You can upload this notebook and attach your compute to it.
  • If you want to follow this tutorial with the same data, make sure to upload these files to your ML Studio environment.
  • A project created within Hopsworks. If you don’t have one yet, you can simply follow the Feature Store tour that creates a sample project for you.

Step 1: Configure a Hopsworks API Key

Connecting to the Feature Store from Azure ML requires setting up a Feature Store API key for authentication.

In Hopsworks, (1) click on your username in the top-right corner and select Settings to open the user settings, then select API keys; (2) give the key a name and select the job, featurestore, dataset.create and project scopes before (3) creating the key. Copy the key to your clipboard for the next step.
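If you prefer not to paste the key into a notebook later (see Step 2), you can store it in a permission-protected file on your compute instance. A minimal sketch, assuming a hypothetical path of your choosing:

import os

# Store the API key in a file readable only by you (path is illustrative)
key_path = "/home/azureuser/.hopsworks/api.key"  # hypothetical location
os.makedirs(os.path.dirname(key_path), exist_ok=True)
with open(key_path, "w") as f:
    f.write("<paste-your-api-key-here>")
os.chmod(key_path, 0o600)  # owner read/write only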

Step 2: Connect from an Azure Machine Learning Notebook

To access the Feature Store from Azure Machine Learning, proceed with the following steps to install the Hopsworks Feature Store client called HSFS:

!pip install hsfs[hive]

Note that we are installing the latest version at the time of writing (2.1.4). You should always install the latest minor version that corresponds to the version of your Hopsworks Feature Store; in this case our Hopsworks instance is running version 2.1. Furthermore, for Python clients (such as Azure ML), it is important to install HSFS with the [hive] optional extra. Spark clients do not need this.
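For example, to stay on the 2.1 series, you can pin the minor version (a sketch; match the version to your own Hopsworks instance):

!pip install "hsfs[hive]~=2.1.0"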

After successfully installing HSFS, you should be able to connect to the Feature Store from your Azure ML notebook (note: you might need to restart the kernel if you had HSFS previously installed):

import hsfs

# TODO: replace the values below: [UUID], [project-name], [api-key]
connection = hsfs.connection(host="[UUID].cloud.hopsworks.ai",
    project="[project-name]",
    engine="hive",
    api_key_value="[api-key]")

fs = connection.get_feature_store()

Make sure to replace [UUID] with the UUID from the DNS name of your Hopsworks instance, [project-name] with the Hopsworks project that contains your feature store, and [api-key] with the key created in Step 1. Please note that it is not good practice to store the API key in your notebook; instead, store the key safely in a permission-protected file and use the api_key_file argument to pass the filename to the connection method.

Once you are connected, you can get a handle to the feature store with connection.get_feature_store(). If the project you have connected to also contains a shared feature store (it is possible to have a feature store from another project shared with the project you are using), you can also get a handle on the shared feature store using the connection object.
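A minimal sketch of both options, assuming the key file from Step 1 and a hypothetical name for the shared feature store:

import hsfs

# Read the API key from a permission-protected file instead of
# embedding it in the notebook (path is illustrative)
connection = hsfs.connection(host="[UUID].cloud.hopsworks.ai",
    project="[project-name]",
    engine="hive",
    api_key_file="/home/azureuser/.hopsworks/api.key")

fs = connection.get_feature_store()

# Handle to a feature store shared from another project
# (hypothetical feature store name)
shared_fs = connection.get_feature_store(name="other_project_featurestore")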

Step 3: Ingest data from a Pandas dataframe to the Feature Store

You can simply upload some data in your favourite file format to the Azure ML workspace, or you can configure a Hopsworks Storage Connector to cloud storage or a database. The Storage Connector safely stores endpoints and credentials to external stores or databases, making it easier for Data Scientists to retrieve data from them.
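For example, once a connector has been configured in the Hopsworks UI, you can retrieve it by name (a sketch; the connector name is illustrative):

# hypothetical connector name, configured beforehand in Hopsworks
sc = fs.get_storage_connector("my_adls_connector")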

If you opted to upload the data as CSV files, as shown below, simply read it into a pandas dataframe:

import pandas as pd

sales_csv = pd.read_csv("sales data-set.csv")
sales_csv

Now, we can perform some feature engineering based on the pandas dataframe. We would like to predict the weekly sales of a department, so let’s create our target feature by selecting the last week available for each department:

sales_csv["date"] = pd.to_datetime(sales_csv["date"])
sales_csv.sort_values(["store", "dept", "date"], inplace=True)
target_df = sales_csv.groupby(["store", "dept"]).last().reset_index()
target_df

We can create this as a feature group, also containing the is_holiday feature; since this information will be available at prediction time, there is no risk of data leakage.

fg_target = fs.create_feature_group("weekly_sales_target_hudi",
    version=1,
    description="containing the latest weekly sales of each store/department",
    primary_key=["store", "dept"],
    time_travel_format="HUDI",
    statistics_config={"enabled": True, "correlations": True, "histograms": True, "exact_uniqueness": True})

fg_target.save(target_df)

Let’s now create a few simple features based on the historical sales of each department:

import numpy as np

df = pd.merge(sales_csv, target_df[["store", "dept", "date"]], on=["store", "dept"], how="left")
# keep only the historical weeks, i.e. everything before the target week
hist_df = df[df["date_x"] != df["date_y"]].copy()
hist_df["holiday_flag"] = hist_df["is_holiday"].apply(lambda x: 1 if x else 0)
hist_df["non_holiday_flag"] = hist_df["is_holiday"].apply(lambda x: 0 if x else 1)
hist_df["holiday_week_sales"] = hist_df["holiday_flag"] * hist_df["weekly_sales"]
hist_df["non_holiday_week_sales"] = hist_df["non_holiday_flag"] * hist_df["weekly_sales"]
total_features = hist_df.groupby(["store", "dept"]).agg(
    {"weekly_sales": [sum, np.mean],
     "date_x": pd.Series.nunique,
     "holiday_week_sales": sum,
     "non_holiday_week_sales": sum})
total_features.columns = ["_".join(col).strip() for col in total_features.columns.values]
total_features.reset_index(inplace=True)

And again, we finish by creating a feature group with this dataframe and saving it to the feature store:

weekly_sales_total = fs.create_feature_group("weekly_sales_total_hudi",
    version=1,
    description="containing the total historical sales and weekly average of each store/department",
    primary_key=["store", "dept"],
    time_travel_format="HUDI",
    statistics_config={"enabled": True, "correlations": True, "histograms": True, "exact_uniqueness": True})

weekly_sales_total.save(total_features)

Note: If you have existing feature engineering notebooks that you would like to reuse with the Hopsworks Feature Store, it should be enough to simply add the two calls (create the Feature Group and save the dataframe to it) in order to ingest your features to the Feature Store, as sketched below. No other changes are required in your existing programs and you can still use your favourite Python libraries for feature engineering.
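A minimal sketch of those two calls appended to an existing notebook (the feature group name, primary key, and dataframe are illustrative):

# `features_df` stands for the output of your existing feature engineering
fg = fs.create_feature_group("my_existing_features",  # hypothetical name
    version=1,
    primary_key=["id"],
    time_travel_format="HUDI")
fg.save(features_df)

With these two feature groups we can move to the next step and create a training dataset. Since we did not disable statistics computation, you can head to the Hopsworks Feature Store and inspect the pre-computed statistics over the newly created feature groups.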

Step 4: Create a training dataset in your favorite file format using the Feature Store

HSFS comes with an expressive Join API and Query Planner that allows users to join, filter and explore feature groups in order to create training datasets. Assuming you start with a new Jupyter Notebook, the first commands you need to run are to get handles to the previously created feature groups:

target_fg = fs.get_feature_group("weekly_sales_target_hudi", version=1)
sales_fg = fs.get_feature_group("weekly_sales_total_hudi", version=1)

Note that we explicitly supply the (schema) version of the feature group (version=1), so that other developers can safely evolve the feature groups in higher-numbered versions. With our two feature group objects, we would like to join the target feature with our historical features, but only select for our training dataset those departments that have a full history of 142 weeks available:

td_query = target_fg.select(["weekly_sales", "is_holiday"]) \
    .join(sales_fg.filter(sales_fg.date_x_nunique == 142))

td_query.show(5)
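For more advanced joins, the Query API also lets you specify the join keys from each feature group and the join type explicitly. A sketch, using the on and join_type parameters of the join method:

# explicit join keys and join type (illustrative)
td_query_explicit = target_fg.select(["weekly_sales", "is_holiday"]) \
    .join(sales_fg.select_all(), on=["store", "dept"], join_type="left")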

As you can see, feature group joins work similarly to pandas dataframe joins. In the original query we could omit the join key since both feature groups share the same primary key, while the explicit variant above spells out the join key from each feature group as well as the join type (left, inner, right, outer, etc.).

Hopsworks Feature Store supports a variety of storage connectors to materialize your training dataset to different cloud storage systems. If you have previously configured an Azure Data Lake Storage connector, you can now use it as the destination for your training dataset:

# storage = fs.get_storage_connector("ADLS")

Similar to feature groups, you can now create the training dataset in your favourite file format, matching the machine learning library you are planning to use - for example, choose ‘tfrecord’ for TensorFlow. The Feature Store will make sure to track all metadata related to your training dataset, even if the training dataset is created outside of Hopsworks.

td = fs.create_training_dataset("weekly_sales_model",
    version=1,
    data_format="tfrecord",
    splits={"train": 0.8, "test": 0.2},
    seed=12,
    #storage_connector=storage,
    label=["weekly_sales"])

td.save(td_query)

To retrieve the training dataset in your training environment, you can simply get a handle to the dataset and its location, which you can then pass to your reader utilities:

td = fs.get_training_dataset("weekly_sales_model", version=1)
td.location
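Alternatively, HSFS can read a split of the training dataset straight into a dataframe (a sketch; read returns the data in the dataframe type of the configured engine):

# read the train split as a pandas dataframe (hive engine)
train_df = td.read(split="train")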

Next Steps

Head over to the documentation and learn more about the capabilities of the HSFS client libraries.