Feature Ingestion from S3 - Sacramento Housing

Ingest Sacramento Housing data from an S3 bucket into the Feature Store

First, download the sample data from here and upload it into an S3 bucket.

Before starting with the execution, you should also create an S3 storage connector pointing to the bucket where you uploaded the data. You can follow the Storage Connectors documentation to see how to create the storage connector from the feature store UI.
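Once created, the connector can also be looked up programmatically with the hsfs client. A minimal sketch, assuming a connector named sacramento_s3 (the name is a placeholder; use whatever name you gave the connector in the UI):

import hsfs

# Connect to the feature store and fetch the S3 storage connector created in the UI.
# "sacramento_s3" is an illustrative name, not one defined by this tutorial.
connection = hsfs.connection()
fs = connection.get_feature_store()
s3_connector = fs.get_storage_connector("sacramento_s3")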

Import already feature-engineered data from S3

In this section we are going to assume that the feature engineering process has already happened outside Hopsworks. In other words, the data in S3 is already feature engineered and we only want to import it into the feature store to make it available to data scientists.

You first need an IAM Role

You will need an IAM role to be able to read data from an S3 bucket. In Hopsworks, there are two ways of assuming an IAM role for the notebooks/jobs that you run: 1. you can assign an Instance Profile to the Hopsworks cluster when you create it, and all users share its IAM Role; or 2. you can assign multiple IAM Roles to a Hopsworks cluster and then decide which projects, and which users within them, can assume which IAM Role.

Cluster-wide IAM Role

On hopsworks.ai, when you are configuring your Hopsworks cluster, you can select an Instance Profile for Hopsworks - see the screenshot below. All jobs run on Hopsworks can use the IAM Role of this Instance Profile (the Instance Profile is an IAM Role for the instance). That is, all Hopsworks users share the Instance Profile role and the resource access policies attached to that role.

Cluster-wide IAM Profile

Federated IAM Roles (Role Chaining)

You can restrict an IAM Role to be usable only within a specified project. Within that project, you can further restrict which role a user must have to be able to assume the IAM Role - e.g., only Data Owners in the project called Noc-list can assume this IAM role. See details on how to set up multiple IAM Roles (Role Chaining) in our documentation.

import hsfs

# Connect to the Hopsworks Feature Store
connection = hsfs.connection()
fs = connection.get_feature_store()

# You can also read from a bucket with your IAM Role without a storage connector.
# header=True picks up the column names (e.g. latitude, longitude) from the CSV.
df = spark.read.csv("s3a://sacramento_houses_raw/sacramento_houses_raw.csv",
                    header=True, inferSchema=True)
df.show(5)

# Create the feature group metadata and save the dataframe to the feature store
housing_fg = fs.create_feature_group(name="housing_fg",
                                     version=1,
                                     description="FG with Sacramento Housing Data",
                                     primary_key=["latitude", "longitude"],
                                     time_travel_format=None,
                                     statistics_config={"enabled": True, "histograms": True, "correlations": True})
housing_fg.save(df)

In the feature store UI you should now be able to see that the feature group has been created and to browse its schema and statistics. You can now use it to build training datasets.
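As a next step, you can build a training dataset directly from this feature group. A minimal sketch, assuming the hsfs training dataset API and an illustrative dataset name housing_model_data:

# Select all features from the feature group and materialize them as a training dataset
query = housing_fg.select_all()
td = fs.create_training_dataset(name="housing_model_data",
                                version=1,
                                description="Training data built from housing_fg",
                                data_format="csv")
td.save(query)

The resulting training dataset also shows up in the feature store UI, alongside the feature group it was built from.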

Import raw data, do feature engineering and create a feature group

In this section we are going to assume that the data in the S3 bucket is raw data that needs to be feature engineered before it can be used by data scientists to build models.

The Hopsworks feature store relies on Apache Spark to provide a scalable framework for feature engineering. Hopsworks allows users to write both PySpark and Scala code. To learn more about how to work with Spark code in Hopsworks, have a look at the Apache Spark documentation and at the Hopsworks Jupyter documentation.

For the sake of the tutorial, in this section we are going to read the CSV file into a dataframe, convert the type feature from a string into a categorical numerical feature, and write the new feature group to the feature store.

To instruct Spark to read from S3 we build the path to the file in the bucket. Please note the file system scheme: s3a://.
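A minimal sketch of this step is shown below. It assumes the same bucket and file names as above and uses Spark's StringIndexer to encode the type column; the names type_index and housing_engineered_fg are illustrative:

from pyspark.ml.feature import StringIndexer

# Read the raw CSV from the bucket; note the s3a:// scheme in the path
raw_df = spark.read.csv("s3a://sacramento_houses_raw/sacramento_houses_raw.csv",
                        header=True, inferSchema=True)

# Encode the string-valued "type" column as a categorical numerical feature
indexer = StringIndexer(inputCol="type", outputCol="type_index")
engineered_df = indexer.fit(raw_df).transform(raw_df).drop("type")

# Save the engineered features as a new feature group
housing_engineered_fg = fs.create_feature_group(
    name="housing_engineered_fg",
    version=1,
    description="Feature engineered Sacramento Housing data",
    primary_key=["latitude", "longitude"],
    time_travel_format=None,
    statistics_config={"enabled": True, "histograms": True, "correlations": True})
housing_engineered_fg.save(engineered_df)

As before, the new feature group, its schema and its statistics will appear in the feature store UI.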