Feature Exploration

HSFS feature exploration

In this notebook we are going to walk through how to use the HSFS library to explore feature groups and features in the Hopsworks Feature Store.

A key goal of the Hopsworks Feature Store is to enable sharing and re-use of features across models and use cases. As such, the HSFS library allows users to join features from different feature groups and use them to create training datasets. Features can also be taken from different feature stores (projects), as long as the user running the notebook has read access to them.

Join

As in the feature_engineering notebook, the first step is to establish a connection with the feature store and retrieve the project's feature store handle.

import hsfs
# Create a connection
connection = hsfs.connection()
# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()
Starting Spark application
SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.

Explore feature groups

You can interact with feature groups as if they were Spark DataFrames. A feature group object has a show() method to display the first n rows, and a read() method to read the content of the feature group into a Spark DataFrame.

The first step for any operation on a feature group is to get its handle from the feature store. The get_feature_group method accepts the name of the feature group and an optional version parameter. If you do not provide a version, the API defaults to version 1.

sales_fg = fs.get_feature_group("sales_fg")
VersionWarning: No version provided for getting feature group `sales_fg`, defaulting to `1`.
sales_fg.show(5)
+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+
|store|sales_last_six_month_store_dep|dept|is_holiday|sales_last_year_store_dep|sales_last_six_month_store|weekly_sales|sales_last_quarter_store_dep|sales_last_month_store|sales_last_quarter_store|sales_last_year_store|      date|sales_last_month_store_dep|
+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+
|   20|                           0.0|  55|     false|                      0.0|                       0.0|    32362.95|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|
|   20|                           0.0|  94|     false|                      0.0|                       0.0|    63787.83|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|
|   20|                           0.0|   2|     false|                      0.0|                       0.0|    85812.69|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|
|   20|                           0.0|  92|     false|                      0.0|                       0.0|   195223.84|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|
|   20|                           0.0|  12|     false|                      0.0|                       0.0|     6527.56|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|
+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+
only showing top 5 rows
sales_df = sales_fg.read()
sales_df.filter("store == 20").show(5)
print(type(sales_df))
+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+
|store|sales_last_six_month_store_dep|dept|is_holiday|sales_last_year_store_dep|sales_last_six_month_store|weekly_sales|sales_last_quarter_store_dep|sales_last_month_store|sales_last_quarter_store|sales_last_year_store|      date|sales_last_month_store_dep|
+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+
|   20|                           0.0|  55|     false|                      0.0|                       0.0|    32362.95|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|
|   20|                           0.0|  94|     false|                      0.0|                       0.0|    63787.83|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|
|   20|                           0.0|   2|     false|                      0.0|                       0.0|    85812.69|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|
|   20|                           0.0|  92|     false|                      0.0|                       0.0|   195223.84|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|
|   20|                           0.0|  12|     false|                      0.0|                       0.0|     6527.56|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|
+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+
only showing top 5 rows

<class 'pyspark.sql.dataframe.DataFrame'>

You can also inspect the metadata of the feature group. For instance, you can list the features the feature group is made of and whether they are primary or partition keys:

print("Name: {}".format(sales_fg.name))
print("Description: {}".format(sales_fg.description))
print("Features:")
features = sales_fg.features
for feature in features:
    print("{:<60} \t Primary: {} \t Partition: {}".format(feature.name, feature.primary, feature.partition))
Name: sales_fg
Description: Sales related features
Features:
store                                                            Primary: True   Partition: False
sales_last_six_month_store_dep                                   Primary: False      Partition: False
dept                                                             Primary: True   Partition: False
is_holiday                                                       Primary: False      Partition: False
sales_last_year_store_dep                                        Primary: False      Partition: False
sales_last_six_month_store                                       Primary: False      Partition: False
weekly_sales                                                     Primary: False      Partition: False
sales_last_quarter_store_dep                                     Primary: False      Partition: False
sales_last_month_store                                           Primary: False      Partition: False
sales_last_quarter_store                                         Primary: False      Partition: False
sales_last_year_store                                            Primary: False      Partition: False
date                                                             Primary: True   Partition: False
sales_last_month_store_dep                                       Primary: False      Partition: False

If you are interested in only a subset of features, you can use the select() method on the feature group object to select a list of features. The result of select() behaves like a feature group, so you can call the .show() or .read() methods on it.

sales_fg.select(['store', 'dept', 'weekly_sales']).show(5)
+-----+----+------------+
|store|dept|weekly_sales|
+-----+----+------------+
|   20|  55|    32362.95|
|   20|  94|    63787.83|
|   20|   2|    85812.69|
|   20|  92|   195223.84|
|   20|  12|     6527.56|
+-----+----+------------+
only showing top 5 rows

If your feature group is available both online and offline, you can use the online option of the show() and read() methods to specify whether you want to read the feature group from online storage.

sales_fg_3 = fs.get_feature_group('sales_fg', 3)

sales_fg_3.select(['store', 'dept', 'weekly_sales']).show(5, online=True)
+-----+----+------------+
|store|dept|weekly_sales|
+-----+----+------------+
|   41|  97|    19861.01|
|   25|  52|     1542.31|
|   40|  42|     5279.49|
|   28|  21|      4933.9|
|   33|  25|        60.5|
+-----+----+------------+
only showing top 5 rows

Filter Feature Groups

If you do not want to read your feature group into a dataframe, you can also filter directly on a FeatureGroup object. Applying a filter to a feature group returns a Query object, which can subsequently be joined with other feature groups or queries, or materialized as a training dataset.

sales_fg.select(['store', 'dept', 'weekly_sales']).filter(sales_fg.weekly_sales >= 50000).show(5)
+-----+----+------------+
|store|dept|weekly_sales|
+-----+----+------------+
|   20|  94|    63787.83|
|   20|   2|    85812.69|
|   20|  92|   195223.84|
|   20|  95|    162621.8|
|   20|   8|    89319.61|
+-----+----+------------+
only showing top 5 rows

Conjunctions of filters can be constructed using the Python bitwise operators & and |, which replace the regular boolean operators and and or when working with feature groups and filters. Because of Python's operator precedence, each comparison needs to be wrapped in parentheses.

sales_fg.select(['store', 'dept', 'weekly_sales']).filter((sales_fg.weekly_sales >= 50000) & (sales_fg.dept == 2)).show(5)
+-----+----+------------+
|store|dept|weekly_sales|
+-----+----+------------+
|   20|   2|    85812.69|
|   20|   2|    67951.33|
|   20|   2|    78321.16|
|   20|   2|    72410.12|
|   20|   2|    81139.43|
+-----+----+------------+
only showing top 5 rows
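The reason & and | work here while the and/or keywords do not is that libraries of this kind overload Python's bitwise operators on their filter objects (and/or cannot be overloaded). A minimal sketch of the pattern, with a hypothetical Filter class that is not HSFS code:

```python
class Filter:
    """Hypothetical stand-in for a filter object that overloads & and |."""

    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        # Invoked for `f1 & f2`; the `and` keyword would bypass this entirely.
        return Filter("({} AND {})".format(self.expr, other.expr))

    def __or__(self, other):
        return Filter("({} OR {})".format(self.expr, other.expr))


combined = Filter("weekly_sales >= 50000") & Filter("dept = 2")
print(combined.expr)  # (weekly_sales >= 50000 AND dept = 2)
```

The parentheses around each comparison in the cell above are required because & binds more tightly than >= and == in Python.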

Join Features and Feature Groups

HSFS provides an API similar to Pandas to join feature groups together and to select features from different feature groups. The easiest query you can write selects all the features of one feature group and joins them with all the features of another feature group.

You can use the select_all() method of a feature group to select all its features. HSFS relies on the Hopsworks feature store to identify which features of the two feature groups to use as joining condition. If you don’t specify anything, Hopsworks will use the largest matching subset of primary keys with the same name.

In the example below, sales_fg has store, dept and date as a composite primary key, while exogenous_fg has only store and date. Hopsworks will therefore use store and date as the joining condition.

sales_fg = fs.get_feature_group('sales_fg')
exogenous_fg = fs.get_feature_group('exogenous_fg')

query = sales_fg.select_all().join(exogenous_fg.select_all())
VersionWarning: No version provided for getting feature group `sales_fg`, defaulting to `1`.
VersionWarning: No version provided for getting feature group `exogenous_fg`, defaulting to `1`.
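The "largest matching subset of primary keys" rule can be illustrated with a small pure-Python sketch; this is only an illustration of the rule, not the actual Hopsworks implementation:

```python
def infer_join_keys(left_primary_keys, right_primary_keys):
    """Illustrative: join on the primary-key columns both sides share."""
    return sorted(set(left_primary_keys) & set(right_primary_keys))


# sales_fg has a composite primary key (store, dept, date);
# exogenous_fg has (store, date) -- the shared subset becomes the join key.
sales_pk = ["store", "dept", "date"]
exogenous_pk = ["store", "date"]
print(infer_join_keys(sales_pk, exogenous_pk))  # ['date', 'store']
```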

You can use the query object to create training datasets (see the training dataset notebook). You can inspect the generated SQL by calling the to_string() method on the query.

print(query.to_string())
SELECT `fg1`.`dept`, `fg1`.`is_holiday`, `fg1`.`sales_last_year_store_dep`, `fg1`.`sales_last_six_month_store`, `fg1`.`store`, `fg1`.`sales_last_six_month_store_dep`, `fg1`.`weekly_sales`, `fg1`.`sales_last_quarter_store_dep`, `fg1`.`sales_last_month_store`, `fg1`.`sales_last_quarter_store`, `fg1`.`sales_last_year_store`, `fg1`.`date`, `fg1`.`sales_last_month_store_dep`, `fg0`.`cpi`, `fg0`.`is_holiday`, `fg0`.`markdown4`, `fg0`.`markdown5`, `fg0`.`unemployment`, `fg0`.`fuel_price`, `fg0`.`markdown1`, `fg0`.`markdown2`, `fg0`.`markdown3`, CASE WHEN `fg0`.`appended_feature` IS NULL THEN 10.0 ELSE `fg0`.`appended_feature` END `appended_feature`, `fg0`.`temperature`
FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg1`
INNER JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `fg0`.`date`

As with feature groups, you can call the show() method to inspect the data before generating a training dataset from it, or call the read() method to get a Spark DataFrame with the result of the query and apply additional transformations to it.

query.show(5)
+----+----------+-------------------------+--------------------------+-----+------------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+-----------+----------+---------+---------+------------+----------+---------+---------+---------+----------------+-----------+
|dept|is_holiday|sales_last_year_store_dep|sales_last_six_month_store|store|sales_last_six_month_store_dep|weekly_sales|sales_last_quarter_store_dep|sales_last_month_store|sales_last_quarter_store|sales_last_year_store|      date|sales_last_month_store_dep|        cpi|is_holiday|markdown4|markdown5|unemployment|fuel_price|markdown1|markdown2|markdown3|appended_feature|temperature|
+----+----------+-------------------------+--------------------------+-----+------------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+-----------+----------+---------+---------+------------+----------+---------+---------+---------+----------------+-----------+
|  55|     false|                      0.0|                       0.0|   20|                           0.0|    32362.95|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|204.2471935|     false|       NA|       NA|       8.187|     2.784|       NA|       NA|       NA|            10.0|      25.92|
|  94|     false|                      0.0|                       0.0|   20|                           0.0|    63787.83|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|204.2471935|     false|       NA|       NA|       8.187|     2.784|       NA|       NA|       NA|            10.0|      25.92|
|   2|     false|                      0.0|                       0.0|   20|                           0.0|    85812.69|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|204.2471935|     false|       NA|       NA|       8.187|     2.784|       NA|       NA|       NA|            10.0|      25.92|
|  92|     false|                      0.0|                       0.0|   20|                           0.0|   195223.84|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|204.2471935|     false|       NA|       NA|       8.187|     2.784|       NA|       NA|       NA|            10.0|      25.92|
|  12|     false|                      0.0|                       0.0|   20|                           0.0|     6527.56|                         0.0|                   0.0|                     0.0|                  0.0|2010-02-05|                       0.0|204.2471935|     false|       NA|       NA|       8.187|     2.784|       NA|       NA|       NA|            10.0|      25.92|
+----+----------+-------------------------+--------------------------+-----+------------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+-----------+----------+---------+---------+------------+----------+---------+---------+---------+----------------+-----------+
only showing top 5 rows

As with the show() and read() methods of a feature group, for a query you can also specify which storage (online or offline) to run the query against.

Select only a subset of features

You can replace the select_all() method with select([]) to pick only a subset of features from each feature group you want to join:

query = sales_fg.select(['store', 'dept', 'weekly_sales'])\
                .join(exogenous_fg.select(['fuel_price']))
query.show(5)
+-----+----+------------+----------+
|store|dept|weekly_sales|fuel_price|
+-----+----+------------+----------+
|   20|  55|    32362.95|     2.784|
|   20|  94|    63787.83|     2.784|
|   20|   2|    85812.69|     2.784|
|   20|  92|   195223.84|     2.784|
|   20|  12|     6527.56|     2.784|
+-----+----+------------+----------+
only showing top 5 rows

Override the joining key

If your feature groups don’t have a primary key, if the key columns have different names, or if you simply want to override the joining key, you can pass it as a parameter to the join.

As in Pandas, if the feature has the same name in both feature groups, you can use the on=[] parameter. If the names differ, use the left_on=[] and right_on=[] parameters:

query = sales_fg.select(['store', 'dept', 'weekly_sales'])\
                .join(exogenous_fg.select(['fuel_price']), on=['date'])
query.show(5)
+-----+----+------------+----------+
|store|dept|weekly_sales|fuel_price|
+-----+----+------------+----------+
|   20|  55|    32362.95|     2.784|
|   20|  55|    32362.95|     2.666|
|   20|  55|    32362.95|     2.572|
|   20|  55|    32362.95|     2.962|
|   20|  55|    32362.95|      2.58|
+-----+----+------------+----------+
only showing top 5 rows
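Note the repeated weekly_sales values above: joining on date alone matches each sales row with every exogenous row for that date, regardless of store, so the join fans out. A minimal pure-Python sketch of this behaviour, using toy data rather than the actual feature values:

```python
def inner_join(left, right, on):
    """Toy inner join over lists of dicts, matching rows on the given keys."""
    rows = []
    for l in left:
        for r in right:
            if all(l[k] == r[k] for k in on):
                merged = dict(l)
                merged.update({k: v for k, v in r.items() if k not in on})
                rows.append(merged)
    return rows


sales = [{"store": 20, "date": "2010-02-05", "weekly_sales": 32362.95}]
exogenous = [
    {"store": 20, "date": "2010-02-05", "fuel_price": 2.784},
    {"store": 21, "date": "2010-02-05", "fuel_price": 2.666},
]

# Joining on ["store", "date"] yields one match; on ["date"] alone, two.
print(len(inner_join(sales, exogenous, on=["store", "date"])))  # 1
print(len(inner_join(sales, exogenous, on=["date"])))           # 2
```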

Overriding the join type

By default, the join type between two feature groups is INNER JOIN. You can override this behavior by passing the join_type parameter to the join method. Valid types are: INNER, LEFT, RIGHT, FULL, CROSS, LEFT_SEMI_JOIN, COMMA.

query = sales_fg.select(['store', 'dept', 'weekly_sales'])\
                .join(exogenous_fg.select(['fuel_price']), join_type="left")

print(query.to_string())
SELECT `fg1`.`store`, `fg1`.`dept`, `fg1`.`weekly_sales`, `fg0`.`fuel_price`
FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg1`
LEFT JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `fg0`.`date`

Join multiple feature groups

You can concatenate as many feature groups as you wish. In the example below the order of execution will be:

(`sales_fg` <> `store_fg`) <> `exogenous_fg`

The join parameters you pass in each join() method call apply to that specific join, which means you can mix left and right joins. Please be aware that currently HSFS does not support nested joins such as:

`sales_fg` <> (`store_fg` <> `exogenous_fg`)

store_fg = fs.get_feature_group("store_fg")

query = sales_fg.select_all()\
                .join(store_fg.select_all())\
                .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']))

print(query.to_string())
SELECT `fg2`.`dept`, `fg2`.`is_holiday`, `fg2`.`sales_last_year_store_dep`, `fg2`.`sales_last_six_month_store`, `fg2`.`store`, `fg2`.`sales_last_six_month_store_dep`, `fg2`.`weekly_sales`, `fg2`.`sales_last_quarter_store_dep`, `fg2`.`sales_last_month_store`, `fg2`.`sales_last_quarter_store`, `fg2`.`sales_last_year_store`, `fg2`.`date`, `fg2`.`sales_last_month_store_dep`, `fg0`.`type`, `fg0`.`size`, `fg0`.`num_depts`, `fg1`.`fuel_price`, `fg1`.`unemployment`, `fg1`.`cpi`
FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg2`
INNER JOIN `demo_fs_meb10000_featurestore`.`store_fg_1` `fg0` ON `fg2`.`store` = `fg0`.`store`
INNER JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg1` ON `fg2`.`date` = `fg1`.`date` AND `fg2`.`store` = `fg1`.`store`
VersionWarning: No version provided for getting feature group `store_fg`, defaulting to `1`.

Use Joins together with Filters

It is possible to use filters in any of the subqueries of a joined query.

query = sales_fg.select_all()\
                .join(store_fg.select_all())\
                .join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']).filter(exogenous_fg.fuel_price <= 2.7)) \
                .filter(sales_fg.weekly_sales >= 50000)

print(query.to_string())
SELECT `fg2`.`dept`, `fg2`.`is_holiday`, `fg2`.`sales_last_year_store_dep`, `fg2`.`sales_last_six_month_store`, `fg2`.`store`, `fg2`.`sales_last_six_month_store_dep`, `fg2`.`weekly_sales`, `fg2`.`sales_last_quarter_store_dep`, `fg2`.`sales_last_month_store`, `fg2`.`sales_last_quarter_store`, `fg2`.`sales_last_year_store`, `fg2`.`date`, `fg2`.`sales_last_month_store_dep`, `fg0`.`type`, `fg0`.`size`, `fg0`.`num_depts`, `fg1`.`fuel_price`, `fg1`.`unemployment`, `fg1`.`cpi`
FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg2`
INNER JOIN `demo_fs_meb10000_featurestore`.`store_fg_1` `fg0` ON `fg2`.`store` = `fg0`.`store`
INNER JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg1` ON `fg2`.`date` = `fg1`.`date` AND `fg2`.`store` = `fg1`.`store`
WHERE `fg2`.`weekly_sales` >= 50000 AND `fg1`.`fuel_price` <= 2.7

Free hand query

With HSFS you can also skip the Hopsworks query constructor entirely and write your own query. This can be useful if you need to express more complex queries for your use case. fs.sql returns a Spark DataFrame.

fs.sql("SELECT * FROM `store_fg_1`")
DataFrame[store: int, type: string, size: int, num_depts: bigint]