1. Feature Exploration
HSFS feature exploration
In this notebook we are going to walk through how to use the HSFS library to explore feature groups and features in the Hopsworks Feature Store.
A key component of the Hopsworks feature store is to enable sharing and re-use of features across models and use cases. As such, the HSFS library allows users to join features from different feature groups and use them to create training datasets. Features can also be taken from different feature stores (projects), as long as the user running the notebook has read access to them (see the sketch after the connection example below).
As in the `feature_engineering` notebook, the first step is to establish a connection with the feature store and retrieve the project's feature store handle.
import hsfs
# Create a connection
connection = hsfs.connection()
# Get the feature store handle for the project's feature store
fs = connection.get_feature_store()
Starting Spark application
ID | YARN Application ID | Kind | State | Spark UI | Driver log |
---|---|---|---|---|---|
28 | application_1607613578055_0015 | pyspark | idle | Link | Link |
SparkSession available as 'spark'.
Connected. Call `.close()` to terminate connection gracefully.
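If you have read access to a feature store shared from another project, you can retrieve a handle to it by name as well. A minimal sketch; the feature store name below is illustrative:
# Hypothetical: a feature store shared from another project
other_fs = connection.get_feature_store("other_project_featurestore")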
Explore feature groups
You can interact with feature groups as if they were Spark dataframes. A feature group object has a `show()` method to display the first `n` rows, and a `read()` method to read the content of the feature group into a Spark dataframe.
The first step to do any operation on a feature group is to get its handle from the feature store. The `get_feature_group` method accepts the name of the feature group and an optional parameter with the version you want to select. If you do not provide a version, the API defaults to version 1.
sales_fg = fs.get_feature_group("sales_fg")
VersionWarning: No version provided for getting feature group `sales_fg`, defaulting to `1`.
sales_fg.show(5)
+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+
|store|sales_last_six_month_store_dep|dept|is_holiday|sales_last_year_store_dep|sales_last_six_month_store|weekly_sales|sales_last_quarter_store_dep|sales_last_month_store|sales_last_quarter_store|sales_last_year_store| date|sales_last_month_store_dep|
+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+
| 20| 0.0| 55| false| 0.0| 0.0| 32362.95| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|
| 20| 0.0| 94| false| 0.0| 0.0| 63787.83| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|
| 20| 0.0| 2| false| 0.0| 0.0| 85812.69| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|
| 20| 0.0| 92| false| 0.0| 0.0| 195223.84| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|
| 20| 0.0| 12| false| 0.0| 0.0| 6527.56| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|
+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+
only showing top 5 rows
sales_df = sales_fg.read()
sales_df.filter("store == 20").show(5)
print(type(sales_df))
+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+
|store|sales_last_six_month_store_dep|dept|is_holiday|sales_last_year_store_dep|sales_last_six_month_store|weekly_sales|sales_last_quarter_store_dep|sales_last_month_store|sales_last_quarter_store|sales_last_year_store| date|sales_last_month_store_dep|
+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+
| 20| 0.0| 55| false| 0.0| 0.0| 32362.95| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|
| 20| 0.0| 94| false| 0.0| 0.0| 63787.83| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|
| 20| 0.0| 2| false| 0.0| 0.0| 85812.69| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|
| 20| 0.0| 92| false| 0.0| 0.0| 195223.84| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|
| 20| 0.0| 12| false| 0.0| 0.0| 6527.56| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|
+-----+------------------------------+----+----------+-------------------------+--------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+
only showing top 5 rows
<class 'pyspark.sql.dataframe.DataFrame'>
You can also inspect the metadata of the feature group. For instance, you can show the features the feature group is made of and whether they are primary or partition keys:
print("Name: {}".format(sales_fg.name))
print("Description: {}".format(sales_fg.description))
print("Features:")
features = sales_fg.features
for feature in features:
print("{:<60} \t Primary: {} \t Partition: {}".format(feature.name, feature.primary, feature.partition))
Name: sales_fg
Description: Sales related features
Features:
store Primary: True Partition: False
sales_last_six_month_store_dep Primary: False Partition: False
dept Primary: True Partition: False
is_holiday Primary: False Partition: False
sales_last_year_store_dep Primary: False Partition: False
sales_last_six_month_store Primary: False Partition: False
weekly_sales Primary: False Partition: False
sales_last_quarter_store_dep Primary: False Partition: False
sales_last_month_store Primary: False Partition: False
sales_last_quarter_store Primary: False Partition: False
sales_last_year_store Primary: False Partition: False
date Primary: True Partition: False
sales_last_month_store_dep Primary: False Partition: False
If you are interested only in a subset of features, you can use the `select()` method on the feature group object to select a list of features. The result of `select()` behaves like a feature group, so you can call the `show()` or `read()` methods on it.
sales_fg.select(['store', 'dept', 'weekly_sales']).show(5)
+-----+----+------------+
|store|dept|weekly_sales|
+-----+----+------------+
| 20| 55| 32362.95|
| 20| 94| 63787.83|
| 20| 2| 85812.69|
| 20| 92| 195223.84|
| 20| 12| 6527.56|
+-----+----+------------+
only showing top 5 rows
If your feature group is available both online and offline, you can use the `online` option of the `show()` and `read()` methods to specify whether you want to read the feature group from the online storage.
sales_fg_3 = fs.get_feature_group('sales_fg', 3)
sales_fg_3.select(['store', 'dept', 'weekly_sales']).show(5, online=True)
+-----+----+------------+
|store|dept|weekly_sales|
+-----+----+------------+
| 41| 97| 19861.01|
| 25| 52| 1542.31|
| 40| 42| 5279.49|
| 28| 21| 4933.9|
| 33| 25| 60.5|
+-----+----+------------+
only showing top 5 rows
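The `read()` method accepts the same `online` option if you want the online data as a Spark dataframe. A minimal sketch:
# Read the online storage of the feature group into a Spark dataframe
online_df = sales_fg_3.select(['store', 'dept', 'weekly_sales']).read(online=True)
online_df.printSchema()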
Filter Feature Groups
If you do not want to read your feature group into a dataframe, you can also filter directly on a `FeatureGroup` object. Applying a filter to a feature group returns a `Query` object, which can subsequently be joined with other feature groups or queries, or materialized as a training dataset.
sales_fg.select(['store', 'dept', 'weekly_sales']).filter(sales_fg.weekly_sales >= 50000).show(5)
+-----+----+------------+
|store|dept|weekly_sales|
+-----+----+------------+
| 20| 94| 63787.83|
| 20| 2| 85812.69|
| 20| 92| 195223.84|
| 20| 95| 162621.8|
| 20| 8| 89319.61|
+-----+----+------------+
only showing top 5 rows
Conjunctions of filters can be constructed using the Python bitwise operators `|` and `&`, which replace the regular boolean operators when working with feature groups and filters.
sales_fg.select(['store', 'dept', 'weekly_sales']).filter((sales_fg.weekly_sales >= 50000) & (sales_fg.dept == 2)).show(5)
+-----+----+------------+
|store|dept|weekly_sales|
+-----+----+------------+
| 20| 2| 85812.69|
| 20| 2| 67951.33|
| 20| 2| 78321.16|
| 20| 2| 72410.12|
| 20| 2| 81139.43|
+-----+----+------------+
only showing top 5 rows
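Disjunctions work the same way with `|`. A sketch reusing the features above; the comparison of `is_holiday` against `True` is illustrative:
# Rows where sales are high OR the week contains a holiday
sales_fg.select(['store', 'dept', 'weekly_sales', 'is_holiday'])\
    .filter((sales_fg.weekly_sales >= 50000) | (sales_fg.is_holiday == True)).show(5)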
Join Features and Feature Groups
HSFS provides an API similar to Pandas to join feature groups together and to select features from different feature groups. The easiest query you can write selects all the features from one feature group and joins them with all the features of another feature group.
You can use the `select_all()` method of a feature group to select all its features. HSFS relies on the Hopsworks feature store to identify which features of the two feature groups to use as the joining condition.
If you don't specify anything, Hopsworks will use the largest matching subset of primary keys with the same name.
In the example below, `sales_fg` has `store`, `dept` and `date` as a composite primary key, while `exogenous_fg` has only `store` and `date`. So Hopsworks will use `store` and `date` as the joining condition.
sales_fg = fs.get_feature_group('sales_fg')
exogenous_fg = fs.get_feature_group('exogenous_fg')
query = sales_fg.select_all().join(exogenous_fg.select_all())
VersionWarning: No version provided for getting feature group `sales_fg`, defaulting to `1`.
VersionWarning: No version provided for getting feature group `exogenous_fg`, defaulting to `1`.
You can use the query object to create training datasets (see the training dataset notebook). You can inspect the query generated by calling the `to_string()` method on it.
print(query.to_string())
SELECT `fg1`.`dept`, `fg1`.`is_holiday`, `fg1`.`sales_last_year_store_dep`, `fg1`.`sales_last_six_month_store`, `fg1`.`store`, `fg1`.`sales_last_six_month_store_dep`, `fg1`.`weekly_sales`, `fg1`.`sales_last_quarter_store_dep`, `fg1`.`sales_last_month_store`, `fg1`.`sales_last_quarter_store`, `fg1`.`sales_last_year_store`, `fg1`.`date`, `fg1`.`sales_last_month_store_dep`, `fg0`.`cpi`, `fg0`.`is_holiday`, `fg0`.`markdown4`, `fg0`.`markdown5`, `fg0`.`unemployment`, `fg0`.`fuel_price`, `fg0`.`markdown1`, `fg0`.`markdown2`, `fg0`.`markdown3`, CASE WHEN `fg0`.`appended_feature` IS NULL THEN 10.0 ELSE `fg0`.`appended_feature` END `appended_feature`, `fg0`.`temperature`
FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg1`
INNER JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `fg0`.`date`
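As a preview of what the training dataset notebook covers, a query can be materialized roughly as follows. This is a minimal sketch; the dataset name and data format are illustrative:
# Create a training dataset metadata object and materialize the query into it
td = fs.create_training_dataset(name="sales_model_td",  # hypothetical name
                                version=1,
                                data_format="csv")
td.save(query)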
As for feature groups, you can call the `show()` method to inspect the data before generating a training dataset from it, or the `read()` method to get a Spark dataframe with the result of the query and apply additional transformations to it.
query.show(5)
+----+----------+-------------------------+--------------------------+-----+------------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+-----------+----------+---------+---------+------------+----------+---------+---------+---------+----------------+-----------+
|dept|is_holiday|sales_last_year_store_dep|sales_last_six_month_store|store|sales_last_six_month_store_dep|weekly_sales|sales_last_quarter_store_dep|sales_last_month_store|sales_last_quarter_store|sales_last_year_store| date|sales_last_month_store_dep| cpi|is_holiday|markdown4|markdown5|unemployment|fuel_price|markdown1|markdown2|markdown3|appended_feature|temperature|
+----+----------+-------------------------+--------------------------+-----+------------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+-----------+----------+---------+---------+------------+----------+---------+---------+---------+----------------+-----------+
| 55| false| 0.0| 0.0| 20| 0.0| 32362.95| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|204.2471935| false| NA| NA| 8.187| 2.784| NA| NA| NA| 10.0| 25.92|
| 94| false| 0.0| 0.0| 20| 0.0| 63787.83| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|204.2471935| false| NA| NA| 8.187| 2.784| NA| NA| NA| 10.0| 25.92|
| 2| false| 0.0| 0.0| 20| 0.0| 85812.69| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|204.2471935| false| NA| NA| 8.187| 2.784| NA| NA| NA| 10.0| 25.92|
| 92| false| 0.0| 0.0| 20| 0.0| 195223.84| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|204.2471935| false| NA| NA| 8.187| 2.784| NA| NA| NA| 10.0| 25.92|
| 12| false| 0.0| 0.0| 20| 0.0| 6527.56| 0.0| 0.0| 0.0| 0.0|2010-02-05| 0.0|204.2471935| false| NA| NA| 8.187| 2.784| NA| NA| NA| 10.0| 25.92|
+----+----------+-------------------------+--------------------------+-----+------------------------------+------------+----------------------------+----------------------+------------------------+---------------------+----------+--------------------------+-----------+----------+---------+---------+------------+----------+---------+---------+---------+----------------+-----------+
only showing top 5 rows
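For instance, a sketch of reading the query result into a dataframe and running an additional Spark aggregation on top:
from pyspark.sql import functions as F
# Materialize the query as a Spark dataframe, then aggregate average sales per store
df = query.read()
df.groupBy("store").agg(F.avg("weekly_sales").alias("avg_weekly_sales")).show(5)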
As with the `show()` and `read()` methods of a feature group, you can specify against which storage (online or offline) to run a query.
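For example, assuming the involved feature groups are available online:
# Run the joined query against the online storage instead of the offline one
query.show(5, online=True)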
Select only a subset of features
You can replace the `select_all()` method with the `select([])` method to select only a subset of features from the feature groups you want to join:
query = sales_fg.select(['store', 'dept', 'weekly_sales'])\
.join(exogenous_fg.select(['fuel_price']))
query.show(5)
+-----+----+------------+----------+
|store|dept|weekly_sales|fuel_price|
+-----+----+------------+----------+
| 20| 55| 32362.95| 2.784|
| 20| 94| 63787.83| 2.784|
| 20| 2| 85812.69| 2.784|
| 20| 92| 195223.84| 2.784|
| 20| 12| 6527.56| 2.784|
+-----+----+------------+----------+
only showing top 5 rows
Overwrite the joining key
If your feature groups don't have a primary key, if the keys have different names, or if you want to overwrite the joining key, you can pass it as a parameter of the join.
As in Pandas, if the feature has the same name in both feature groups, you can use the `on=[]` parameter. If the names differ, you can use the `left_on=[]` and `right_on=[]` parameters:
query = sales_fg.select(['store', 'dept', 'weekly_sales'])\
.join(exogenous_fg.select(['fuel_price']), on=['date'])
query.show(5)
+-----+----+------------+----------+
|store|dept|weekly_sales|fuel_price|
+-----+----+------------+----------+
| 20| 55| 32362.95| 2.784|
| 20| 55| 32362.95| 2.666|
| 20| 55| 32362.95| 2.572|
| 20| 55| 32362.95| 2.962|
| 20| 55| 32362.95| 2.58|
+-----+----+------------+----------+
only showing top 5 rows
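If the joining features are named differently in the two feature groups, a sketch using `left_on=[]` and `right_on=[]` would look like this. Suppose, hypothetically, that the exogenous feature group called its date column `record_date`:
# `record_date` is a hypothetical stand-in for a differently named join key
query = sales_fg.select(['store', 'dept', 'weekly_sales'])\
    .join(exogenous_fg.select(['fuel_price']), left_on=['date'], right_on=['record_date'])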
Overwriting the join type
By default, the join type between two feature groups is `INNER JOIN`. You can overwrite this behavior by passing the `join_type` parameter to the join method. Valid types are: `INNER`, `LEFT`, `RIGHT`, `FULL`, `CROSS`, `LEFT_SEMI_JOIN`, `COMMA`.
query = sales_fg.select(['store', 'dept', 'weekly_sales'])\
.join(exogenous_fg.select(['fuel_price']), join_type="left")
print(query.to_string())
SELECT `fg1`.`store`, `fg1`.`dept`, `fg1`.`weekly_sales`, `fg0`.`fuel_price`
FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg1`
LEFT JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg0` ON `fg1`.`store` = `fg0`.`store` AND `fg1`.`date` = `fg0`.`date`
Join multiple feature groups
You can concatenate as many feature groups as you wish. In the example below the order of execution will be:
(`sales_fg` <> `store_fg`) <> `exogenous_fg`
The join parameters you pass in each `join()` method call apply to that specific join, which means you can concatenate left and right joins (as sketched after the example below).
Please be aware that currently HSFS does not support nested joins such as:
`sales_fg` <> (`store_fg` <> `exogenous_fg`)
store_fg = fs.get_feature_group("store_fg")
query = sales_fg.select_all()\
.join(store_fg.select_all())\
.join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']))
print(query.to_string())
SELECT `fg2`.`dept`, `fg2`.`is_holiday`, `fg2`.`sales_last_year_store_dep`, `fg2`.`sales_last_six_month_store`, `fg2`.`store`, `fg2`.`sales_last_six_month_store_dep`, `fg2`.`weekly_sales`, `fg2`.`sales_last_quarter_store_dep`, `fg2`.`sales_last_month_store`, `fg2`.`sales_last_quarter_store`, `fg2`.`sales_last_year_store`, `fg2`.`date`, `fg2`.`sales_last_month_store_dep`, `fg0`.`type`, `fg0`.`size`, `fg0`.`num_depts`, `fg1`.`fuel_price`, `fg1`.`unemployment`, `fg1`.`cpi`
FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg2`
INNER JOIN `demo_fs_meb10000_featurestore`.`store_fg_1` `fg0` ON `fg2`.`store` = `fg0`.`store`
INNER JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg1` ON `fg2`.`date` = `fg1`.`date` AND `fg2`.`store` = `fg1`.`store`
VersionWarning: No version provided for getting feature group `store_fg`, defaulting to `1`.
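Because the join parameters apply per `join()` call, a sketch mixing join types within a single query could look like this:
# Inner join with store_fg, then left join with exogenous_fg
query = sales_fg.select_all()\
    .join(store_fg.select_all(), join_type="inner")\
    .join(exogenous_fg.select(['fuel_price']), join_type="left")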
Use Joins together with Filters
It is possible to use filters in any of the subqueries of a joined query.
query = sales_fg.select_all()\
.join(store_fg.select_all())\
.join(exogenous_fg.select(['fuel_price', 'unemployment', 'cpi']).filter(exogenous_fg.fuel_price <= 2.7)) \
.filter(sales_fg.weekly_sales >= 50000)
print(query.to_string())
SELECT `fg2`.`dept`, `fg2`.`is_holiday`, `fg2`.`sales_last_year_store_dep`, `fg2`.`sales_last_six_month_store`, `fg2`.`store`, `fg2`.`sales_last_six_month_store_dep`, `fg2`.`weekly_sales`, `fg2`.`sales_last_quarter_store_dep`, `fg2`.`sales_last_month_store`, `fg2`.`sales_last_quarter_store`, `fg2`.`sales_last_year_store`, `fg2`.`date`, `fg2`.`sales_last_month_store_dep`, `fg0`.`type`, `fg0`.`size`, `fg0`.`num_depts`, `fg1`.`fuel_price`, `fg1`.`unemployment`, `fg1`.`cpi`
FROM `demo_fs_meb10000_featurestore`.`sales_fg_1` `fg2`
INNER JOIN `demo_fs_meb10000_featurestore`.`store_fg_1` `fg0` ON `fg2`.`store` = `fg0`.`store`
INNER JOIN `demo_fs_meb10000_featurestore`.`exogenous_fg_1` `fg1` ON `fg2`.`date` = `fg1`.`date` AND `fg2`.`store` = `fg1`.`store`
WHERE `fg2`.`weekly_sales` >= 50000 AND `fg1`.`fuel_price` <= 2.7
Free hand query
With HSFS you are free to skip the Hopsworks query constructor entirely and write your own query. This can be useful if you need to express more complex queries for your use case. `fs.sql` returns a Spark dataframe.
fs.sql("SELECT * FROM `store_fg_1`")
DataFrame[store: int, type: string, size: int, num_depts: bigint]
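Since the result is a regular Spark dataframe, you can chain further transformations onto it. A short sketch using the columns shown above:
# Filter and project the free-hand query result with plain Spark operations
stores_df = fs.sql("SELECT * FROM `store_fg_1`")
stores_df.filter("size > 100000").select("store", "type", "size").show(5)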