Titanic Dataset with the Feature Store

This notebook prepares the Titanic dataset to be used with the feature store.

The Titanic dataset contains information about the passengers of the famous Titanic ship. The training and test data come in the form of two CSV files, which can be downloaded from the Titanic Competition page on Kaggle.

Download the train.csv and test.csv files, and upload them to the Resources folder of your Hopsworks Project. If you prefer working through the GUI, you can find the Resources folder by opening the Data Sets tab on the left menu bar.

Once you have the two files uploaded to the Resources folder, you can proceed with the rest of the notebook.

import tensorflow as tf
from hops import hdfs
from pyspark.sql import functions as F
import hsfs
Starting Spark application
SparkSession available as 'spark'.

Let’s begin by reading the training data into a Spark DataFrame:

# read the training data csv into a Spark DataFrame

training_csv_path = hdfs.project_path() + 'Resources/' + 'train.csv'
raw_df = spark.read.csv(training_csv_path, header=True)

Now we can do some simple preprocessing. Rather than registering the whole dataset with the Feature Store, we select only a few of the columns. Since the values of the sex column are either male or female, we map them to 0 or 1, respectively; we fill the missing values of the age column with 30; and finally we cast all columns to int.

# simple preprocessing:
#     1 - selecting a few of the columns
#     2 - Filling the missing 'age' values with 30
#     3 - changing sex to 0 or 1
#     4 - casting all columns to int

clean_train_df = raw_df.select('survived', 'pclass', 'sex', 'fare', 'age', 'sibsp', 'parch') \
                    .fillna({'age': 30}) \
                    .withColumn('sex',
                        F.when(F.col('sex')=='male', 0)
                        .otherwise(1))\
                    .withColumn('survived',
                               F.col('survived').cast('int')) \
                    .withColumn('pclass',
                               F.col('pclass').cast('int')) \
                    .withColumn('fare',
                                F.col('fare').cast('int')) \
                    .withColumn('age',
                               F.col('age').cast('int')) \
                    .withColumn('sibsp',
                               F.col('sibsp').cast('int')) \
                    .withColumn('parch',
                               F.col('parch').cast('int'))

Let’s see what our “clean” DataFrame looks like now:

clean_train_df.show()
+--------+------+---+----+---+-----+-----+
|survived|pclass|sex|fare|age|sibsp|parch|
+--------+------+---+----+---+-----+-----+
|       0|     3|  0|   7| 22|    1|    0|
|       1|     1|  1|  71| 38|    1|    0|
|       1|     3|  1|   7| 26|    0|    0|
|       1|     1|  1|  53| 35|    1|    0|
|       0|     3|  0|   8| 35|    0|    0|
|       0|     3|  0|   8| 30|    0|    0|
|       0|     1|  0|  51| 54|    0|    0|
|       0|     3|  0|  21|  2|    3|    1|
|       1|     3|  1|  11| 27|    0|    2|
|       1|     2|  1|  30| 14|    1|    0|
|       1|     3|  1|  16|  4|    1|    1|
|       1|     1|  1|  26| 58|    0|    0|
|       0|     3|  0|   8| 20|    0|    0|
|       0|     3|  0|  31| 39|    1|    5|
|       0|     3|  1|   7| 14|    0|    0|
|       1|     2|  1|  16| 55|    0|    0|
|       0|     3|  0|  29|  2|    4|    1|
|       1|     2|  0|  13| 30|    0|    0|
|       0|     3|  1|  18| 31|    1|    0|
|       1|     3|  1|   7| 30|    0|    0|
+--------+------+---+----+---+-----+-----+
only showing top 20 rows

The next step is to create a feature group from our clean DataFrame and register it with the project’s Feature Store:

connection = hsfs.connection()
fs = connection.get_feature_store()
Connected. Call `.close()` to terminate connection gracefully.
# create a feature group from the training data DataFrame
titanic_fg = fs.create_feature_group(name="titanic_training_all_features",
                                     version=1,
                                     description="titanic training dataset with clean features",
                                     time_travel_format=None,
                                     statistics_config={"enabled": True, "histograms": True, "correlations": True, "exact_uniqueness": True})
titanic_fg.save(clean_train_df)
<hsfs.feature_group.FeatureGroup object at 0x7f63e6e25b50>

Now we can set aside the “clean” DataFrame that we read directly from the CSV file, and retrieve the data from the feature store instead. Note that get_feature_group returns a feature group object rather than a DataFrame; calling show (or read) on it fetches the data:

# retrieve the feature group from the feature store
titanic_df = fs.get_feature_group('titanic_training_all_features', version=1)
titanic_df.show(4)
+-----+---+----+--------+------+---+-----+
|sibsp|sex|fare|survived|pclass|age|parch|
+-----+---+----+--------+------+---+-----+
|    1|  0|   7|       0|     3| 22|    0|
|    1|  1|  71|       1|     1| 38|    0|
|    0|  1|   7|       1|     3| 26|    0|
|    1|  1|  53|       1|     1| 35|    0|
+-----+---+----+--------+------+---+-----+
only showing top 4 rows

Finally, we create a training dataset from the feature group. This is a very simple task with the Feature Store API: you provide a name and a data format for the dataset. For now, let’s stick with tfrecord, TensorFlow’s own file format.

td = fs.create_training_dataset(name="titanic_train_dataset",
                               description="Dataset to train Titanic survival model",
                               data_format="tfrecord",
                               version=1)
td.save(titanic_df.read())
<hsfs.training_dataset.TrainingDataset object at 0x7f63e6de3b90>

Done! You can now use the Titanic training data in your projects!