Delta Lake on the Feature Store

Delta Lake on Hops

This notebook contains some examples of how you can use Delta Lake on Hops.

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads.

Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.

  • Scalable Metadata Handling: In big data, even the metadata itself can be “big data”. Delta Lake treats metadata just like data, leveraging Spark’s distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.

  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.

  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.

  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.

  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.

  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.

  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.

  • Audit History: The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes (see the short sketch after this list).

  • Full DML Support: Delta Lake supports standard DML including UPDATE, DELETE and MERGE INTO, giving developers more control when managing their big datasets.

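The audit-history and DML features above map directly onto the io.delta.tables.DeltaTable API. As a short preview (a minimal sketch, assuming a Delta table already exists at the given path; the hello_delta table created later in this notebook would work), you can inspect the commit history and issue row-level DML:

import io.delta.tables.DeltaTable
import io.hops.util.Hops

// Assumes a Delta table already exists at this path (it is created further down in this notebook).
val previewTable = DeltaTable.forPath(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")

// Audit history: every commit (write, overwrite, merge, ...) is recorded in the transaction log.
previewTable.history().select("version", "timestamp", "operation").show(false)

// Full DML support: row-level deletes and updates go through the same transaction log.
// Left commented out so they do not alter the table used in the rest of the notebook.
// previewTable.delete("id = 6")
// previewTable.updateExpr("id = 1", Map("country" -> "'Norway'"))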
import io.hops.util.Hops
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.DataFrameWriter
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession
import java.sql.Date
import java.sql.Timestamp
import org.apache.spark.sql._
import spark.implicits._
import org.apache.spark.sql.types._
val bulkInsertData = Seq(
    Row(1, Date.valueOf("2019-03-02"), 0.4151f, "Sweden"),
    Row(2, Date.valueOf("2019-05-01"), 1.2151f, "Ireland"),
    Row(3, Date.valueOf("2019-08-06"), 0.2151f, "Belgium"),
    Row(4, Date.valueOf("2019-08-06"), 0.8151f, "Russia")
)
val schema = 
 scala.collection.immutable.List(
  StructField("id", IntegerType, true),
  StructField("date", DateType, true),
  StructField("value", FloatType, true),
  StructField("country", StringType, true) 
)
val bulkInsertDf = spark.createDataFrame(
  spark.sparkContext.parallelize(bulkInsertData),
  StructType(schema)
)
bulkInsertDf.show(5)
bulkInsertData: Seq[org.apache.spark.sql.Row] = List([1,2019-03-02,0.4151,Sweden], [2,2019-05-01,1.2151,Ireland], [3,2019-08-06,0.2151,Belgium], [4,2019-08-06,0.8151,Russia])
schema: List[org.apache.spark.sql.types.StructField] = List(StructField(id,IntegerType,true), StructField(date,DateType,true), StructField(value,FloatType,true), StructField(country,StringType,true))
bulkInsertDf: org.apache.spark.sql.DataFrame = [id: int, date: date ... 2 more fields]
+---+----------+------+-------+
| id|      date| value|country|
+---+----------+------+-------+
|  1|2019-03-02|0.4151| Sweden|
|  2|2019-05-01|1.2151|Ireland|
|  3|2019-08-06|0.2151|Belgium|
|  4|2019-08-06|0.8151| Russia|
+---+----------+------+-------+

To create a Delta dataset, simply set the data format to “delta”:

bulkInsertDf.write.format("delta").save(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")

A Delta dataset keeps a commit log (the transaction log) to support ACID transactions.

Delta Dataset
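You can see this commit log directly on HDFS by listing the _delta_log directory under the table path (a minimal sketch using the Hadoop FileSystem API; each committed write adds a numbered JSON commit file):

import org.apache.hadoop.fs.{FileSystem, Path}

// List the files Delta writes under _delta_log (one numbered JSON commit file per committed write).
val logPath = new Path(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta/_delta_log")
val fs = logPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
fs.listStatus(logPath).map(_.getPath.getName).sorted.foreach(println)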

val df = spark.read.format("delta").load(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")
df.show()

Delta also provides time-travel functionality that lets you inspect the value of a dataset at a particular point in time:

val df = spark.read.format("delta").option("versionAsOf", 0).load(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")
df.show()
df: org.apache.spark.sql.DataFrame = [id: int, date: date ... 2 more fields]
+---+----------+------+-------+
| id|      date| value|country|
+---+----------+------+-------+
|  3|2019-08-06|0.2151|Belgium|
|  4|2019-08-06|0.8151| Russia|
|  1|2019-03-02|0.4151| Sweden|
|  2|2019-05-01|1.2151|Ireland|
+---+----------+------+-------+

Delta supports upserts and overwrites:

val overwriteData = Seq(
    Row(1, Date.valueOf("2019-06-30"), 0.4151f, "Sweden"),
    Row(2, Date.valueOf("2019-05-01"), 1.2151f, "Ireland"),
    Row(3, Date.valueOf("2017-08-06"), 0.2151f, "Belgium"),
    Row(4, Date.valueOf("2019-08-06"), 0.8151f, "Russia")
)
val overwriteDataDf = spark.createDataFrame(
  spark.sparkContext.parallelize(overwriteData),
  StructType(schema)
)
overwriteDataDf.show(5)
overwriteData: Seq[org.apache.spark.sql.Row] = List([1,2019-06-30,0.4151,Sweden], [2,2019-05-01,1.2151,Ireland], [3,2017-08-06,0.2151,Belgium], [4,2019-08-06,0.8151,Russia])
overwriteDataDf: org.apache.spark.sql.DataFrame = [id: int, date: date ... 2 more fields]
+---+----------+------+-------+
| id|      date| value|country|
+---+----------+------+-------+
|  1|2019-06-30|0.4151| Sweden|
|  2|2019-05-01|1.2151|Ireland|
|  3|2017-08-06|0.2151|Belgium|
|  4|2019-08-06|0.8151| Russia|
+---+----------+------+-------+
overwriteDataDf.write.format("delta").mode("overwrite").save(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")
val df = spark.read.format("delta").load(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")
df.show()
df: org.apache.spark.sql.DataFrame = [id: int, date: date ... 2 more fields]
+---+----------+------+-------+
| id|      date| value|country|
+---+----------+------+-------+
|  1|2019-06-30|0.4151| Sweden|
|  2|2019-05-01|1.2151|Ireland|
|  3|2017-08-06|0.2151|Belgium|
|  4|2019-08-06|0.8151| Russia|
+---+----------+------+-------+
val df = spark.read.format("delta").option("versionAsOf", 0).load(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")
df.show()
df: org.apache.spark.sql.DataFrame = [id: int, date: date ... 2 more fields]
+---+----------+------+-------+
| id|      date| value|country|
+---+----------+------+-------+
|  3|2019-08-06|0.2151|Belgium|
|  4|2019-08-06|0.8151| Russia|
|  1|2019-03-02|0.4151| Sweden|
|  2|2019-05-01|1.2151|Ireland|
+---+----------+------+-------+
val df = spark.read.format("delta").option("versionAsOf", 1).load(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")
df.show()
df: org.apache.spark.sql.DataFrame = [id: int, date: date ... 2 more fields]
+---+----------+------+-------+
| id|      date| value|country|
+---+----------+------+-------+
|  1|2019-06-30|0.4151| Sweden|
|  2|2019-05-01|1.2151|Ireland|
|  3|2017-08-06|0.2151|Belgium|
|  4|2019-08-06|0.8151| Russia|
+---+----------+------+-------+

To upsert data in a Delta dataset, use the merge primitive:

val upsertData = Seq(
    Row(5, Date.valueOf("2019-03-02"), 0.7921f, "Northern Ireland"), //Insert
    Row(1, Date.valueOf("2019-05-01"), 1.151f, "Norway"), //Update
    Row(3, Date.valueOf("2019-08-06"), 0.999f, "Belgium"), //Update
    Row(6, Date.valueOf("2019-08-06"), 0.0151f, "France") //Insert
)
val upsertDf = spark.createDataFrame(
  spark.sparkContext.parallelize(upsertData),
  StructType(schema)
)
upsertDf.show(5)
upsertData: Seq[org.apache.spark.sql.Row] = List([5,2019-03-02,0.7921,Northern Ireland], [1,2019-05-01,1.151,Norway], [3,2019-08-06,0.999,Belgium], [6,2019-08-06,0.0151,France])
upsertDf: org.apache.spark.sql.DataFrame = [id: int, date: date ... 2 more fields]
+---+----------+------+----------------+
| id|      date| value|         country|
+---+----------+------+----------------+
|  5|2019-03-02|0.7921|Northern Ireland|
|  1|2019-05-01| 1.151|          Norway|
|  3|2019-08-06| 0.999|         Belgium|
|  6|2019-08-06|0.0151|          France|
+---+----------+------+----------------+
import io.delta.tables._
import org.apache.spark.sql.functions._
import io.delta.tables._
import org.apache.spark.sql.functions._
val deltaTable = DeltaTable.forPath(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")
deltaTable: io.delta.tables.DeltaTable = io.delta.tables.DeltaTable@10f145bb
val newData = upsertDf.as("newData")
newData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: int, date: date ... 2 more fields]
(deltaTable.as("oldData")
  .merge(
    newData,
    "oldData.id = newData.id")
  .whenMatched
  .update(Map("id" -> col("newData.id"), "date" -> col("newData.date"), 
              "value" -> col("newData.value"), "country" -> col("newData.country")))
  .whenNotMatched
  .insert(Map("id" -> col("newData.id"), "date" -> col("newData.date"), 
              "value" -> col("newData.value"), "country" -> col("newData.country")))
  .execute())
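If your Delta Lake version supports them, whenMatched.updateAll() and whenNotMatched.insertAll() are shorthand for copying every column from the new data. The sketch below is an equivalent formulation of the merge above (shown for reference only; it is not executed here, so the table versions below are unchanged):

(deltaTable.as("oldData")
  .merge(newData, "oldData.id = newData.id")
  .whenMatched
  .updateAll()
  .whenNotMatched
  .insertAll()
  .execute())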
val df = spark.read.format("delta").load(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")
df.show()
df: org.apache.spark.sql.DataFrame = [id: int, date: date ... 2 more fields]
+---+----------+------+----------------+
| id|      date| value|         country|
+---+----------+------+----------------+
|  5|2019-03-02|0.7921|Northern Ireland|
|  2|2019-05-01|1.2151|         Ireland|
|  3|2019-08-06| 0.999|         Belgium|
|  6|2019-08-06|0.0151|          France|
|  4|2019-08-06|0.8151|          Russia|
|  1|2019-05-01| 1.151|          Norway|
+---+----------+------+----------------+
val df = spark.read.format("delta").option("versionAsOf", 0).load(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")
df.show()
df: org.apache.spark.sql.DataFrame = [id: int, date: date ... 2 more fields]
+---+----------+------+-------+
| id|      date| value|country|
+---+----------+------+-------+
|  3|2019-08-06|0.2151|Belgium|
|  4|2019-08-06|0.8151| Russia|
|  1|2019-03-02|0.4151| Sweden|
|  2|2019-05-01|1.2151|Ireland|
+---+----------+------+-------+
val df = spark.read.format("delta").option("versionAsOf", 1).load(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")
df.show()
df: org.apache.spark.sql.DataFrame = [id: int, date: date ... 2 more fields]
+---+----------+------+-------+
| id|      date| value|country|
+---+----------+------+-------+
|  1|2019-06-30|0.4151| Sweden|
|  2|2019-05-01|1.2151|Ireland|
|  3|2017-08-06|0.2151|Belgium|
|  4|2019-08-06|0.8151| Russia|
+---+----------+------+-------+
val df = spark.read.format("delta").option("versionAsOf", 2).load(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")
df.show()
df: org.apache.spark.sql.DataFrame = [id: int, date: date ... 2 more fields]
+---+----------+------+----------------+
| id|      date| value|         country|
+---+----------+------+----------------+
|  5|2019-03-02|0.7921|Northern Ireland|
|  2|2019-05-01|1.2151|         Ireland|
|  3|2019-08-06| 0.999|         Belgium|
|  6|2019-08-06|0.0151|          France|
|  4|2019-08-06|0.8151|          Russia|
|  1|2019-05-01| 1.151|          Norway|
+---+----------+------+----------------+
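Besides versionAsOf, Delta can also time-travel by timestamp with the timestampAsOf read option. A minimal sketch (the timestamp below is hypothetical and must lie between the table's first and latest commit timestamps):

// Hypothetical timestamp: replace it with one that falls within the table's commit history.
val dfAtTimestamp = spark.read.format("delta")
  .option("timestampAsOf", "2019-09-01 00:00:00")
  .load(s"hdfs:///Projects/${Hops.getProjectName}/Resources/hello_delta")
dfAtTimestamp.show()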