Exam 203 back-end services Spark Delta Lake

Spark Delta Lake.

Delta Lake is a storage layer on top of Spark that adds relational database capabilities to data stored in a data lake.

By using Delta Lake, you can implement a data lakehouse architecture in Spark.

Delta Lake supports:

  1. CRUD (create, read, update, and delete) operations
  2. ACID transactions: atomicity (transactions complete as a single unit of work), consistency (transactions leave the database in a consistent state), isolation (in-process transactions can't interfere with one another), and durability (when a transaction completes, the changes it made are persisted)
  3. Data versioning and time travel (see the sketch after this list)
  4. Streaming as well as batch data, via the Spark Structured Streaming API (a streaming sketch also follows this list)
  5. The underlying data files are stored in Parquet format only, not CSV.
  6. A serverless SQL pool in Synapse Studio can be used to query Delta Lake tables.
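As a minimal sketch of time travel, the snippet below assumes a Delta table already exists at /delta/mydata (the illustrative path used later on this page) and reads earlier states of it with the versionAsOf and timestampAsOf read options:

# Read the current state of the table
df_current = spark.read.format("delta").load("/delta/mydata")

# Time travel: read the table as it was at version 0
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/mydata")

# Or read the table as it was at a point in time (illustrative timestamp)
df_old = spark.read.format("delta").option("timestampAsOf", "2024-11-01").load("/delta/mydata")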
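And as a minimal sketch of using Delta tables with the Spark Structured Streaming API, assuming the same source path plus illustrative output and checkpoint locations, a Delta table can serve as both a streaming source and a streaming sink:

# Use a Delta table as a streaming source
stream_df = spark.readStream.format("delta").load("/delta/mydata")

# Write the stream out to another Delta table (illustrative paths)
query = (stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/checkpoints/mydata_copy")
    .start("/delta/mydata_copy"))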

Create a Delta Lake table

Use the .write method on a dataframe to create a Delta Lake table, specifying .format("delta"):

# Load a file into a dataframe
df = spark.read.load('/data/mydata.csv', format='csv', header=True)

# Save the dataframe as a delta table
delta_table_path = "/delta/mydata"
df.write.format("delta").save(delta_table_path)
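
To confirm the table was written, it can be read back from the same path (a minimal sketch using the delta_table_path variable defined above):

# Read the delta table back into a dataframe and display it
new_df = spark.read.format("delta").load(delta_table_path)
new_df.show()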