Exam DP-203 front-end services: Databricks
The second front-end service is Databricks.
Access the Databricks portal from the Azure Portal by opening the Databricks resource and clicking the button that launches the Databricks workspace.
Databricks supports Python, Scala, R, and Spark SQL, along with multiple machine learning frameworks.
Delta Lake
Governance: Unity Catalog and Microsoft Purview
df = spark.sql("SELECT * FROM products")
df = df.filter("Category == 'Road Bikes'")
display(df)
Databricks File System (DBFS)
Matplotlib, Seaborn
filtered_df = df.filter(df["Age"] > 30)
You can install Python libraries such as pandas, NumPy, or scikit-learn. Spark MLlib is available for machine learning.
# Create a sample DataFrame
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Select columns
df.select("Name").show()
# Filter rows
df.filter(df["Age"] > 30).show()
# Group by and aggregate
df.groupBy("Age").count().show()
df_sales = df_spark.toPandas()
Spark Cluster
Types of Spark Cluster: Standard, High Concurrency (for multi-user), Single Node (for testing)
Specify the Databricks Runtime version.
Set the minimum and maximum number of worker nodes, and choose whether autoscaling is enabled.
Choose the type of virtual machine for the driver node and for the worker nodes.
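The cluster settings listed above can be written down as a small specification. The sketch below builds one as a plain Python dict. The field names follow the Databricks Clusters REST API as I understand it; the helper build_cluster_spec, the cluster name, the runtime string and the VM sizes are assumptions made up for illustration, so check the values offered in your own workspace.

```python
# Hypothetical helper: collects the cluster choices into one dict.
# Field names follow the Databricks Clusters REST API; values are examples only.

def build_cluster_spec(name, runtime_version, worker_vm, driver_vm,
                       min_workers, max_workers, autoscale=True):
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError("need 0 <= min_workers <= max_workers")
    spec = {
        "cluster_name": name,
        "spark_version": runtime_version,   # Databricks Runtime version string
        "node_type_id": worker_vm,          # VM size for the worker nodes
        "driver_node_type_id": driver_vm,   # VM size for the driver node
    }
    if autoscale:
        # Autoscaling: the cluster grows and shrinks between the two bounds.
        spec["autoscale"] = {"min_workers": min_workers,
                             "max_workers": max_workers}
    else:
        # No autoscaling: a fixed number of workers.
        spec["num_workers"] = max_workers
    return spec


spec = build_cluster_spec(
    name="exam-notes-cluster",            # assumed name
    runtime_version="13.3.x-scala2.12",   # assumed example value
    worker_vm="Standard_DS3_v2",          # assumed example VM size
    driver_vm="Standard_DS3_v2",
    min_workers=2,
    max_workers=8,
)
print(spec)
```

Passing autoscale=False produces a fixed-size cluster with a num_workers field instead of the autoscale block.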
Azure Kubernetes Service (AKS) is used to run the Azure Databricks control-plane.
Graphics
Matplotlib for graphics
from pyspark.sql.functions import col
# Remove duplicate rows
df = df.dropDuplicates()
# Add a Tax column at 8% of the unit price, then cast it to float
df = df.withColumn('Tax', col('UnitPrice') * 0.08)
df = df.withColumn('Tax', col('Tax').cast("float"))
Databricks Delta Lake
Delta Lake provides ACID transactions and data versioning through its transaction log.
# Create a Delta table
data = spark.range(0, 5)
data.write.format("delta").save("/FileStore/tables/my_delta_table")
Delta Lake manages concurrent writes with optimistic concurrency control: writers prepare their changes in parallel, but only one operation can commit its changes to the transaction log at a time.
# Optimize the Delta table
spark.sql("OPTIMIZE '/FileStore/tables/my_delta_table'")
# Clean up old files
spark.sql("VACUUM '/FileStore/tables/my_delta_table' RETAIN 168 HOURS")
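The "one commit at a time" rule can be pictured with a toy model in plain Python. This is not Delta Lake code: ToyLog, ConflictError and commit_with_retry are hypothetical names, and the real protocol is more refined (it can often reconcile concurrent changes that do not overlap). The sketch only shows the basic idea that a writer whose view of the table is stale must re-read and try again.

```python
class ConflictError(Exception):
    """Raised when another writer has already committed the target version."""


class ToyLog:
    """Toy stand-in for a transaction log: one entry per committed version."""

    def __init__(self):
        self.entries = []  # entries[i] is the change committed as version i

    def latest_version(self):
        return len(self.entries) - 1  # -1 means no commits yet

    def try_commit(self, read_version, change):
        # Succeeds only if nobody committed since this writer read the table.
        if read_version != self.latest_version():
            raise ConflictError(f"version {read_version + 1} already exists")
        self.entries.append(change)
        return self.latest_version()


def commit_with_retry(log, change, max_attempts=3):
    """Re-read the latest version and retry when a commit is rejected."""
    for _ in range(max_attempts):
        read_version = log.latest_version()
        try:
            return log.try_commit(read_version, change)
        except ConflictError:
            continue
    raise ConflictError("gave up after repeated conflicts")


log = ToyLog()
# Two writers both read the empty table (latest version is -1).
seen_by_a = log.latest_version()
seen_by_b = log.latest_version()

log.try_commit(seen_by_a, "writer A: append 5 rows")      # becomes version 0
try:
    log.try_commit(seen_by_b, "writer B: append 3 rows")  # stale read
    conflict = False
except ConflictError:
    conflict = True

# Writer B re-reads the log and commits on top of writer A's version.
version_b = commit_with_retry(log, "writer B: append 3 rows")
print(conflict, version_b, log.entries)
```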
Schema enforcement: writes containing incorrect data types are rejected.
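A rough way to picture schema enforcement is a write path that validates the whole batch before anything is stored. The plain-Python sketch below is only an analogy, not how Delta Lake is implemented; SCHEMA, SchemaMismatch, validate_row and append_rows are hypothetical names, and the Name/Age columns are borrowed from the sample DataFrame earlier in these notes.

```python
SCHEMA = {"Name": str, "Age": int}  # hypothetical table schema


class SchemaMismatch(Exception):
    """Raised when a row does not match the table schema."""


def validate_row(row, schema=SCHEMA):
    # Column names must match the schema exactly.
    if set(row) != set(schema):
        raise SchemaMismatch(
            f"columns {sorted(row)} do not match {sorted(schema)}")
    # Every value must have the declared type.
    for column, expected in schema.items():
        if not isinstance(row[column], expected):
            raise SchemaMismatch(
                f"{column} expects {expected.__name__}, "
                f"got {type(row[column]).__name__}")
    return row


def append_rows(table, rows):
    # Validate the whole batch first, so one bad row rejects the entire write.
    checked = [validate_row(r) for r in rows]
    table.extend(checked)


table = []
append_rows(table, [{"Name": "Alice", "Age": 34}])

try:
    append_rows(table, [{"Name": "Bob", "Age": 45},
                        {"Name": "Cathy", "Age": "twenty-nine"}])
    rejected = False
except SchemaMismatch as err:
    rejected = True
    print("write rejected:", err)

print(table)
```

Because the batch is checked before it is appended, the valid "Bob" row is not written either, which mirrors the all-or-nothing behaviour of a transactional write.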
The MERGE statement performs upserts: rows from a source are matched against the target table, then updated, inserted, or deleted in a single operation.
The DESCRIBE HISTORY statement lists the recent operations on the table, one row per table version.
-- View table history
DESCRIBE HISTORY person_data;
-- Query data as of version 0
SELECT * FROM person_data VERSION AS OF 0;
-- Query data as of a specific timestamp
SELECT * FROM person_data TIMESTAMP AS OF '2024-07-22T10:00:00Z';
-- Restore the table to version 0
RESTORE TABLE person_data TO VERSION AS OF 0;
-- Restore the table to a specific timestamp
RESTORE TABLE person_data TO TIMESTAMP AS OF '2024-07-22T10:00:00Z';
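The matching logic behind a MERGE can be sketched in plain Python. This is a simplified illustration rather than Delta Lake's implementation: the merge function, the id key and the sample rows are assumptions, and only the "when matched, update" and "when not matched, insert" branches are modelled.

```python
def merge(target, source, key):
    """Upsert source rows into target, matching on `key`.

    Returns a tuple (updated, inserted) with the number of rows in each branch.
    """
    # Map each key value to the position of its row in the target.
    index = {row[key]: i for i, row in enumerate(target)}
    updated = inserted = 0
    for row in source:
        if row[key] in index:
            # WHEN MATCHED THEN UPDATE
            target[index[row[key]]] = dict(row)
            updated += 1
        else:
            # WHEN NOT MATCHED THEN INSERT
            index[row[key]] = len(target)
            target.append(dict(row))
            inserted += 1
    return updated, inserted


people = [
    {"id": 1, "name": "Alice", "age": 34},
    {"id": 2, "name": "Bob", "age": 45},
]
changes = [
    {"id": 2, "name": "Bob", "age": 46},    # existing id: update
    {"id": 3, "name": "Cathy", "age": 29},  # new id: insert
]

counts = merge(people, changes, key="id")
print(counts)
print(people)
```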
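History, time travel and restore fit together because every change produces a new table version. The toy model below illustrates that relationship in plain Python. VersionedTable and its methods are hypothetical, and it stores full snapshots for simplicity, whereas the real transaction log records which data files were added and removed. It also shows that a restore is recorded as a new version and does not erase the history.

```python
class VersionedTable:
    """Toy table that keeps a snapshot of its rows for every version."""

    def __init__(self):
        self.history = []  # history[v] = (operation, rows at version v)

    def _commit(self, operation, rows):
        self.history.append((operation, list(rows)))
        return len(self.history) - 1  # the new version number

    def write(self, rows):
        current = self.history[-1][1] if self.history else []
        return self._commit("WRITE", current + list(rows))

    def as_of(self, version):
        # Analogous to SELECT ... VERSION AS OF <version>
        return list(self.history[version][1])

    def restore(self, version):
        # Analogous to RESTORE TABLE ... TO VERSION AS OF <version>
        return self._commit(f"RESTORE to {version}", self.as_of(version))

    def describe_history(self):
        # Analogous to DESCRIBE HISTORY: one entry per version
        return [(v, op) for v, (op, _) in enumerate(self.history)]


person_data = VersionedTable()
person_data.write(["Alice"])         # version 0
person_data.write(["Bob"])           # version 1
restored = person_data.restore(0)    # version 2, same rows as version 0

print(person_data.describe_history())
print(person_data.as_of(1))
print(person_data.as_of(restored))
```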