Exam DP203 front-end services Databricks


The second front-end service is 2. Databricks.

Access the Databricks portal from the Azure Portal by going into the Databricks resource and clicking to open the Databricks workspace.

Databricks supports Python, Scala, R, and Spark SQL, along with multiple machine learning frameworks.

Delta Lake

Governance: Unity Catalog and Microsoft Purview.

 df = spark.sql("SELECT * FROM products")
 df = df.filter("Category == 'Road Bikes'")
 display(df)

Databricks File System (DBFS)
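
Files in DBFS can be listed and read directly from a notebook. A minimal sketch, assuming a hypothetical CSV at /FileStore/tables/products.csv:

 # List files in a DBFS directory (dbutils is available in Databricks notebooks)
 display(dbutils.fs.ls("/FileStore/tables"))
 # Read a CSV file from DBFS into a Spark DataFrame
 df = spark.read.csv("/FileStore/tables/products.csv", header=True, inferSchema=True)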

Matplotlib, Seaborn

filtered_df = df.filter(df["Age"] > 30)

You can install Python libraries such as Pandas, NumPy, or Scikit-learn (for example with %pip install in a notebook cell). Spark's built-in machine learning library is MLlib; a sketch follows below.
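
A minimal MLlib sketch (not from the source page; the toy data is made up):

 from pyspark.ml.feature import VectorAssembler
 from pyspark.ml.regression import LinearRegression
 # Toy training data: one numeric feature column and a label column
 train = spark.createDataFrame([(1.0, 2.0), (2.0, 4.1), (3.0, 6.2)], ["x", "y"])
 # MLlib expects features assembled into a single vector column
 features = VectorAssembler(inputCols=["x"], outputCol="features").transform(train)
 model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
 print(model.coefficients, model.intercept)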

# Create a sample DataFrame
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Select columns
df.select("Name").show()

# Filter rows
df.filter(df["Age"] > 30).show()

# Group by and aggregate
df.groupBy("Age").count().show()

# Convert a Spark DataFrame to a Pandas DataFrame
df_sales = df_spark.toPandas()
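
The reverse conversion also works: spark.createDataFrame accepts a Pandas DataFrame.

# Convert a Pandas DataFrame back to a Spark DataFrame
df_spark2 = spark.createDataFrame(df_sales)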

Spark Cluster

Types of Spark cluster: Standard, High Concurrency (for multi-user workloads), and Single Node (for testing).

Specify the Databricks Runtime version.

Minimum and maximum number of worker nodes, and whether autoscaling is enabled.

Type of virtual machine for the driver node and the worker nodes.
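
These settings can also be supplied programmatically. A minimal sketch against the Databricks Clusters REST API (the workspace URL, token, runtime version, and VM sizes are placeholder assumptions):

 import requests
 # Placeholder workspace URL and personal access token
 host = "https://adb-1234567890123456.7.azuredatabricks.net"
 token = "<personal-access-token>"
 # Cluster spec: runtime version, VM sizes, and autoscaling bounds
 spec = {
     "cluster_name": "demo-cluster",
     "spark_version": "13.3.x-scala2.12",       # Databricks Runtime version
     "node_type_id": "Standard_DS3_v2",         # worker VM size
     "driver_node_type_id": "Standard_DS3_v2",  # driver VM size
     "autoscale": {"min_workers": 2, "max_workers": 8},
 }
 resp = requests.post(f"{host}/api/2.0/clusters/create",
                      headers={"Authorization": f"Bearer {token}"}, json=spec)
 print(resp.json())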

Azure Kubernetes Service (AKS) is used to run the Azure Databricks control plane.

Graphics

Matplotlib for graphics
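
A minimal sketch of charting with Matplotlib, reusing the Name/Age sample DataFrame from above (convert to Pandas first):

 import matplotlib.pyplot as plt
 # Convert the Spark DataFrame to Pandas, then plot
 pdf = df.toPandas()
 plt.bar(pdf["Name"], pdf["Age"])
 plt.xlabel("Name")
 plt.ylabel("Age")
 plt.show()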

 from pyspark.sql.functions import col
 # Remove duplicate rows
 df = df.dropDuplicates()
 # Add a Tax column at 8% of UnitPrice, cast to float
 df = df.withColumn('Tax', col('UnitPrice') * 0.08)
 df = df.withColumn('Tax', col('Tax').cast("float"))

Databricks Delta Lake

Delta Lake provides ACID transactions, and data versioning using the transaction log.

# Create a Delta table
data = spark.range(0, 5)
data.write.format("delta").save("/FileStore/tables/my_delta_table")
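
To read the table back from the same path:

# Read the Delta table back into a DataFrame
df = spark.read.format("delta").load("/FileStore/tables/my_delta_table")
df.show()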

Delta Lake manages concurrent writes through optimistic concurrency control: only one operation can commit its changes at a time.

# Optimize the Delta table by compacting small files
spark.sql("OPTIMIZE '/FileStore/tables/my_delta_table'")

# Clean up unreferenced old files (168 hours = the default 7-day retention)
spark.sql("VACUUM '/FileStore/tables/my_delta_table' RETAIN 168 HOURS")

Schema enforcement: writes whose columns or datatypes do not match the table schema are rejected.
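
A rough sketch of schema enforcement in practice (the mismatched column Extra is made up); schema evolution can be opted into with the mergeSchema option:

# This append is rejected: 'Extra' is not in the table's schema
bad = spark.createDataFrame([(99, "oops")], ["id", "Extra"])
# bad.write.format("delta").mode("append").save("/FileStore/tables/my_delta_table")  # AnalysisException
# Opting in to schema evolution adds the new column instead
bad.write.format("delta").mode("append").option("mergeSchema", "true").save("/FileStore/tables/my_delta_table")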

The MERGE statement performs upserts: it updates matching rows and inserts new ones in a single operation.
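
A minimal MERGE sketch (the staging table updates and key column person_id are illustrative; person_data matches the history examples below):

# Upsert rows from a staging table into a Delta table
spark.sql("""
    MERGE INTO person_data AS t
    USING updates AS s
    ON t.person_id = s.person_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")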

The DESCRIBE HISTORY statement lists the recent updates to the table.

-- View table history
DESCRIBE HISTORY person_data;


-- Query data as of version 0
SELECT * FROM person_data VERSION AS OF 0;

-- Query data as of a specific timestamp
SELECT * FROM person_data TIMESTAMP AS OF '2024-07-22T10:00:00Z';

-- Restore the table to version 0
RESTORE TABLE person_data TO VERSION AS OF 0;

-- Restore the table to a specific timestamp
RESTORE TABLE person_data TO TIMESTAMP AS OF '2024-07-22T10:00:00Z';

The Delta Lake Python API can also update a table in place:

from delta.tables import *
from pyspark.sql.functions import *
   
# Path to an existing products Delta table (placeholder path)
delta_table_path = "/FileStore/tables/products"

# Create a DeltaTable object
deltaTable = DeltaTable.forPath(spark, delta_table_path)
# Update the table (reduce price of product 771 by 10%)
deltaTable.update(
    condition = "ProductID == 771",
    set = { "ListPrice": "ListPrice * 0.9" })
# View the updated data as a dataframe
deltaTable.toDF().show(10)

Delta Live Tables
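
Delta Live Tables defines pipeline tables declaratively. A minimal Python sketch (not from the source page; the table name and filter are made up):

 import dlt
 from pyspark.sql.functions import col
 # DLT creates and manages this table from the query below
 @dlt.table(comment="Cleaned transaction records")
 def clean_transactions():
     return (
         spark.read.json("/path/to/transaction_data.json")
              .filter(col("amount") > 0)
     )

The SQL below loads the same kinds of source files into ordinary Delta tables and joins them: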

-- Load customer data from CSV
CREATE OR REPLACE TEMPORARY VIEW customer_data_view
USING csv
OPTIONS (path "/path/to/customer_data.csv", header "true", inferSchema "true");

-- Insert data into customer Delta table
INSERT INTO customer_data
SELECT * FROM customer_data_view;

-- Load transaction data from JSON
CREATE OR REPLACE TEMPORARY VIEW transaction_data_view AS
SELECT * FROM json.`/path/to/transaction_data.json`;

-- Insert data into transaction Delta table
INSERT INTO transaction_data
SELECT * FROM transaction_data_view;

-- Create a unified view of customer transactions
CREATE OR REPLACE TEMPORARY VIEW customer_transactions AS
SELECT
    t.transaction_id,
    t.customer_id,
    c.customer_name,
    c.email,
    t.transaction_date,
    t.amount
FROM
    transaction_data t
JOIN
    customer_data c
ON
    t.customer_id = c.customer_id;

-- Create a Delta table for the integrated data
CREATE TABLE integrated_data USING DELTA AS
SELECT * FROM customer_transactions;

Azure Databricks Workflows

Jobs are scheduled and orchestrated from the Workflows menu item.

Databricks Modes

  1. Single User
  2. Shared: allows multiple users, with isolation between them
  3. No Isolation Shared: allows multiple users, without isolation between them

Databricks Runtime Version

The version of the Databricks Runtime to be used, which determines the Apache Spark version and the pre-installed libraries.