Exam DP203 front-end services Databricks

From MillerSql.com
Revision as of 17:53, 21 November 2024

The second front-end service is 2. Databricks.

Access the Databricks portal from the Azure Portal by opening the Databricks resource and clicking the link to launch the Databricks workspace.

Databricks supports Python, Scala, R, and Spark SQL, along with multiple machine learning frameworks.

Delta Lake provides the storage layer, adding ACID transactions and time travel on top of files in the data lake.

Governance: Unity Catalog and Microsoft Purview.

 df = spark.sql("SELECT * FROM products")
 df = df.filter("Category == 'Road Bikes'")
 display(df)

Files are stored in the Databricks File System (DBFS), a distributed file system mounted into the workspace and backed by cloud storage.
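DBFS paths use the <code>dbfs:/</code> scheme in Spark calls, while the same files appear under the <code>/dbfs/</code> local mount when read with plain Python on the driver. A minimal sketch of converting between the two forms (the helper name is my own, not a Databricks API):

```python
# Hypothetical helper (not a Databricks API): convert a dbfs:/ URI into the
# /dbfs/ local mount path used when reading the file with plain Python.
def dbfs_to_local(path: str) -> str:
    prefix = "dbfs:/"
    if path.startswith(prefix):
        return "/dbfs/" + path[len(prefix):]
    return path  # already a local-style path; leave unchanged

print(dbfs_to_local("dbfs:/FileStore/sales.csv"))  # /dbfs/FileStore/sales.csv
```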

Visualisation libraries such as Matplotlib and Seaborn can be used in notebooks to plot data.

 filtered_df = df.filter(df["Age"] > 30)

You can install Python libraries such as Pandas, NumPy, or Scikit-learn. Spark's built-in MLlib library can be used for machine learning.
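Once available (Pandas and NumPy ship with the Databricks Runtime), these libraries can be used directly in a notebook cell. A minimal sketch with sample data of my own:

```python
import numpy as np
import pandas as pd

# Build a small pandas DataFrame from a NumPy array and summarise it
ages = np.array([34, 45, 29])
df = pd.DataFrame({"Name": ["Alice", "Bob", "Cathy"], "Age": ages})
print(df["Age"].mean())  # 36.0
```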

 # Create a sample DataFrame
 data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
 columns = ["Name", "Age"]
 df = spark.createDataFrame(data, columns)
 
 # Select columns
 df.select("Name").show()
 
 # Filter rows
 df.filter(df["Age"] > 30).show()
 
 # Group by and aggregate
 df.groupBy("Age").count().show()
 
 # Convert the Spark DataFrame to a pandas DataFrame (collects to the driver)
 df_sales = df.toPandas()
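After <code>toPandas()</code> the data lives in an ordinary pandas DataFrame, so the same select/filter/aggregate steps can be expressed in pandas. A sketch using the same sample data as above:

```python
import pandas as pd

# Same sample data as the Spark example above
df_sales = pd.DataFrame({"Name": ["Alice", "Bob", "Cathy"], "Age": [34, 45, 29]})

# Select columns
print(df_sales[["Name"]])

# Filter rows
print(df_sales[df_sales["Age"] > 30])

# Group by and aggregate
print(df_sales.groupby("Age").size())
```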

== Spark Cluster ==

Types of Spark Cluster: '''Standard''', '''High Concurrency''' (for multi-user), '''Single Node''' (for testing)

Specify the Databricks Runtime version.

Specify the minimum and maximum number of worker nodes, and whether autoscaling is enabled.

Specify the virtual machine type for the driver node and for the worker nodes.
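These settings correspond to fields in the Databricks Clusters API. A hedged sketch of a cluster-creation payload; the runtime version string and VM sizes are illustrative values, not recommendations:

```json
{
  "cluster_name": "dp203-demo",
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "driver_node_type_id": "Standard_DS3_v2",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  }
}
```

For a fixed-size cluster, a single <code>num_workers</code> field replaces the <code>autoscale</code> block.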