Exam 203 back-end services Spark
Revision as of 12:45, 17 November 2024
The second back-end service is Spark.
Languages supported in Spark include Python, Scala, Java, SQL, and C#.
To run Spark code in Synapse Studio, first create a Spark pool on the Manage tab under Apache Spark pools. Then, on the Develop tab, create a new notebook and attach it to the Spark pool you created. Paste the following code into a cell and run it:
%%pyspark
df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/product_data/products.csv',
    format='csv',
    header=True
)
display(df.limit(10))
Note that the first run takes several minutes to complete, because the Spark pool has to start up first.
Then run:
df_counts = df.groupby(df.Category).count()
display(df_counts)
The result can also be displayed as a chart.
Overview
The SparkContext connects to the cluster manager, which allocates resources across applications using an implementation of Apache Hadoop YARN.
The nodes read and write data from and to the file system and cache transformed data in-memory as Resilient Distributed Datasets (RDDs).
The cluster manager allocates executors on the nodes, and the SparkContext sends work to these executors to be run in parallel.
The SparkContext is responsible for converting an application to a directed acyclic graph (DAG).
When creating the Spark pool, choose the size of the virtual machine (VM) used for the nodes in the pool (including the option to use hardware-accelerated, GPU-enabled nodes), the number of nodes, and the version of Spark.
Spark code is run in Synapse Studio notebooks, which are organized into cells.