Exam 203 back-end services Spark

The second back-end service is Spark.

Languages supported in Spark include Python, Scala, Java, SQL, and C#.
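
These can also be mixed within a notebook; for example, SQL can be issued from a Python cell through the session's sql() method. A minimal sketch, using a throwaway DataFrame and a hypothetical view name (demo_products) purely for illustration:

%%pyspark
# Throwaway data, just to show SQL running from a Python cell
demo = spark.createDataFrame([('Bikes', 1), ('Bikes', 2), ('Helmets', 3)],
                             ['Category', 'ProductID'])
demo.createOrReplaceTempView('demo_products')
display(spark.sql('SELECT Category, COUNT(*) AS n FROM demo_products GROUP BY Category'))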

To run Spark code in Synapse Studio, first create a Spark pool on the Manage - Apache Spark pools tab. Then, on the Develop tab, create a new notebook and attach it to the Spark pool just created. Paste the following code into a cell and run it:

%%pyspark
# Load the products CSV file from the data lake into a DataFrame
df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/product_data/products.csv',
    format='csv',
    header=True
)
display(df.limit(10))

Note that the first run will take several minutes to complete, because the Spark pool has to start up.

Then run:

# Count the number of products in each category
df_counts = df.groupBy(df.Category).count()
display(df_counts)

The result can also be rendered as a chart via the chart view in the notebook's results pane.
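
Alternatively, the chart can be produced in code. A minimal sketch, assuming pandas and matplotlib are available on the pool (both are included in the default Synapse runtime):

%%pyspark
import matplotlib.pyplot as plt

# The aggregate is small, so it is safe to collect it to pandas for plotting
pdf = df_counts.toPandas()
plt.bar(pdf['Category'], pdf['count'])
plt.xlabel('Category')
plt.ylabel('Number of products')
plt.show()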

Overview

The SparkContext connects to the cluster manager, which allocates resources across applications using an implementation of Apache Hadoop YARN.
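
In a Synapse notebook the session is created automatically as spark, and the underlying SparkContext can be inspected from it. A minimal sketch:

%%pyspark
# The pre-built SparkSession wraps the SparkContext
sc = spark.sparkContext
print(sc.appName)             # application name registered with the cluster manager
print(sc.master)              # cluster manager endpoint the context connected to
print(sc.defaultParallelism)  # parallelism across the allocated executors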