Exam 203 back-end services Spark

From MillerSql.com
Revision as of 19:50, 15 November 2024

The second back-end service is: Spark

Languages supported in Spark include Python, Scala, Java, SQL, and C#.

To run Spark code in Synapse Studio, first create a Spark pool in the Manage - Apache Spark pools tab. Then, in the Develop tab, create a new Notebook and attach it to the Spark pool you created. Then paste the following code into it and run:

%%pyspark
df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/product_data/products.csv',
                     format='csv',
                     header=True)
display(df.limit(10))

Note that the first time it runs, it will take several minutes to complete, because the Spark pool has to start up.
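With header=True the column names are taken from the file, but every column is still read as a string. A minimal sketch of supplying an explicit schema instead, so that columns get proper types (the column names and types here are assumptions, since the actual layout of products.csv is not shown):

%%pyspark
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Hypothetical schema -- adjust to match the real columns in products.csv
product_schema = StructType([
    StructField('ProductID', IntegerType()),
    StructField('ProductName', StringType()),
    StructField('Category', StringType()),
    StructField('ListPrice', DoubleType())
])

df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/product_data/products.csv',
                     format='csv',
                     schema=product_schema,
                     header=True)
display(df.limit(10))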

Then run:

df_counts = df.groupBy(df.Category).count()
display(df_counts)
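Since SQL is among the supported languages, the same aggregation can also be written as a Spark SQL query by first registering the DataFrame as a temporary view. A sketch using the df loaded above (the view name products is an assumption):

%%pyspark
# Register the DataFrame so it can be queried with Spark SQL
df.createOrReplaceTempView('products')

df_counts = spark.sql('SELECT Category, COUNT(*) AS count FROM products GROUP BY Category')
display(df_counts)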