Exam 203 back-end services Spark
The second back-end service is Spark.
Languages supported in Spark include Python, Scala, Java, SQL, and C#.
To run Spark code in Synapse Studio, first create a Spark pool under the Manage - Apache Spark pools tab. Then, in the Develop tab, create a new Notebook and attach the Spark pool to it. Paste the following code into a cell and run it:
<pre>
%%pyspark
df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/product_data/products.csv'
    , format='csv'
    , header=True
)
display(df.limit(10))
</pre>
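With only header=True, Spark reads every column as a string. A minimal sketch of loading the same file with an explicit schema instead; the column names ProductID, ProductName, and ListPrice are assumptions about the file's layout (only Category is confirmed by the aggregation below), so adjust them to match the actual data:

<pre>
%%pyspark
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

# Assumed column layout for products.csv; adjust to the real file.
product_schema = StructType([
    StructField('ProductID', IntegerType()),
    StructField('ProductName', StringType()),
    StructField('Category', StringType()),
    StructField('ListPrice', FloatType())
])

df = spark.read.load('abfss://files@datalakexxxxxxx.dfs.core.windows.net/product_data/products.csv'
    , format='csv'
    , schema=product_schema
    , header=True
)
display(df.limit(10))
</pre>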
Note that the first run can take several minutes to complete, because the Spark pool has to start up.
Then run:
<pre>
df_counts = df.groupby(df.Category).count()
display(df_counts)
</pre>
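Since SQL is among the supported languages, the same aggregation can also be written as Spark SQL by first exposing the DataFrame as a temporary view. A minimal sketch; the view name products is an arbitrary choice, not something from the lab:

<pre>
%%pyspark
# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView('products')
</pre>

Then, in a new cell, switch languages with the %%sql magic:

<pre>
%%sql
-- Same per-category count as the PySpark version above.
SELECT Category, COUNT(*) AS product_count
FROM products
GROUP BY Category
</pre>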