How to do simple reporting with Excel sheets using Apache Spark and Scala?
Ram Ghadiyaram
Technologist, Thought Leader, Mentor, Innovator, Speaker - Cloud | Databricks | Snowflake | Apache/PySpark | Big Data | Analytics | AI | ML | LLM at JPMorgan Chase & Co
Question: Can Spark data be published as an Excel sheet?
Yes. It can be achieved with a simple Spark plugin, crealytics/spark-excel, for reading and writing Excel files: a library (built on Apache POI) for dealing with Excel files from Apache Spark, for both Spark SQL and DataFrames. Kudos to the maintainers for the effort!
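As a quick taste of the plugin's read side before we get to writing, here is a minimal sketch, assuming an existing SparkSession named spark; the file path and sheet name are hypothetical, and the options are the ones the 0.12.x plugin exposes:

    // Read an Excel sheet into a DataFrame (sketch; path and sheet are placeholders)
    val workbook = spark.read
      .format("com.crealytics.spark.excel")
      .option("useHeader", "true")          // treat the first row as column names
      .option("inferSchema", "true")        // let the plugin infer column types
      .option("dataAddress", "'Sheet1'!A1") // sheet and top-left cell to start reading from
      .load("/path/to/workbook.xlsx")       // hypothetical path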
Let's see a small example with Spark and Scala. Credits for the example: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-multi-dimensional-aggregation.html. Thanks @jaceklaskowski!
I used a Maven project for this example, with Scala 2.11 and Spark 2.x.
<!-- https://mvnrepository.com/artifact/com.crealytics/spark-excel -->
<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.11</artifactId>
    <version>0.12.0</version>
</dependency>
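If you build with sbt instead of Maven, the equivalent dependency (same artifact and version) would be:

    // sbt: equivalent of the Maven dependency above
    libraryDependencies += "com.crealytics" %% "spark-excel" % "0.12.0"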
This is the code snippet I used to build the report with the Spark DataFrame API:
package com.examples

import org.apache.log4j.Level
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SparkExcelReport extends App with Logging {

  // Quieten Spark's own logging so the report output stays readable
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]")
    .getOrCreate()

  import spark.implicits._

  val sales = Seq(
    ("Dallas", 2016, 100d),
    ("Dallas", 2017, 120d),
    ("Sanjose", 2017, 200d),
    ("Plano", 2015, 50d),
    ("Plano", 2016, 50d),
    ("Newyork", 2016, 150d),
    ("Toronto", 2017, 50d)
  ).toDF("city", "year", "saleAmount")

  sales.printSchema()

  logInfo(
    """
      |
      | rollup multi-dimensional aggregate operator is an extension of groupBy operator
      | that calculates subtotals and a grand total across specified group of n + 1 dimensions
      | (with n being the number of columns as cols and col1 and 1 for where values become null, i.e. undefined).
      | rollup operator is commonly used for analysis over hierarchical data; e.g. total salary by department, division, and company-wide total.
    """.stripMargin)

  logInfo("Using Rollup (\"city\", \"year\")")
  val first = sales
    .rollup("city", "year")
    .agg(sum("saleAmount") as "saleAmount")
    .sort($"city".desc_nulls_last, $"year".asc_nulls_last)
  first.show

  first.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'RollupCityYear'!A1:F35")
    .option("useHeader", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("addColorColumns", "true")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  logInfo("Using .groupBy(\"city\", \"year\")")
  // The rollup above is semantically equivalent to the union of the
  // following three groupBy queries (second, third and fourth)
  val second = sales
    .groupBy("city", "year") // <-- subtotals (city, year)
    .agg(sum("saleAmount") as "saleAmount")

  second.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'GroupByCityYearSheet'!A1:F35")
    .option("useHeader", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("addColorColumns", "true")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")
  second.show

  logInfo("Using group by (\"city\")")
  val third = sales
    .groupBy("city") // <-- subtotals (city)
    .agg(sum("saleAmount") as "saleAmount")
    .select($"city", lit(null) as "year", $"saleAmount") // <-- year is null
  third.show

  third.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'GroupByCitySheet'!A1:F35")
    .option("useHeader", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  val fourth = sales
    .groupBy() // <-- grand total
    .agg(sum("saleAmount") as "saleAmount")
    .select(lit(null) as "city", lit(null) as "year", $"saleAmount") // <-- city and year are null
  fourth.show

  fourth.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'groupBySheet'!A1:F35")
    .option("useHeader", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  val finalDF = second
    .union(third)
    .union(fourth)
    .sort($"city".desc_nulls_last, $"year".asc_nulls_last)
  finalDF.show

  finalDF.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'unionSheet'!A1:F35")
    .option("useHeader", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")
}
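To sanity-check the generated workbook, you can read one of its sheets back with the same plugin. A minimal sketch, reusing the SparkSession and output path from the snippet above (the read options mirror the write side):

    // Read the 'unionSheet' sheet back into a DataFrame to verify the export
    val check = spark.read
      .format("com.crealytics.spark.excel")
      .option("useHeader", "true")
      .option("inferSchema", "true")
      .option("dataAddress", "'unionSheet'!A1")
      .load("./src/main/salesReport/SalesReport.xlsx")
    check.show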
Result:
root
 |-- city: string (nullable = true)
 |-- year: integer (nullable = false)
 |-- saleAmount: double (nullable = false)

2019-08-31 00:48:47 INFO SparkExcelReport:54 -

 rollup multi-dimensional aggregate operator is an extension of groupBy operator
 that calculates subtotals and a grand total across specified group of n + 1 dimensions
 (with n being the number of columns as cols and col1 and 1 for where values become null, i.e. undefined).
 rollup operator is commonly used for analysis over hierarchical data; e.g. total salary by department, division, and company-wide total.

2019-08-31 00:48:47 INFO SparkExcelReport:54 - Using Rollup ("city", "year")
+-------+----+----------+
|   city|year|saleAmount|
+-------+----+----------+
|Toronto|2017|      50.0|
|Toronto|null|      50.0|
|Sanjose|2017|     200.0|
|Sanjose|null|     200.0|
|  Plano|2015|      50.0|
|  Plano|2016|      50.0|
|  Plano|null|     100.0|
|Newyork|2016|     150.0|
|Newyork|null|     150.0|
| Dallas|2016|     100.0|
| Dallas|2017|     120.0|
| Dallas|null|     220.0|
|   null|null|     720.0|
+-------+----+----------+

2019-08-31 00:48:55 INFO SparkExcelReport:54 - Using .groupBy("city", "year")
+-------+----+----------+
|   city|year|saleAmount|
+-------+----+----------+
|Toronto|2017|      50.0|
|Sanjose|2017|     200.0|
| Dallas|2017|     120.0|
|  Plano|2015|      50.0|
|Newyork|2016|     150.0|
| Dallas|2016|     100.0|
|  Plano|2016|      50.0|
+-------+----+----------+

2019-08-31 00:48:57 INFO SparkExcelReport:54 - Using group by ("city")
+-------+----+----------+
|   city|year|saleAmount|
+-------+----+----------+
| Dallas|null|     220.0|
|  Plano|null|     100.0|
|Newyork|null|     150.0|
|Toronto|null|      50.0|
|Sanjose|null|     200.0|
+-------+----+----------+

+----+----+----------+
|city|year|saleAmount|
+----+----+----------+
|null|null|     720.0|
+----+----+----------+

+-------+----+----------+
|   city|year|saleAmount|
+-------+----+----------+
|Toronto|2017|      50.0|
|Toronto|null|      50.0|
|Sanjose|2017|     200.0|
|Sanjose|null|     200.0|
|  Plano|2015|      50.0|
|  Plano|2016|      50.0|
|  Plano|null|     100.0|
|Newyork|2016|     150.0|
|Newyork|null|     150.0|
| Dallas|2016|     100.0|
| Dallas|2017|     120.0|
| Dallas|null|     220.0|
|   null|null|     720.0|
+-------+----+----------+
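As an aside, the same rollup can also be expressed in Spark SQL. A small sketch, assuming the sales DataFrame from the snippet above:

    // Register the DataFrame as a view and run the rollup in SQL
    sales.createOrReplaceTempView("sales")
    spark.sql(
      """SELECT city, year, sum(saleAmount) AS saleAmount
        |FROM sales
        |GROUP BY city, year WITH ROLLUP""".stripMargin).show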
Finally, the Excel workbook is generated, as shown in these screenshots...
Conclusion: We have generated our Excel reports. This approach works well for applications that need ready-to-share reports quickly.
If you found this article useful, please don't forget to like it.