How to do simple reporting with Excel sheets using Apache Spark and Scala?
Ram Ghadiyaram
Technologist, Thought Leader, Mentor, Innovator, Speaker - Cloud | Databricks | Snowflake | Apache/PySpark | Big Data | Analytics | AI | ML | LLM at JPMorgan Chase & Co
Question: Can Spark data be published as an Excel sheet?
Yes. It can be achieved with a simple Spark plugin, crealytics/spark-excel, for reading and writing Excel files: a library (built on Apache POI) for dealing with Excel files from Apache Spark, for both Spark SQL and DataFrames. Kudos to the maintainers for the effort!
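As a quick taste of the plugin's read side before we get to writing, here is a minimal sketch, assuming an existing SparkSession named spark; the file path and sheet name are hypothetical, and the options are the ones the 0.12.x plugin exposes:

    // Read an Excel sheet into a DataFrame (sketch; path and sheet are placeholders)
    val workbook = spark.read
      .format("com.crealytics.spark.excel")
      .option("useHeader", "true")          // treat the first row as column names
      .option("inferSchema", "true")        // let the plugin infer column types
      .option("dataAddress", "'Sheet1'!A1") // sheet and top-left cell to start reading from
      .load("/path/to/workbook.xlsx")       // hypothetical path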
Let's see a small example with Spark and Scala. Credits for the example: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-multi-dimensional-aggregation.html. Thanks @jaceklaskowski!
I used a Maven project for this example, with Scala 2.11 and Spark 2.x.
<!-- https://mvnrepository.com/artifact/com.crealytics/spark-excel -->
<dependency>
    <groupId>com.crealytics</groupId>
    <artifactId>spark-excel_2.11</artifactId>
    <version>0.12.0</version>
</dependency>
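If you build with sbt instead of Maven, the equivalent dependency (same artifact and version) would be:

    // sbt: equivalent of the Maven dependency above
    libraryDependencies += "com.crealytics" %% "spark-excel" % "0.12.0"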
This is the code snippet I used to build the report with the Spark DataFrame API:
package com.examples

import org.apache.log4j.Level
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SparkExcelReport extends App with Logging {

  // Quieten Spark's own logging so the report output stays readable
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]")
    .getOrCreate()

  import spark.implicits._

  val sales = Seq(
    ("Dallas", 2016, 100d),
    ("Dallas", 2017, 120d),
    ("Sanjose", 2017, 200d),
    ("Plano", 2015, 50d),
    ("Plano", 2016, 50d),
    ("Newyork", 2016, 150d),
    ("Toronto", 2017, 50d)
  ).toDF("city", "year", "saleAmount")

  sales.printSchema()

  logInfo(
    """
      |
      | rollup multi-dimensional aggregate operator is an extension of groupBy operator
      | that calculates subtotals and a grand total across specified group of n + 1 dimensions
      | (with n being the number of columns as cols and col1 and 1 for where values become null, i.e. undefined).
      | rollup operator is commonly used for analysis over hierarchical data; e.g. total salary by department, division, and company-wide total.
    """.stripMargin)

  logInfo("Using Rollup (\"city\", \"year\")")
  val first = sales
    .rollup("city", "year")
    .agg(sum("saleAmount") as "saleAmount")
    .sort($"city".desc_nulls_last, $"year".asc_nulls_last)
  first.show

  first.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'RollupCityYear'!A1:F35")
    .option("useHeader", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("addColorColumns", "true")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  logInfo("Using .groupBy(\"city\", \"year\")")
  // The rollup above is semantically equivalent to the union of the
  // following three groupBy queries (second, third and fourth)
  val second = sales
    .groupBy("city", "year") // <-- subtotals (city, year)
    .agg(sum("saleAmount") as "saleAmount")

  second.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'GroupByCityYearSheet'!A1:F35")
    .option("useHeader", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("addColorColumns", "true")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")
  second.show

  logInfo("Using group by (\"city\")")
  val third = sales
    .groupBy("city") // <-- subtotals (city)
    .agg(sum("saleAmount") as "saleAmount")
    .select($"city", lit(null) as "year", $"saleAmount") // <-- year is null
  third.show

  third.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'GroupByCitySheet'!A1:F35")
    .option("useHeader", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  val fourth = sales
    .groupBy() // <-- grand total
    .agg(sum("saleAmount") as "saleAmount")
    .select(lit(null) as "city", lit(null) as "year", $"saleAmount") // <-- city and year are null
  fourth.show

  fourth.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'groupBySheet'!A1:F35")
    .option("useHeader", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")

  val finalDF = second
    .union(third)
    .union(fourth)
    .sort($"city".desc_nulls_last, $"year".asc_nulls_last)
  finalDF.show

  finalDF.coalesce(1).write
    .format("com.crealytics.spark.excel")
    .option("dataAddress", "'unionSheet'!A1:F35")
    .option("useHeader", "true")
    .option("dateFormat", "yy-mmm-d")
    .option("timestampFormat", "mm-dd-yyyy hh:mm:ss")
    .option("treatEmptyValuesAsNulls", "false")
    .mode("append")
    .save("./src/main/salesReport/SalesReport.xlsx")
}
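To sanity-check the generated workbook, you can read one of its sheets back with the same plugin. A minimal sketch, reusing the SparkSession and output path from the snippet above (the read options mirror the write side):

    // Read the 'unionSheet' sheet back into a DataFrame to verify the export
    val check = spark.read
      .format("com.crealytics.spark.excel")
      .option("useHeader", "true")
      .option("inferSchema", "true")
      .option("dataAddress", "'unionSheet'!A1")
      .load("./src/main/salesReport/SalesReport.xlsx")
    check.show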
Result:
root
 |-- city: string (nullable = true)
 |-- year: integer (nullable = false)
 |-- saleAmount: double (nullable = false)

2019-08-31 00:48:47 INFO SparkExcelReport:54 -

 rollup multi-dimensional aggregate operator is an extension of groupBy operator
 that calculates subtotals and a grand total across specified group of n + 1 dimensions
 (with n being the number of columns as cols and col1 and 1 for where values become null, i.e. undefined).
 rollup operator is commonly used for analysis over hierarchical data; e.g. total salary by department, division, and company-wide total.

2019-08-31 00:48:47 INFO SparkExcelReport:54 - Using Rollup ("city", "year")
+-------+----+----------+
|   city|year|saleAmount|
+-------+----+----------+
|Toronto|2017|      50.0|
|Toronto|null|      50.0|
|Sanjose|2017|     200.0|
|Sanjose|null|     200.0|
|  Plano|2015|      50.0|
|  Plano|2016|      50.0|
|  Plano|null|     100.0|
|Newyork|2016|     150.0|
|Newyork|null|     150.0|
| Dallas|2016|     100.0|
| Dallas|2017|     120.0|
| Dallas|null|     220.0|
|   null|null|     720.0|
+-------+----+----------+

2019-08-31 00:48:55 INFO SparkExcelReport:54 - Using .groupBy("city", "year")
+-------+----+----------+
|   city|year|saleAmount|
+-------+----+----------+
|Toronto|2017|      50.0|
|Sanjose|2017|     200.0|
| Dallas|2017|     120.0|
|  Plano|2015|      50.0|
|Newyork|2016|     150.0|
| Dallas|2016|     100.0|
|  Plano|2016|      50.0|
+-------+----+----------+

2019-08-31 00:48:57 INFO SparkExcelReport:54 - Using group by ("city")
+-------+----+----------+
|   city|year|saleAmount|
+-------+----+----------+
| Dallas|null|     220.0|
|  Plano|null|     100.0|
|Newyork|null|     150.0|
|Toronto|null|      50.0|
|Sanjose|null|     200.0|
+-------+----+----------+

+----+----+----------+
|city|year|saleAmount|
+----+----+----------+
|null|null|     720.0|
+----+----+----------+

+-------+----+----------+
|   city|year|saleAmount|
+-------+----+----------+
|Toronto|2017|      50.0|
|Toronto|null|      50.0|
|Sanjose|2017|     200.0|
|Sanjose|null|     200.0|
|  Plano|2015|      50.0|
|  Plano|2016|      50.0|
|  Plano|null|     100.0|
|Newyork|2016|     150.0|
|Newyork|null|     150.0|
| Dallas|2016|     100.0|
| Dallas|2017|     120.0|
| Dallas|null|     220.0|
|   null|null|     720.0|
+-------+----+----------+
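As an aside, the same rollup can also be expressed in Spark SQL. A small sketch, assuming the sales DataFrame from the snippet above:

    // Register the DataFrame as a view and run the rollup in SQL
    sales.createOrReplaceTempView("sales")
    spark.sql(
      """SELECT city, year, sum(saleAmount) AS saleAmount
        |FROM sales
        |GROUP BY city, year WITH ROLLUP""".stripMargin).show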
Finally, the Excel workbook is generated, as shown in these screenshots...
Conclusion: We have generated our Excel reports. This approach works well for applications that need ready-to-share reports quickly.
If you found this article useful, please don't forget to like it.