How to Add Custom Spark Listener Logs to the AWS EMR UI

Introduction

AWS EMR (Elastic MapReduce) provides a powerful platform for processing large datasets using popular big data frameworks like Apache Spark. When executing Spark jobs on AWS EMR, the UI provides useful links to the standard stdout, stderr, and controller logs for each step, allowing easy access to the execution details of the jobs. However, there are cases when you may want to add additional logging specific to your Spark job, such as custom metrics collected by a SparkListener.

In this article, we will discuss how to redirect your custom SparkListener logs to appear in the AWS EMR UI alongside the existing stdout, stderr, and controller logs. This approach will allow you to log custom metrics, statistics, or insights during the job execution and have them easily accessible in the AWS EMR step UI.

Understanding AWS EMR Step Logs

By default, the AWS EMR UI provides links to the following log types for each Spark step:

  1. stdout: Standard output logs generated by your Spark application.
  2. stderr: Standard error logs, typically containing warnings or error messages.
  3. Controller logs: Logs that contain step-level information about the execution process.

These logs are automatically available in the EMR UI once a step completes or fails. However, there is no built-in mechanism to add additional custom log links to the EMR UI. Fortunately, there is a workaround: we can redirect custom logs to either stdout or stderr, which will automatically be linked in the EMR UI.
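To see the mechanism in isolation, here is a tiny, self-contained sketch (the object name and message are made up): anything the driver process writes to stderr ends up under the step's stderr link in the EMR console.

object StderrDemo {
  def main(args: Array[String]): Unit = {
    // Anything written to stderr is captured by EMR and linked under the step's stderr log
    System.err.println("custom metric: completedTasks=42")
  }
}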

Objective

In this article, we will focus on:

  • Redirecting custom SparkListener logs to stderr so that they appear in the AWS EMR UI.
  • Logging additional information such as S3 log paths for custom logs collected by the listener.

By the end of this article, you'll have the knowledge to add your own custom logs to the EMR UI with minimal effort.

Step-by-Step Guide

Step 1: Set Up the SparkListener

First, create a custom SparkListener that captures the specific metrics or logs you want to monitor during the Spark job. For example, you might capture statistics such as the number of completed tasks, stages, job duration, or any specific business metrics.

Here’s a basic example of a SparkListener that collects and logs job statistics:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import org.apache.log4j.Logger

class CustomSparkListener extends SparkListener {
  val logger: Logger = Logger.getLogger(getClass)

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    logger.info(s"Job ${jobStart.jobId} started at ${jobStart.time}")
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    logger.info(s"Job ${jobEnd.jobId} ended with result: ${jobEnd.jobResult}") // Log the path to custom listener logs stored in S3 
    val s3LogPath = "s3://<your-bucket>/spark_logs/listener.log"
    logger.error(s"Listener logs available at: $s3LogPath")
  }
}        



In this example, the CustomSparkListener logs the job start and end events, along with a link to the S3 location where additional logs are stored.
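The example assumes the detailed listener output has already been written to that S3 path; how it gets there is up to you. One possible approach, shown purely as a sketch using the Hadoop FileSystem API available on EMR nodes (the ListenerLogUploader object and its method are hypothetical), is to upload the accumulated output from the listener itself:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListenerLogUploader {
  // Hypothetical helper: writes accumulated listener output to an S3 path.
  // Relies on the S3 filesystem implementation (EMRFS) that EMR provides.
  def upload(content: String, s3Path: String): Unit = {
    val uri = new URI(s3Path)
    val fs  = FileSystem.get(uri, new Configuration())
    val out = fs.create(new Path(uri), true) // overwrite if the object exists
    try out.write(content.getBytes("UTF-8"))
    finally out.close()
  }
}

A natural place to call such a helper is onApplicationEnd, once the listener has finished collecting its metrics.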

Step 2: Modify log4j.properties to Redirect Logs to stderr

To ensure that the logs generated by your listener appear in the EMR UI under the stderr link, you need to configure Log4j to direct the logs to stderr. This can be done by modifying the log4j.properties file in your project.

Here’s an example of a log4j.properties file that sends logs to stderr:

# Root logger option
log4j.rootLogger=INFO, console

# Redirect logs to console (stderr)
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c - %m%n



This configuration sends all logs (including those from your SparkListener) to stderr, ensuring they are displayed in the EMR UI under the stderr link.
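One caveat: newer EMR releases ship Spark builds that use Log4j 2, where the file is named log4j2.properties and the syntax differs. A rough equivalent, offered only as a sketch to verify against your EMR and Spark versions, would be:

# Sketch of an equivalent log4j2.properties (verify against your EMR release)
rootLogger.level = info
rootLogger.appenderRef.stderr.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{ISO8601} [%t] %-5p %c - %m%n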

Step 3: Configure Your Spark Job to Use the Custom Listener

Now, you need to make sure your Spark job uses the custom listener during execution. You can add the listener to your Spark job using the spark.extraListeners configuration option.

For example, if you’re submitting your job via spark-submit, you can specify the listener as follows:

spark-submit \
  --class com.example.YourSparkJob \
  --conf spark.extraListeners=com.example.CustomSparkListener \
  your-spark-job.jar

This ensures that the CustomSparkListener will be loaded and active throughout the job execution.
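If you prefer to keep this setting with the job code rather than on the command line, the same configuration key can be set when building the session. A minimal sketch (the object name and app name are placeholders):

import org.apache.spark.sql.SparkSession

object YourSparkJob {
  def main(args: Array[String]): Unit = {
    // Registers the listener via configuration instead of --conf on spark-submit
    val spark = SparkSession.builder()
      .appName("YourSparkJob")
      .config("spark.extraListeners", "com.example.CustomSparkListener")
      .getOrCreate()

    // ... your job logic ...

    spark.stop()
  }
}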

Step 4: Access Custom Logs in the EMR UI

Once the job is complete (or fails), navigate to the Steps section of your EMR cluster in the AWS Management Console. Under the specific step you ran, you’ll find links to the stdout, stderr, and controller logs.

Since we’ve redirected the custom logs from the CustomSparkListener to stderr, you can access them by clicking the stderr link in the EMR UI. In this log, you’ll also find the custom S3 path where the listener logs are stored.

Example of what you might see in the stderr log:

2024-10-13 10:15:00,000 [main] INFO  com.example.CustomSparkListener - Job 0 started at 1609459200000
2024-10-13 10:25:00,000 [main] INFO  com.example.CustomSparkListener - Job 0 ended with result: JobSucceeded
2024-10-13 10:25:00,000 [main] ERROR com.example.CustomSparkListener - Listener logs available at: s3://<your-bucket>/spark_logs/listener.log

Step 5: Sync Logs to S3 (Optional)

If you want the logs to be available in S3, configure log archiving between the cluster and S3. When a log URI is set for the cluster, EMR automatically syncs each step's stderr and stdout logs to that location. If you have pointed EMR at a custom S3 location, you can use the same location for storing the listener logs.

Ensure the following options are set during cluster creation (only the logging-related flags are shown; your usual options such as release label and instance configuration still apply):

aws emr create-cluster \
  --log-uri s3://<your-bucket>/spark_logs/ \
  --enable-debugging

This ensures that both system and custom logs are synced to the specified S3 location.
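To confirm the sync after a run, you can list the log prefix from the CLI. EMR writes step logs under a prefix that includes the cluster and step IDs, so a recursive listing (the bucket name is a placeholder) is the quickest check:

aws s3 ls s3://<your-bucket>/spark_logs/ --recursive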

Conclusion

In this article, we demonstrated how to add custom logs generated by a SparkListener to the AWS EMR step UI. By configuring Log4j to redirect logs to stderr, we were able to leverage the existing EMR UI capabilities to display custom metrics and logs without touching the job's core logic.

This approach allows you to capture and access detailed job execution metrics, including S3 log paths, in a non-intrusive way. By ensuring that the logs are printed to stderr, they become accessible alongside the standard output and error logs in the EMR UI, providing deeper insights into job performance and execution.

This technique can be particularly useful when running large-scale Spark applications in production on AWS EMR, where easy access to detailed logs is essential for debugging and performance tuning.
