How to Add Custom Spark Listener Logs to the AWS EMR UI

Introduction

AWS EMR (Elastic MapReduce) provides a powerful platform for processing large datasets using popular big data frameworks like Apache Spark. When executing Spark jobs on AWS EMR, the UI provides useful links to the standard stdout, stderr, and controller logs for each step, allowing easy access to the execution details of the jobs. However, there are cases when you may want to add additional logging specific to your Spark job, such as custom metrics collected by a SparkListener.

In this article, we will discuss how to redirect your custom SparkListener logs to appear in the AWS EMR UI alongside the existing stdout, stderr, and controller logs. This approach will allow you to log custom metrics, statistics, or insights during the job execution and have them easily accessible in the AWS EMR step UI.

Understanding AWS EMR Step Logs

By default, the AWS EMR UI provides links to the following log types for each Spark step:

  1. stdout: Standard output logs generated by your Spark application.
  2. stderr: Standard error logs, typically containing warnings or error messages.
  3. Controller logs: Logs that contain step-level information about the execution process.

These logs are automatically available in the EMR UI once a step completes or fails. However, there is no built-in mechanism to add additional custom log links to the EMR UI. Fortunately, there is a workaround: we can redirect custom logs to either stdout or stderr, which will automatically be linked in the EMR UI.
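To see the mechanism in isolation, here is a tiny, self-contained sketch (the object name and message are made up): anything the driver process writes to stderr ends up under the step's stderr link in the EMR console.

object StderrDemo {
  def main(args: Array[String]): Unit = {
    // Anything written to stderr is captured by EMR and linked under the step's stderr log
    System.err.println("custom metric: completedTasks=42")
  }
}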

Objective

In this article, we will focus on:

  • Redirecting custom SparkListener logs to stderr so that they appear in the AWS EMR UI.
  • Logging additional information such as S3 log paths for custom logs collected by the listener.

By the end of this article, you'll have the knowledge to add your own custom logs to the EMR UI with minimal effort.

Step-by-Step Guide

Step 1: Set Up the SparkListener

First, create a custom SparkListener that captures the specific metrics or logs you want to monitor during the Spark job. For example, you might capture statistics such as the number of completed tasks, stages, job duration, or any specific business metrics.

Here’s a basic example of a SparkListener that collects and logs job statistics:

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import org.apache.log4j.Logger

class CustomSparkListener extends SparkListener {
  val logger: Logger = Logger.getLogger(getClass)

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    logger.info(s"Job ${jobStart.jobId} started at ${jobStart.time}")
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    logger.info(s"Job ${jobEnd.jobId} ended with result: ${jobEnd.jobResult}") // Log the path to custom listener logs stored in S3 
    val s3LogPath = "s3://<your-bucket>/spark_logs/listener.log"
    logger.error(s"Listener logs available at: $s3LogPath")
  }
}        



In this example, the CustomSparkListener logs the job start and end events, along with a link to the S3 location where additional logs are stored.
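The example assumes the detailed listener output has already been written to that S3 path; how it gets there is up to you. One possible approach, shown purely as a sketch using the Hadoop FileSystem API available on EMR nodes (the ListenerLogUploader object and its method are hypothetical), is to upload the accumulated output from the listener itself:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListenerLogUploader {
  // Hypothetical helper: writes accumulated listener output to an S3 path.
  // Relies on the S3 filesystem implementation (EMRFS) that EMR provides.
  def upload(content: String, s3Path: String): Unit = {
    val uri = new URI(s3Path)
    val fs  = FileSystem.get(uri, new Configuration())
    val out = fs.create(new Path(uri), true) // overwrite if the object exists
    try out.write(content.getBytes("UTF-8"))
    finally out.close()
  }
}

A natural place to call such a helper is onApplicationEnd, once the listener has finished collecting its metrics.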

Step 2: Modify log4j.properties to Redirect Logs to stderr

To ensure that the logs generated by your listener appear in the EMR UI under the stderr link, you need to configure Log4j to direct the logs to stderr. This can be done by modifying the log4j.properties file in your project.

Here’s an example of a log4j.properties file that sends logs to stderr:

# Root logger option
log4j.rootLogger=INFO, console

# Redirect logs to console (stderr)
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c - %m%n



This configuration sends all logs (including those from your SparkListener) to stderr, ensuring they are displayed in the EMR UI under the stderr link.
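One caveat: newer EMR releases ship Spark builds that use Log4j 2, where the file is named log4j2.properties and the syntax differs. A rough equivalent, offered only as a sketch to verify against your EMR and Spark versions, would be:

# Sketch of an equivalent log4j2.properties (verify against your EMR release)
rootLogger.level = info
rootLogger.appenderRef.stderr.ref = console

appender.console.type = Console
appender.console.name = console
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{ISO8601} [%t] %-5p %c - %m%n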

Step 3: Configure Your Spark Job to Use the Custom Listener

Now, you need to make sure your Spark job uses the custom listener during execution. You can add the listener to your Spark job using the spark.extraListeners configuration option.

For example, if you’re submitting your job via spark-submit, you can specify the listener as follows:

spark-submit \
  --class com.example.YourSparkJob \
  --conf spark.extraListeners=com.example.CustomSparkListener \
  your-spark-job.jar

This ensures that the CustomSparkListener will be loaded and active throughout the job execution.
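If you prefer to keep this setting with the job code rather than on the command line, the same configuration key can be set when building the session. A minimal sketch (the object name and app name are placeholders):

import org.apache.spark.sql.SparkSession

object YourSparkJob {
  def main(args: Array[String]): Unit = {
    // Registers the listener via configuration instead of --conf on spark-submit
    val spark = SparkSession.builder()
      .appName("YourSparkJob")
      .config("spark.extraListeners", "com.example.CustomSparkListener")
      .getOrCreate()

    // ... your job logic ...

    spark.stop()
  }
}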

Step 4: Access Custom Logs in the EMR UI

Once the job is complete (or fails), navigate to the Steps section of your EMR cluster in the AWS Management Console. Under the specific step you ran, you’ll find links to the stdout, stderr, and controller logs.

Since we’ve redirected the custom logs from the CustomSparkListener to stderr, you can access them by clicking the stderr link in the EMR UI. In this log, you’ll also find the custom S3 path where the listener logs are stored.

Example of what you might see in the stderr log:

2024-10-13 10:15:00,000 [main] INFO  com.example.CustomSparkListener - Job 0 started at 1609459200000
2024-10-13 10:25:00,000 [main] INFO  com.example.CustomSparkListener - Job 0 ended with result: JobSucceeded
2024-10-13 10:25:00,000 [main] ERROR com.example.CustomSparkListener - Listener logs available at: s3://<your-bucket>/spark_logs/listener.log

Step 5: Sync Logs to S3 (Optional)

If you want the logs to be available in S3, configure log archiving between the cluster and S3. When a log URI is set for the cluster, EMR automatically syncs each step's stderr and stdout logs to that location. If you have pointed EMR at a custom S3 location, you can use the same location for storing the listener logs.

Ensure the following options are set during cluster creation (only the logging-related flags are shown; your usual options such as release label and instance configuration still apply):

aws emr create-cluster \
  --log-uri s3://<your-bucket>/spark_logs/ \
  --enable-debugging

This ensures that both system and custom logs are synced to the specified S3 location.
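To confirm the sync after a run, you can list the log prefix from the CLI. EMR writes step logs under a prefix that includes the cluster and step IDs, so a recursive listing (the bucket name is a placeholder) is the quickest check:

aws s3 ls s3://<your-bucket>/spark_logs/ --recursive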

Conclusion

In this article, we demonstrated how to add custom logs generated by a SparkListener to the AWS EMR step UI. By configuring Log4j to redirect logs to stderr, we were able to leverage the existing EMR UI capabilities to display custom metrics and logs without touching the job's core logic.

This approach allows you to capture and access detailed job execution metrics, including S3 log paths, in a non-intrusive way. By ensuring that the logs are printed to stderr, they become accessible alongside the standard output and error logs in the EMR UI, providing deeper insights into job performance and execution.

This technique can be particularly useful when running large-scale Spark applications in production on AWS EMR, where easy access to detailed logs is essential for debugging and performance tuning.
