Integrating Spark with Spring Boot

Source: https://relishcode.com

Of late, I have started relying more and more on Spring Boot for increased productivity. It gets the boilerplate out of the way much faster.

For example:

  • Automatic configuration for application dependencies such as Spring REST and JPA
  • Starter dependencies that save a lot of time when setting up a project
  • The application.properties file, which I use to externalize my application configuration with default values
  • Executable jars with an embedded container out of the box, which save a lot of time for prototypes

For one of my projects, I needed to use Apache Spark and started missing Spring Boot from day one. It took me some time to get the two working together, and I felt it was worth capturing in a blog.

Problem #1

If you simply include Spark and Spring Boot dependencies in the same project as shown below:

<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.11</artifactId>
   <version>2.1.0</version>
   <scope>provided</scope>
</dependency>
<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-sql_2.11</artifactId>
   <version>2.1.0</version>
   <scope>provided</scope>
</dependency>
<dependency>
   <groupId>org.springframework.boot</groupId>
   <artifactId>spring-boot-starter</artifactId>
</dependency>

You will see the following error:

java.lang.IllegalStateException: Detected both log4j-over-slf4j.jar AND bound slf4j-log4j12.jar on the class path, preempting StackOverflowError

Root cause:

Both Spark and Spring Boot package their own logging libraries, and the two conflict when they end up on the same classpath.

Solution:

You need to remove the logging library from either of them. In my case, since I need to use Spark binaries present on the cluster, I had to remove logging from Spring Boot. Here is my modified Spring Boot dependency:

<dependency>
   <groupId>org.springframework.boot</groupId>
   <artifactId>spring-boot-starter</artifactId>
   <exclusions>
      <exclusion>
         <groupId>org.springframework.boot</groupId>
         <artifactId>spring-boot-starter-logging</artifactId>
      </exclusion>
   </exclusions>
</dependency>
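If you want to confirm whether the conflict is still present after changing the dependencies, a small classpath check can help. The sketch below is only an illustration: the class name LoggingBindingCheck and its methods are hypothetical, and it assumes (to the best of my knowledge) that log4j-over-slf4j ships org.apache.log4j.Log4jLoggerFactory and that slf4j-log4j12 ships org.slf4j.impl.Log4jLoggerFactory.

```java
class LoggingBindingCheck {

    // Assumed marker class for log4j-over-slf4j (routes log4j calls to SLF4J).
    static final String BRIDGE_CLASS = "org.apache.log4j.Log4jLoggerFactory";

    // Assumed marker class for slf4j-log4j12 (routes SLF4J calls to log4j).
    static final String BINDING_CLASS = "org.slf4j.impl.Log4jLoggerFactory";

    // Returns true if the class can be located, without running its static initializers.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, LoggingBindingCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // The conflict exists only when both jars are visible at the same time.
    static boolean hasConflict() {
        return isOnClasspath(BRIDGE_CLASS) && isOnClasspath(BINDING_CLASS);
    }

    public static void main(String[] args) {
        System.out.println("log4j-over-slf4j present: " + isOnClasspath(BRIDGE_CLASS));
        System.out.println("slf4j-log4j12 present:    " + isOnClasspath(BINDING_CLASS));
        System.out.println("conflict:                 " + hasConflict());
    }
}
```

Run it with the same classpath as the application. Passing false as the second argument to Class.forName avoids initializing the SLF4J binding, so the check itself should not trigger the exception.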

Problem #2

Now, if you run your application, chances are you won't see any error, yet your application still won't be initialized. For example, here is my sample code:

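import org.apache.spark.sql.SparkSession;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class SpringSampleApplication implements CommandLineRunner {

   public static void main(String[] args) {
      SpringApplication.run(SpringSampleApplication.class, args);
   }

   // Called by Spring Boot once the application context has started
   @Override
   public void run(String... args) throws Exception {
      // Create (or reuse) a local Spark session
      SparkSession sparkSession = SparkSession
            .builder()
            .appName("SparkWithSpring")
            .master("local")
            .getOrCreate();

      System.out.println("Spark Version: " + sparkSession.version());
   }
}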

When I run this, it does not print the Spark version.

Root cause:

Since we have removed logging from Spring Boot, we are now relying on Spark's logging. Startup is in fact failing because of one more problem, but without a logging configuration the Spring Boot errors never show up on the console.

Solution:

Add log4j.properties to src/main/resources, as shown below:

log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.Target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
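As far as I know, log4j 1.x does not complain about keys it does not recognize, so a mistyped prefix or an appender name that does not match the one on the root logger can go unnoticed. The following is a rough sanity check, not part of log4j itself. The class Log4jPropertiesCheck and its method are hypothetical, and it makes the simplifying assumption that every appender is attached to the root logger.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

class Log4jPropertiesCheck {

    private static final String APPENDER_PREFIX = "log4j.appender.";

    // Returns a list of human-readable problems; an empty list means nothing suspicious was found.
    static List<String> findProblems(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        List<String> problems = new ArrayList<>();
        Set<String> attached = new HashSet<>();

        // log4j.rootLogger has the form "LEVEL, appender1, appender2".
        String root = props.getProperty("log4j.rootLogger");
        if (root == null) {
            problems.add("log4j.rootLogger is not set");
        } else {
            String[] parts = root.split(",");
            for (int i = 1; i < parts.length; i++) {
                attached.add(parts[i].trim());
            }
        }

        for (String key : props.stringPropertyNames()) {
            if (!key.startsWith("log4j.")) {
                problems.add("key does not start with 'log4j.': " + key);
            } else if (key.startsWith(APPENDER_PREFIX)) {
                // The appender name is the segment right after "log4j.appender."
                String name = key.substring(APPENDER_PREFIX.length()).split("\\.")[0];
                if (!attached.contains(name)) {
                    problems.add("appender '" + name + "' is not attached to the root logger: " + key);
                }
            }
        }
        return problems;
    }
}
```

You could read the contents of src/main/resources/log4j.properties into a String and pass it to findProblems from a quick test. Configurations that attach appenders to specific loggers rather than the root logger would be flagged incorrectly by this sketch.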

Problem #3

Now, you should see the error below when running your application:

***************************
APPLICATION FAILED TO START
***************************

Description:

The Bean Validation API is on the classpath but no implementation could be found

Action:

Add an implementation, such as Hibernate Validator, to the classpath

Root cause:

Spark packages the Bean Validation API jar, which Spring Boot then tries to auto-configure even though no implementation is present.

Solution:

Add the bean validation dependency as shown below:

<dependency>
   <groupId>org.springframework.boot</groupId>
   <artifactId>spring-boot-starter-validation</artifactId>
</dependency>
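To see which jar actually brings the Bean Validation API onto the classpath, and whether an implementation is now available, you can ask the JVM where a class was loaded from. This is a minimal sketch: ClassOrigin is a hypothetical helper, and the class names javax.validation.Validator and org.hibernate.validator.HibernateValidator are my assumptions for the API and for the Hibernate Validator implementation.

```java
import java.security.CodeSource;

class ClassOrigin {

    // Describes where a class comes from: a jar or directory URL, the JDK itself, or nowhere.
    static String describe(String className) {
        try {
            // Locate the class without running its static initializers.
            Class<?> type = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            return source == null
                    ? "JDK or bootstrap class path"
                    : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not found";
        }
    }

    public static void main(String[] args) {
        // Assumed class names for the API and one implementation.
        System.out.println("API:            " + describe("javax.validation.Validator"));
        System.out.println("Implementation: " + describe("org.hibernate.validator.HibernateValidator"));
    }
}
```

When run with the application's classpath, the first line should point at the jar that supplies the API, and the second should stop reporting "not found" once the validation starter has been added.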

Now, when you run your application, it should be able to initialize Spring Boot and Spark Session together. In my case, it prints the Spark Version as expected along with other bootstrap messages from Spark:

.   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::       (v0.0.1-SNAPSHOT)

Spark Version: 2.1.0

Programmers typically use Spark with Scala. However, if you end up using Java and need to leverage Spring Boot, this article should get you going.

Thanks for reading!!!!

getnet A.

Software Engineering and Artificial Intelligence Professional

2 years ago

This good Idea!

Hello Neeraj, I am working integration of Springboot Scala application. When working locally with all the setup, sqlContext.sql works perfectly, but when i try to communicate spark sql hosted on remote server, sqlContext.sql keeps on waiting and doesn't return anything. Can you help me sort out this issue?

sujeet Kumar

Senior Consultant

3 years ago

Hii Neeraj Malhotra i have integrated spark dependency in my pom.xml of springboot , Application is working fine but the issye is i am not able to see springboot logs , in console spark logs are taking control and spring logs does not come in console. need your help

Tasneem Bodabhaiwala

Software Developer at Accenture

3 years ago

Hi Neeraj, I am new to spark with spring-boot. I have created a spring-boot application and trying to establish connection to hive using spark-session. i am running my spring-boot application using 'mvn spring-boot:run'. But i am not able to connect to hive. When i run my java class, only having code related to hive connection using sparksession. I am able to establish connection using spark-submit. How do i run my spring-boot application so that i can establish connection to hive using sparksession
