Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at the University of California, Berkeley, and later donated to the Apache Software Foundation. Spark is not a modified version of Hadoop, nor does it depend on Hadoop, because it has its own cluster management; it uses Hadoop (HDFS) only for storage. The Spark ecosystem consists of the following components:
- Spark Core: The foundational component that provides distributed task dispatching, scheduling, and basic I/O functionalities.
- Spark SQL: Allows SQL queries and integrates relational processing with Spark's functional programming API.
- Spark Streaming: Enables near-real-time processing of streaming data by handling it in small micro-batches.
- MLlib: A scalable machine learning library offering various algorithms for classification, regression, clustering, and more.
- GraphX: Used for graph processing and analysis, suitable for social network analysis, fraud detection, and more.
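To make the Spark SQL component concrete, here is a minimal sketch of how relational processing and SQL queries mix in one program. It assumes Spark is available on the classpath and runs in local mode; the table name `people` and the sample rows are purely illustrative.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    // Entry point for Spark SQL; "local[*]" runs on all local cores
    // (on a real cluster the master URL would differ).
    val spark = SparkSession.builder()
      .appName("SparkSqlSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Build a small DataFrame and register it as a temporary SQL view.
    val people = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // The same query expressed two ways: plain SQL, and the DataFrame API.
    spark.sql("SELECT name FROM people WHERE age > 30").show()
    people.filter($"age" > 30).select("name").show()

    spark.stop()
  }
}
```

Both calls produce the same result because Spark SQL compiles SQL text and DataFrame operations into the same logical plan, which is then optimized and executed by Spark Core.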