Apache Spark
Kishor Kumar Krishna
Apache Spark is an in-memory framework for large-scale distributed data processing. Known for its speed, it significantly outperforms traditional Hadoop MapReduce, which makes it a popular choice for big data workloads.
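As a quick orientation, here is a minimal PySpark sketch of the in-memory model (the app name and row count are arbitrary): a dataset is cached once and reused across several actions.

from pyspark.sql import SparkSession

# A SparkSession is the entry point to all Spark functionality.
spark = SparkSession.builder.appName("QuickStart").master("local[*]").getOrCreate()

df = spark.range(1_000_000)   # a distributed dataset of ids 0..999999
df.cache()                    # keep the data in memory between actions

# Both actions below reuse the cached data instead of recomputing it.
print(df.count())                          # 1000000
print(df.filter(df.id % 2 == 0).count())   # 500000

spark.stop()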
Key Libraries and Components
MLlib
Description: MLlib is Spark's scalable machine learning library.
Key Features: It includes common algorithms and utilities like classification, regression, clustering, collaborative filtering, dimensionality reduction, and underlying optimization primitives.
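A minimal MLlib sketch in PySpark; the tiny inline dataset and its column names are invented for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("MLlibSketch").getOrCreate()

# Toy dataset; a real workload would load from a distributed source.
df = spark.createDataFrame(
    [(0.0, 1.0, 0), (1.0, 0.0, 1), (0.5, 0.5, 1), (0.1, 0.9, 0)],
    ["f1", "f2", "label"],
)

# MLlib estimators expect the features packed into a single vector column.
assembled = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(assembled)
model.transform(assembled).select("label", "prediction").show()

spark.stop()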
Spark SQL
Description: Spark SQL is a module for working with structured data.
Key Features: It allows querying data with standard SQL as well as HiveQL, the Apache Hive variant of SQL. It also supports a variety of data sources, including Parquet, JSON, and ORC.
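A short Spark SQL sketch in PySpark; the people rows are made up, and the commented-out Parquet path is hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SqlSketch").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cara", 29)],
    ["name", "age"],
)

# Register the DataFrame as a temporary view so plain SQL can query it.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()

# The same engine reads columnar sources directly (path is hypothetical):
# events = spark.read.parquet("C:/data/events.parquet")

spark.stop()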
Structured Streaming
Description: Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
Key Features: It allows you to process and analyze streaming data in real-time. It provides a high-level API for stream processing and integrates seamlessly with batch processing.
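A small Structured Streaming sketch, using the built-in rate source as a stand-in for a real stream such as Kafka; the window length and run time are arbitrary.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("StreamSketch").getOrCreate()

# The "rate" source emits timestamped rows at a fixed pace; it stands in
# for a real stream such as Kafka or socket input.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Count events per 10-second tumbling window.
counts = stream.groupBy(window(stream.timestamp, "10 seconds")).count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination(30)  # run for ~30 seconds, then stop
query.stop()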
GraphX
Description: GraphX is Spark's API for graphs and graph-parallel computation.
Key Features: It simplifies graph processing and provides a set of graph algorithms (e.g., PageRank) and utilities for manipulating graphs. GraphX is exposed through Spark's Scala API; Python users typically turn to the separate GraphFrames package.
Apache Spark Characteristics
Speed: in-memory computation and DAG-based query execution make Spark dramatically faster than disk-based MapReduce.
Ease of Use: concise, high-level APIs (RDDs, DataFrames, Datasets) are available in several languages.
Modularity: SQL, streaming, machine learning, and graph workloads all run on a single unified engine.
Extensibility: Spark decouples compute from storage and can read from and write to many sources, such as HDFS, Amazon S3, JDBC databases, and Kafka.
Key Components of Apache Spark
Spark SQL and DataFrames/Datasets
Spark Streaming (Structured Streaming)
Machine Learning (MLlib)
Graph Processing (GraphX)
Supported Programming Languages
Scala
Scala is Spark’s native language, offering concise syntax and functional programming capabilities. It allows developers to write efficient, expressive, and type-safe code. Spark APIs are first designed in Scala, making it the most feature-complete language for Spark development.
SQL
Spark SQL allows users to query structured data using SQL syntax. It supports a subset of the ANSI SQL standard and integrates with Spark’s Catalyst optimizer, providing efficient query execution. Spark SQL can be used interactively in the Spark shell or through programmatic APIs.
Python
PySpark is the Python API for Spark, enabling developers to write Spark applications in Python. It exposes Spark's core functionality, including Spark SQL, DataFrames, and MLlib (the typed Dataset API is available only in Scala and Java). It's popular among data scientists for its simplicity and integration with libraries like pandas and NumPy.
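A brief sketch of the pandas interoperability mentioned above; the toy DataFrame is made up.

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PandasInterop").getOrCreate()

pdf = pd.DataFrame({"name": ["Alice", "Bob"], "score": [0.9, 0.7]})

# Promote a local pandas DataFrame to a distributed Spark DataFrame...
sdf = spark.createDataFrame(pdf)
sdf.filter(sdf.score > 0.8).show()

# ...and collect (small) results back into pandas for local analysis.
local = sdf.toPandas()
print(local)

spark.stop()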
Java
Spark provides Java APIs for its core functionalities, allowing developers to build Spark applications in Java. While the Java API is slightly more verbose than Scala or Python, it offers type safety and seamless integration with existing Java codebases.
R
SparkR is the R API for Spark, designed for data scientists who prefer using R for statistical analysis and data visualization. SparkR provides bindings for Spark’s DataFrame and MLlib functionalities, enabling scalable data processing and machine learning in R.
Apache Spark Installation on Windows
To set up Apache Spark (PySpark) on a Windows PC, install the JDK, Python, Hadoop (winutils), and Apache Spark. Installation links and steps are below:
PySpark installation steps on macOS: https://sparkbyexamples.com/pyspark/h...
1. Download JDK: https://www.oracle.com/in/java/techno...
2. Download Python: https://www.python.org/downloads/
3. Download Spark: https://spark.apache.org/downloads.html
Winutils repo link: https://github.com/steveloughran/winu...
Environment Variables:
HADOOP_HOME = C:\hadoop
JAVA_HOME = C:\java\jdk
SPARK_HOME = C:\spark\spark-3.3.1-bin-hadoop2
PYTHONPATH = %SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src;%PYTHONPATH%
Required Paths (append to PATH):
%SPARK_HOME%\bin
%HADOOP_HOME%\bin
%JAVA_HOME%\bin
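Once the variables and paths are set, a short script can confirm the installation works; the file and app names below are our own choice.

# verify_spark.py -- run with `python verify_spark.py` after setting
# the environment variables above (file name is our own choice).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("InstallCheck").master("local[*]").getOrCreate()
print("Spark version:", spark.version)
spark.createDataFrame([(1, "ok")], ["id", "status"]).show()
spark.stop()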