Don't Stop at Pandas and Sklearn! Get Started with Spark DataFrames and Big Data ML using PySpark
Traditional data processing tools often fall short in big data projects, where the volume of data can run into hundreds of gigabytes.
This is because a typical machine has only a few gigabytes of RAM, so loading a dataset of that scale into memory for processing is simply not possible.
This roadblock in analyzing large datasets is not just a matter of limited computing power; it also stems from the limitations of traditional data analysis libraries like NumPy, Pandas, and Sklearn.
This is because these tools are designed primarily for single-node processing. As a result, they struggle to seamlessly scale across multiple machines.
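To make this concrete, here is a minimal sketch of what hitting that wall looks like with Pandas. The file name "transactions.csv" is hypothetical, standing in for a CSV far larger than the machine's RAM:

```python
import pandas as pd

# Hypothetical file: a CSV far larger than the machine's RAM.
# A plain pd.read_csv("transactions.csv") would try to load everything
# into memory at once and typically fail with a MemoryError.

# The usual single-machine workaround is chunked reading, which quickly
# becomes awkward for anything beyond simple aggregations:
total_rows = 0
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    total_rows += len(chunk)  # each chunk is an ordinary in-memory DataFrame

print(total_rows)
```

Chunking works for simple row counts or sums, but joining, grouping, or training a model across chunks on a single machine gets painful fast, which is exactly the gap distributed tools fill.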
Spark Basics
Before getting into any technical hands-on details about Spark, let’s cover a few fundamentals:
What is big data?
Imagine we have a dataset that comfortably resides on a local computer, falling within the range of 0 to 32 gigabytes, or perhaps extending to 40-50 gigabytes, depending on the available RAM.
Even if the dataset expands to, say, 30 to 60 gigabytes, it is often still feasible to acquire enough RAM to accommodate the entire dataset in memory.
A distributed approach becomes imperative only when the dataset grows too voluminous for a single machine, or when the efficiency of centralized storage starts to diminish.
And here's where Spark becomes useful.
When we find ourselves considering Spark, we’ve reached a point where fitting all our data into the RAM of a single machine is no longer practical.
Spark, with its distributed computing capabilities, lets us address the challenges posed by increasingly substantial datasets by offering a scalable and efficient solution for big data processing.
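As a quick preview of what that looks like in practice, here is a minimal PySpark sketch, assuming a local pyspark installation and the same hypothetical "transactions.csv" file:

```python
from pyspark.sql import SparkSession

# Entry point to Spark's DataFrame API: create (or reuse) a SparkSession.
spark = SparkSession.builder.appName("getting-started").getOrCreate()

# Reads are lazy: Spark splits the file into partitions and distributes the
# work across the available cores/executors rather than loading it all at once.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

df.printSchema()
print(df.count())  # an action like count() triggers the distributed computation
```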
We shall come back to Spark shortly. Before that, let’s spend some time understanding the differences between local and distributed computing.
Local vs. distributed computing
While the names of these approaches are largely self-explanatory, let’s briefly define them in case you are new to them.
In a local system, which you are likely to be familiar with and use regularly, operations are confined to a single machine — a solitary computer with a unified RAM and hard drive.
In a distributed system, there exists a primary computer, often referred to as the master node, alongside the distribution of both data and computational tasks to additional computers in the network.
The critical contrast lies in the capabilities of these systems, which we shall discuss next.
Local System Limitations and Distributed System Advantages
In a local system, processing power and storage are confined to the resources of that single machine, dictated by its number of cores and its memory and disk capacity.
Distributed systems, however, leverage the combined power of multiple machines, allowing for a substantial increase in processing cores and capabilities compared to a robust local machine.
For instance, as illustrated in the diagram, a local machine might have five cores.
However, in a distributed system, the strategy involves aggregating less powerful machines, i.e., servers with lower specifications, and distributing both data and computations across this network.
This approach harnesses the inherent strength of distributed systems.
I often like to relate this idea to ensemble modeling in machine learning, wherein weak models come together to produce a powerful model.
Scaling becomes a pivotal advantage of distributed systems over local systems.
More specifically, scaling a distributed system simply means adding more machines to the network, and each machine added directly increases the available processing power.
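To give a sense of what "adding machines" means in practice with Spark, cluster size is largely a matter of configuration rather than code changes. The numbers below are purely illustrative, and the exact settings depend on the cluster manager (e.g., YARN or Kubernetes):

```python
from pyspark.sql import SparkSession

# Purely illustrative numbers; the point is that scaling out is mostly a
# matter of requesting more executors, not rewriting the analysis code.
spark = (
    SparkSession.builder
    .appName("scaling-demo")
    .config("spark.executor.instances", "10")  # how many executors to request
    .config("spark.executor.cores", "4")       # CPU cores per executor
    .config("spark.executor.memory", "8g")     # memory per executor
    .getOrCreate()
)
```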
What is Hadoop?
At its core, Hadoop is a distributed storage and processing framework designed to handle vast amounts of data across clusters (groups) of commodity hardware.