Don't Stop at Pandas and Sklearn! Get Started with Spark DataFrames and Big Data ML using PySpark

Traditional data processing tools often fall short in big data projects, where the volume of data can run into hundreds of gigabytes.

This is because a typical machine has only a few gigabytes of RAM. As a result, loading a dataset of this scale into memory for processing is impractical.

This roadblock is not only a matter of insufficient computing power. It also stems from the limitations of traditional data analysis libraries such as NumPy, Pandas, and Sklearn.
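To make the memory constraint concrete, here is a rough back-of-the-envelope sketch. The function names, the 8-bytes-per-value figure (a float64 column), the 50% headroom factor, and the dataset dimensions are all illustrative assumptions, not measurements from a real project.

```python
def estimated_size_gb(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Rough in-memory size of a dense numeric table, in gigabytes (10**9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9


def fits_in_ram(dataset_gb: float, ram_gb: float, headroom: float = 0.5) -> bool:
    """True if the dataset fits in the fraction of RAM we allow it to use.

    The headroom factor is an assumption: intermediate copies made during
    processing usually need memory beyond the raw data itself.
    """
    return dataset_gb <= ram_gb * headroom


# Hypothetical dataset: 2 billion rows, 10 float64 columns.
size_gb = estimated_size_gb(2_000_000_000, 10)
print(f"Estimated size: {size_gb:.0f} GB")
print("Fits on a 16 GB machine:", fits_in_ram(size_gb, ram_gb=16))
```

Under these assumptions the table needs roughly 160 GB, far beyond what a 16 GB machine can hold in memory at once.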

This is because these tools are designed primarily for single-node processing. As a result, they struggle to seamlessly scale across multiple machines.

Spark Basics

Before getting into any hands-on technical details about Spark, let’s cover some background:

  • What is big data?
  • What are distributed systems?
  • What is Hadoop?

What is big data?

Imagine we have a dataset that comfortably resides on a local computer, falling within the range of 0 to 32 gigabytes, or perhaps extending to 40-50 gigabytes, depending on the available RAM.

Even if the dataset expands to, say, 30 to 60 gigabytes, it remains more feasible to acquire enough RAM to accommodate the entire dataset in memory.

Moving beyond a single machine becomes imperative when the dataset grows too large for one computer or when the efficiency of centralized storage starts diminishing.

And here's where Spark becomes useful.

When we find ourselves considering Spark, we’ve reached a point where fitting all our data into the RAM of a single machine is no longer practical.

Spark, with its distributed computing capabilities, lets us address the challenges posed by increasingly substantial datasets by offering a scalable and efficient solution for big data processing.

We shall come back to Spark shortly. Before that, let’s spend some time understanding the differences between local and distributed computing.

Local vs. distributed computing

While the names of these methodologies are pretty much sufficient to understand what they do, let’s understand what they are in case you are new to them.

In a local system, which you are likely familiar with and use regularly, operations are confined to a single machine: one computer with its own RAM and hard drive.


In a distributed system, a primary computer, often referred to as the master node, distributes both data and computational tasks to additional computers in the network.

The critical contrast lies in the capabilities of these systems, which we shall discuss next.

Local System Limitations and Distributed System Advantages

In local systems, processing power and storage are confined to the resources available on the local machine, which are dictated by its number of cores and its memory and disk capacity.

Distributed systems, however, leverage the combined power of multiple machines, allowing for a substantial increase in processing cores and capabilities compared to a robust local machine.

For instance, as illustrated in the diagram, a local machine might have five cores.

However, in a distributed system, the strategy involves aggregating less powerful machines, i.e., servers with lower specifications, and distributing both data and computations across this network.

This approach harnesses the inherent strength of distributed systems.
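The idea of splitting data and computation across workers can be sketched on a single machine. The snippet below is only an illustration: threads stand in for separate machines, and the helper names (`split_into_partitions`, `partial_sum`, `distributed_sum`) are hypothetical, not part of Spark or any other library.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, n_partitions):
    """Split a list into n_partitions roughly equal, contiguous chunks."""
    size, remainder = divmod(len(data), n_partitions)
    partitions, start = [], 0
    for i in range(n_partitions):
        # The first `remainder` partitions take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def partial_sum(partition):
    """Work done by one 'worker' on its own slice of the data."""
    return sum(partition)


def distributed_sum(data, n_workers=4):
    """Sum data by combining per-partition results from several workers."""
    partitions = split_into_partitions(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    # The coordinating node only has to combine a handful of small results.
    return sum(partials)


numbers = list(range(1, 1_000_001))
total = distributed_sum(numbers, n_workers=4)
print(total)
```

Each worker sees only its own partition, and the coordinator combines small partial results. In a real cluster the partitions would live on different machines, which is what allows the total data size to exceed any single machine's memory.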


I often like to relate this idea to ensemble modeling in machine learning, wherein weak models come together to produce a powerful model.

Scaling becomes a pivotal advantage of distributed systems over local systems.

More specifically, scaling a distributed system just means adding more machines to the network. In other words, one can significantly enhance processing power by simply adding more machines.


What is Hadoop?

At its core, Hadoop is a distributed storage and processing framework designed to handle vast amounts of data across clusters (groups) of commodity hardware.
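Hadoop's processing side is commonly associated with the MapReduce model, in which each block of data is processed independently (map), intermediate results are grouped by key (shuffle), and each group is combined (reduce). The following is a plain-Python sketch of that pattern for counting words. It does not use any Hadoop API, and the function names and the two sample text blocks are made up for illustration.

```python
from collections import defaultdict


def map_phase(block):
    """Emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle_phase(mapped_pairs):
    """Group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Two hypothetical blocks, as if stored on two different machines.
blocks = ["big data needs big clusters", "clusters store data in blocks"]

# Each block could be mapped on the machine that holds it.
mapped = [pair for block in blocks for pair in map_phase(block)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts["big"], word_counts["data"], word_counts["clusters"])
```

Because the map step needs only the block it is given, the computation can be sent to wherever the data is stored rather than moving the data to one central machine.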
