Don't Stop at Pandas and Sklearn! Get Started with Spark DataFrames and Big Data ML using PySpark
Traditional data processing tools often fall short in big data projects, where the volume of data can run into hundreds of gigabytes.
This is because a typical machine has only a few gigabytes of RAM, so loading a dataset of that scale into memory for processing is simply not possible.
This roadblock in analyzing large datasets is not just a matter of limited computing power; it also stems from the limitations of traditional data analysis libraries like NumPy, Pandas, and Sklearn.
This is because these tools are designed primarily for single-node processing. As a result, they struggle to seamlessly scale across multiple machines.
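To make this concrete, here is a minimal sketch of what hitting that wall looks like with Pandas. The file name "transactions.csv" is hypothetical, standing in for a CSV far larger than the machine's RAM:

```python
import pandas as pd

# Hypothetical file: a CSV far larger than the machine's RAM.
# A plain pd.read_csv("transactions.csv") would try to load everything
# into memory at once and typically fail with a MemoryError.

# The usual single-machine workaround is chunked reading, which quickly
# becomes awkward for anything beyond simple aggregations:
total_rows = 0
for chunk in pd.read_csv("transactions.csv", chunksize=1_000_000):
    total_rows += len(chunk)  # each chunk is an ordinary in-memory DataFrame

print(total_rows)
```

Chunking works for simple row counts or sums, but joining, grouping, or training a model across chunks on a single machine gets painful fast, which is exactly the gap distributed tools fill.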
Spark Basics
Before getting into any technical hands-on details about Spark, let’s cover a few fundamentals:
What is big data?
Imagine we have a dataset that comfortably resides on a local computer, falling within the range of 0 to 32 gigabytes, or perhaps extending to 40-50 gigabytes, depending on the available RAM.
Even if the dataset expands to, say, 30 to 60 gigabytes, it is often still feasible to acquire enough RAM to accommodate the entire dataset in memory.
A distributed approach becomes imperative only when the dataset grows too voluminous for a single machine, or when the efficiency of centralized storage starts to diminish.
And here's where Spark becomes useful.
When we find ourselves considering Spark, we’ve reached a point where fitting all our data into the RAM of a single machine is no longer practical.
Spark, with its distributed computing capabilities, lets us address the challenges posed by increasingly substantial datasets by offering a scalable and efficient solution for big data processing.
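As a quick preview of what that looks like in practice, here is a minimal PySpark sketch, assuming a local pyspark installation and the same hypothetical "transactions.csv" file:

```python
from pyspark.sql import SparkSession

# Entry point to Spark's DataFrame API: create (or reuse) a SparkSession.
spark = SparkSession.builder.appName("getting-started").getOrCreate()

# Reads are lazy: Spark splits the file into partitions and distributes the
# work across the available cores/executors rather than loading it all at once.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

df.printSchema()
print(df.count())  # an action like count() triggers the distributed computation
```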
We shall come back to Spark shortly. Before that, let’s spend some time understanding the differences between local and distributed computing.
Local vs. distributed computing
While the names of these approaches are largely self-explanatory, let’s briefly define them in case you are new to them.
In a local system, which you are likely to be familiar with and use regularly, operations are confined to a single machine — a solitary computer with a unified RAM and hard drive.
In a distributed system, there exists a primary computer, often referred to as the master node, alongside the distribution of both data and computational tasks to additional computers in the network.
The critical contrast lies in the capabilities of these systems, which we shall discuss next.
Local System Limitations and Distributed System Advantages
In a local system, processing power and storage are confined to the resources of that single machine, dictated by its number of cores and its memory and disk capacity.
Distributed systems, however, leverage the combined power of multiple machines, allowing for a substantial increase in processing cores and capabilities compared to a robust local machine.
For instance, as illustrated in the diagram, a local machine might have five cores.
However, in a distributed system, the strategy involves aggregating less powerful machines, i.e., servers with lower specifications, and distributing both data and computations across this network.
This approach harnesses the inherent strength of distributed systems.
I often like to relate this idea to ensemble modeling in machine learning, wherein weak models come together to produce a powerful model.
Scaling becomes a pivotal advantage of distributed systems over local systems.
More specifically, scaling a distributed system simply means adding more machines to the network, and each machine added directly increases the available processing power.
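To give a sense of what "adding machines" means in practice with Spark, cluster size is largely a matter of configuration rather than code changes. The numbers below are purely illustrative, and the exact settings depend on the cluster manager (e.g., YARN or Kubernetes):

```python
from pyspark.sql import SparkSession

# Purely illustrative numbers; the point is that scaling out is mostly a
# matter of requesting more executors, not rewriting the analysis code.
spark = (
    SparkSession.builder
    .appName("scaling-demo")
    .config("spark.executor.instances", "10")  # how many executors to request
    .config("spark.executor.cores", "4")       # CPU cores per executor
    .config("spark.executor.memory", "8g")     # memory per executor
    .getOrCreate()
)
```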
What is Hadoop?
At its core, Hadoop is a distributed storage and processing framework designed to handle vast amounts of data across clusters (groups) of commodity hardware.