Building Your First Spark: Logistic Regression Model
Shailendra Singh Kathait
Co-Founder & Chief Data Scientist @ Valiance | Envisioning a Future Transformed by AI | Harnessing AI Responsibly | Prioritizing Global Impact |
Spark has recently been gaining traction, so I thought of providing a starting point to play with it. I have written some simple code for logistic regression to help with the transition.
I hope you find it interesting and a useful building block for learning Spark.
What Is Apache Spark?
Apache Spark is an open-source processing engine built around speed, ease of use, and analytics. It is a cluster computing platform designed to be fast and general-purpose. Spark is an alternative for processing large amounts of data that require low-latency processing, which a typical MapReduce program cannot provide. Spark performs at speeds up to 100 times faster than MapReduce for iterative algorithms or interactive data mining. Spark provides in-memory cluster computing for lightning-fast speed and supports Java, Scala, and Python.
Spark combines SQL, streaming, and complex analytics seamlessly in the same application to handle a wide range of data processing scenarios. Spark runs on top of Hadoop, Mesos, standalone, or in the cloud, and it can access diverse data sources such as HDFS, Cassandra, HBase, or S3.
At its core, Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Because the core engine of Spark is both fast and general-purpose, it powers multiple higher-level components specialized for various workloads, such as SQL or machine learning. These components are designed to inter-operate closely, letting you combine them like libraries in a software project.
Components of Spark
Spark Core Concepts
At a high level, a Spark application consists of a driver program that launches various parallel operations on a cluster. The driver program contains the main function of your application, which is then distributed to the cluster members for execution. The driver program uses a SparkContext object to access the computing cluster. In shell applications, the SparkContext is available by default through the sc variable.
A very important concept in Spark is the RDD (resilient distributed dataset). This is an immutable collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. An RDD can contain any type of object from Java, Scala, Python, or R, including user-defined classes. RDDs can be created in two ways: by loading an external dataset or by distributing a collection of objects such as a list or a set.
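For example, in the PySpark shell (where sc is already defined), both creation styles look like this; the file path below is only a hypothetical placeholder:
# Create an RDD by distributing an in-memory collection
numbers = sc.parallelize([1, 2, 3, 4, 5])
# Create an RDD by loading an external data set (hypothetical path)
lines = sc.textFile("data/sample.txt")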
After creation, we can perform two types of operations on RDDs:
- Transformations – construct a new RDD from an existing one
- Actions – compute a result based on an RDD
RDDs are computed lazily, that is, only when they are used in an action. Once Spark sees a chain of transformations, it can compute just the data needed for the result. By default, each time an action is run on an RDD, it is recomputed; if you need the RDD for multiple actions, you can ask Spark to persist it using RDD.persist().
You can use Spark from a shell session or as a standalone program. Either way you will have the following workflow:
- create input RDDs
- transform them using transformations
- ask Spark to persist them if needed for reuse
- launch actions to start parallel computation, which is then optimized and executed by Spark
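As a minimal sketch of that workflow in the PySpark shell (assuming sc is available):
lines = sc.parallelize(["spark is fast", "spark is general purpose", "hello world"])
# Transformation: build a new RDD with only the lines that mention "spark" (nothing runs yet)
spark_lines = lines.filter(lambda line: "spark" in line)
# Persist, since we will run more than one action on this RDD
spark_lines.persist()
# Actions: these trigger the actual parallel computation
print(spark_lines.count())
print(spark_lines.first())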
Brief intro on Logistic Regression
Logistic regression is a classification algorithm. Classification involves looking at data and assigning a class (or label) to it. Usually there is more than one class, but in our example we will be tackling binary classification, in which there are two classes: 0 or 1.
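For reference, logistic regression models the probability that an example x belongs to class 1 as

P(y = 1 \mid x) = \frac{1}{1 + e^{-(w^{\top} x + b)}}

and predicts class 1 when this probability is at least 0.5, and class 0 otherwise.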
PySpark
Most importantly for us, Spark supports a Python API that lets you write Python Spark jobs or interact with data on a cluster through a shell.
Steps to launch Spark and the Python shell:
1. Go to the folder where Spark is built:
cd /usr/local/spark
2. Launch the Python shell:
./bin/pyspark
OR
- To launch an IPython notebook:
IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark
After running the IPython command, the IPython (now called Jupyter) notebook will appear on screen. Let's get started...
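The original post loads its data set in a screenshot; as a placeholder, here is a minimal sketch that reads a comma-separated text file (hypothetical path) into an RDD named raw_data:
# Hypothetical path; substitute your own data set
raw_data = sc.textFile("data/sample_binary_classification.csv")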
You can preview the data by using:
#Preview data
raw_data.take(1)
Labeled Point
A labeled point is a local vector associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms, and the label is stored as a double. For binary classification, a label should be either 0 (negative) or 1 (positive).
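A couple of simple examples of labeled points in PySpark (the feature values are illustrative only):
from pyspark.mllib.regression import LabeledPoint
# A positive example (label 1.0) with a dense feature vector
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# A negative example (label 0.0)
neg = LabeledPoint(0.0, [2.0, 1.5, 0.0])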
Remember...
Spark provides specific functions to deal with RDDs whose elements are key/value pairs. They are used to perform aggregations and other processing by key.
We have used the map function to create key-value pairs. You can also notice the LabeledPoint output, which is why the LabeledPoint class was imported.
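The original code appears as a screenshot; the sketch below captures the same idea, assuming raw_data holds comma-separated lines with the label in the first column (an assumption about the data format, not something shown in the text):
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

def parse_line(line):
    # Assumed format: label in the first column, numeric features in the rest
    values = [float(x) for x in line.split(",")]
    return LabeledPoint(values[0], values[1:])

# map pairs each label with its feature vector as a LabeledPoint
parsed_data = raw_data.map(parse_line)

# Train the MLlib logistic regression model
model = LogisticRegressionWithLBFGS.train(parsed_data)

# Evaluate on the training data itself
labels_and_preds = parsed_data.map(lambda p: (p.label, model.predict(p.features)))
train_err = labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count() / float(parsed_data.count())
print("Training error = " + str(train_err))

For a fair evaluation, you would hold out a test set (for example with randomSplit) rather than scoring on the training data alone.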