Understanding why you need Python and Spark SQL when working with PySpark

Hi Everyone,

As I explore more about what data is and how data is ingested under Big Data, I am understanding more about how data shapes the offerings and services of software products across platforms and technologies, and the important role it plays in data-driven decision making.

First of all, I was trying to understand what formats of data get processed, and how, when the data lives on a cloud platform. Here is some of that information in depth.

Let's go..

First, most of us know that Hadoop was the original Big Data platform and that Hive served as the data warehouse in the early days, which is only about 10 to 12 years back. One of the major shifts in recent times is that Apache Spark, a distributed data processing engine, has emerged as one of the leading solutions being integrated with existing Big Data stacks across many companies. The reason is that Spark processes data in memory across a cluster, which is much faster than the disk-based MapReduce jobs that earlier Hadoop tools relied on, while it can still read from and write to HDFS (the Hadoop Distributed File System). Spark also ships with the PySpark API, which lets Python developers work with Spark's libraries. The older MapReduce-style tools are largely being replaced by Spark, PySpark and related tools, even though Hadoop storage and Hive warehouses are still in use.

When data is considered for processing, it is also important that there is no break in capturing the live online data coming from businesses like e-commerce, internet banking systems and other online services. For that purpose, most cloud-based solutions use streaming and ingestion tools such as Apache Kafka or Apache Flume to continuously collect the data streaming in from products that offer online services across various business domains.
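
To make that concrete, here is a minimal sketch of reading such a live stream with PySpark Structured Streaming. It assumes the Kafka connector package (spark-sql-kafka) is available on the Spark classpath, and the broker address "localhost:9092" and the topic name "orders" are placeholder values, not anything from the original post.

```python
# Minimal sketch: continuously read records from a Kafka topic with PySpark.
# "localhost:9092" and the "orders" topic are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamIngest").getOrCreate()

# Subscribe to a Kafka topic; each record arrives with binary key/value columns.
orders = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "orders")
          .load())

# Cast the payload to a string so it can be inspected or parsed further.
events = orders.selectExpr("CAST(value AS STRING) AS payload")

# Write each micro-batch to the console just to verify the stream is flowing.
query = (events.writeStream
         .format("console")
         .outputMode("append")
         .start())

query.awaitTermination()
```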

Challenges faced in Big Data Processing:

One of the most common challenges in Big Data processing is that you cannot change the servers overnight; it takes a certain amount of time to study and follow a stage-by-stage process when replacing Big Data servers that carry huge volumes of business data on a day-to-day basis. Even version changes in such tools make a lot of difference in supporting the day-to-day business of such enterprises.

For example, if your data warehouse or processing engine is running on, say, version 2.5, then even upgrading to version 2.6 or 3.0 takes a lot of planning and business analysis before the infrastructure change is made.

What is Apache Spark?

Apache Spark is basically an engine that lets users process large volumes of raw or unstructured business data and turn it into structured data, typically as part of ETL (Extract, Transform, Load) pipelines. With that structured data in hand, ML engineers and data scientists can feed it into machine learning models that predict business outcomes from data patterns, which helps higher management make data-driven decisions about the design of the product.

There are a few things that you need to understand about how Apache Spark works:

  1. What is Spark Core: Spark Core is the underlying data processing engine. It can read data from files such as .txt, .csv, .xml, .json and .dat and helps us create an RDD (Resilient Distributed Dataset). An RDD is immutable after its creation. Once the RDD is created, we can apply transformation functions and actions that give us structured data (a minimal RDD sketch appears after this list). This data can be further used for pattern analysis and other business analysis by data analysts and data scientists to offer valid, data-driven insights into business activities.
  2. PySpark with Spark: PySpark is the Python API for Spark. Originally the Spark library was made accessible through a SparkContext object; today PySpark opens a SparkSession, which in turn allows users to work with the Spark library and use its methods to perform various ETL operations on the raw data obtained from ingestion tools like Kafka or Flume. Spark organizes this data into DataFrames, which can also read from databases and warehouses like Greenplum, Snowflake, MySQL, MariaDB, Aurora, RDS, Redshift, Athena and Cassandra, and from file formats like .txt, .csv, .json and Avro. Spark also runs under a cluster manager, so incoming data can be processed in parallel across machines by executors that run the tasks triggered by actions.
  3. DataFrames in PySpark: A DataFrame is a distributed collection of data organized into named columns, much like a table, but it is not automatically registered as a database table. To query it with SQL, you first register the DataFrame as a temporary view or table; the end user can then use Spark SQL to write SQL-based queries, perform ETL operations on the data and hand the results to data analysts and data scientists.
  4. Where is Python in all of this?: Once the DataFrames are created within a PySpark session, someone who is strong in Python can write a Python program that registers the DataFrames as views and then uses Spark SQL queries to perform ETL on the gathered data, as shown in the DataFrame and Spark SQL sketch after this list.
  5. What are Catalyst and CBO in Spark?: Catalyst is Spark's query optimizer. It takes the DataFrame operations or SQL you write, builds a logical plan, optimizes it and converts it into a physical plan that ultimately executes over RDDs. The CBO (Cost Based Optimizer) goes a step further: using table statistics, it can generate multiple candidate plans and pick the cheapest one to perform the task (see the explain() sketch after this list).
  6. Why use TensorFlow? What are the alternatives to TensorFlow?: TensorFlow is not a monitoring or streaming server; it is an open source machine learning framework used to build and train models on data that has already been collected and prepared (the collection itself is handled by tools like Kafka and Flume mentioned above). TensorFlow works well with the open source, Python-based Big Data stack, which is why you often find it alongside Python Big Data solutions; PyTorch and scikit-learn are common alternatives. Other open source Python tools in this ecosystem include visualization libraries such as Seaborn, Matplotlib and Plotly and scientific/ML libraries such as SciPy, NumPy and scikit-learn.
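
Here is the minimal RDD sketch promised in item 1: create an RDD from a text file, apply transformations and trigger an action. The file name access_log.txt and its assumed space-separated format are hypothetical placeholders.

```python
# Minimal RDD sketch: read a text file, transform it, and run an action.
# "access_log.txt" and its assumed space-separated format are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDBasics").getOrCreate()
sc = spark.sparkContext   # the SparkContext behind the session

lines = sc.textFile("access_log.txt")   # creates an immutable RDD

# Transformations are lazy: nothing runs until an action is called.
status_codes = (lines
                .map(lambda line: line.split(" "))
                .filter(lambda fields: len(fields) > 2)
                .map(lambda fields: (fields[2], 1)))

# reduceByKey is a transformation; collect() is the action that triggers execution.
counts = status_codes.reduceByKey(lambda a, b: a + b).collect()
print(counts)
```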
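
And here is the DataFrame and Spark SQL flow described in items 2 to 4: load a file into a DataFrame, register it as a temporary view, and query it with SQL. The file sales.csv and its columns (region, amount) are assumptions for illustration only.

```python
# DataFrame + Spark SQL sketch: file -> DataFrame -> temp view -> SQL query.
# "sales.csv" with columns "region" and "amount" is a hypothetical input.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameToSQL").getOrCreate()

# Read a CSV file into a DataFrame, letting Spark infer the schema.
sales_df = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("sales.csv"))

# A DataFrame is not queryable by SQL until it is registered as a view.
sales_df.createOrReplaceTempView("sales")

# Now plain Spark SQL can be used for the transform step of ETL.
summary = spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC
""")

summary.show()
```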
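
For item 5, you can ask Spark to print the plans Catalyst produces and turn on the cost-based optimizer through configuration. The summary DataFrame here refers to the one built in the previous sketch; this is only an illustration of how to inspect the optimizer, not a tuning recipe.

```python
# Inspect the plans Catalyst generates for a DataFrame or SQL query.
summary.explain(True)   # parsed, analyzed, optimized logical and physical plans

# The cost-based optimizer relies on table statistics and is enabled via config.
spark.conf.set("spark.sql.cbo.enabled", "true")

# Statistics are gathered with ANALYZE TABLE (applies to tables in the catalog,
# not to temporary views), for example:
# spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR ALL COLUMNS")
```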

How does the Data Engineer role vary from that of a Data Analyst and a Data Scientist? Why do all these roles seem similar?

A data engineer builds the data pipelines, using Python programming in most situations, even if the company uses both open source and paid tools in its Big Data infrastructure. In certain cases the data engineer also works on ETL to convert unstructured data into structured data. But most of the time this part is taken up by the data analysts; hence data analysts and data scientists need to know a little bit of Python or SQL coding to survive in the industry.

Data analysts take the unstructured data, read it using processing tools like Spark through PySpark sessions, and convert it into structured data by performing ETL. Taking this structured data further, data analysts can write SQL queries to understand the patterns and then build dashboards of charts and graphs using Power BI and other visualization tools.

Data analysts can also perform ETL again on the structured data they receive from data warehouses. This is why the roles of data analyst, data engineer and data scientist often seem similar but are not the same: data engineers might do ETL to convert unstructured data into structured data, data analysts may perform ETL to bring out a certain pattern in the data based on the business requirement given to them by product managers, and data scientists may perform ETL on the structured data to obtain predictions from the probabilistic models the data is passed through.

This is one of the most common reasons why people feel there is an overlap in the roles and responsibilities of data engineers, data analysts and data scientists.

Freelancing data scientists and data analysts often end up converting unstructured data into structured data all by themselves, as there are no dedicated data engineers or data analysts available to them.

At the moment the structure of this field is not yet clearly defined, and the borders between the roles above are very thin. In the longer run these things will be more clearly defined and refined.

So, have you ever wondered what a Data Science team looks like in terms of process?

  1. Product Manager: Defines the product requirements based on the data and inputs given by the data analysts, and anticipates what should be built into the product based on inputs from a data scientist. Creates the business requirements for ML engineers to design AI capabilities, and the machine learning engineers then develop programs to build the feature into the product so that it delivers better services.
  2. Machine Learning Engineers: Take business requirements from the product manager and develop programs, usually in Python, to build the feature and intelligence into the product using machine learning models. This involves feature engineering, training and testing.
  3. Data Analyst: Sometimes takes unstructured data and converts it into structured data, and comes up with graphical representations of product usage patterns for data-driven decision making by product managers and higher management.
  4. Data Scientists: Sometimes take unstructured data, convert it into structured data and then run the data through predictive, probabilistic models to forecast future product needs and give unique insights into the business the product supports.
  5. Data Engineers: Develop the programs and CI/CD pipelines that keep existing data pipelines properly and continuously monitored, and create new data pipelines based on the business requirements.

What is the difference between Business Analyst and Data Analyst?

A business analyst is a person who is an expert in a certain domain such as banking, manufacturing, insurance or healthcare, and who interacts mostly with the development team, the testing team and the clients.

A data analyst, on the other hand, is a person who can analyze data in huge volumes and come up with meaningful comparisons that allow product managers to study and understand the product's behavior in the real world. A data analyst mostly interacts with the higher management.

Hope this information helps. More later, still digging in!

Thanks,

Swaroop



