Understanding why you need Python and Spark SQL when working with PySpark
Ganesha Swaroop B
|17+ yrs exp Software Testing|Author|Mentor|Staff SDET|Technical Writer|Technology Researcher|Java|Pytest|Python|Allure|ExtentReports|BDD|Jenkins|SME|Self-Taught Data Science and ML Engineer
Hi Everyone,
As I explore what data is and how it is ingested in Big Data systems, I am understanding more about how data shapes the offerings and services of software products across platforms and technologies, and the important role it plays in data-driven decision making.
To begin with, I wanted to understand which formats of data get processed, and how, when the data lives on a cloud platform. Here is some of that information in depth.
Let's go...
First, most of us know that Hadoop was the earlier Big Data platform and that Hive served as the data warehouse in those days, which was only 10 to 12 years ago. One of the major shifts in recent times is that Apache Spark, a data processing engine, has emerged as one of the leading solutions and is being integrated with existing Big Data stacks across many companies. A key reason is that Spark does its distributed processing in memory rather than through the disk-based MapReduce model of earlier tools, and its PySpark API lets Python developers work with Spark's libraries directly. Many of the older MapReduce-based workflows have since been replaced by Spark, PySpark, and related tooling.
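To make this concrete, here is a minimal sketch of what working with Spark from Python looks like. It assumes PySpark is installed locally, and the file name events.json is a hypothetical placeholder.

```python
# Minimal PySpark sketch: start a session and load semi-structured data.
# "events.json" is a hypothetical placeholder path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro-example").getOrCreate()

df = spark.read.json("events.json")  # read JSON into a DataFrame
df.printSchema()                     # Spark infers the schema automatically
df.show(5)                           # preview the first five rows

spark.stop()
```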
When data is considered for processing, it is also important that there is no break in monitoring the live online data coming from businesses such as e-commerce, internet banking systems, and the like. For that purpose, most cloud-based solutions use streaming platforms such as Apache Kafka or Apache Flume to continuously ingest and monitor data streaming from products that offer online services across various business domains.
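As a rough illustration of that streaming setup, here is a small Spark Structured Streaming sketch that subscribes to a Kafka topic. The broker address and topic name are hypothetical, and it assumes the spark-sql-kafka connector package is available to the Spark session.

```python
# Sketch: continuously read a live event stream from Kafka with Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "orders")                     # hypothetical topic
    .load()
)

# Kafka delivers the payload as binary, so cast it to a string column.
events = stream.selectExpr("CAST(value AS STRING) AS raw_event")

# Print the running stream to the console; a real pipeline would write
# to a durable sink such as a data lake table instead.
query = events.writeStream.format("console").start()
query.awaitTermination()
```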
Challenges faced in Big Data Processing:
One of the most common challenges in Big Data processing is that you cannot change servers overnight: replacing Big Data servers that carry huge volumes of business data on a day-to-day basis requires careful study and a stage-by-stage migration. Even version changes in such tools make a big difference to the day-to-day business of these enterprises.
For example: if you are using, let's assume, version 2.5 of a data warehouse such as Snowflake, then even upgrading to version 2.6 or 3.0 takes a lot of planning and business analysis before the infrastructure change is made.
What is Apache Spark?
Apache Spark is, at its core, an engine that lets users process large volumes of unstructured business data and produce structured data through ETL operations (Extract, Transform, Load). With the resulting structured datasets, ML engineers and data scientists can run existing large learning models that predict business outcomes from patterns in the data, supporting data-driven decision making in product design for higher management.
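Here is a minimal ETL sketch along those lines, assuming hypothetical column names (user_id, timestamp, event_type) and placeholder paths:

```python
# Extract-Transform-Load sketch in PySpark; paths and column names are
# hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw, semi-structured logs.
raw = spark.read.json("raw_logs/")

# Transform: drop bad rows and normalise into a structured table.
clean = (
    raw.filter(F.col("user_id").isNotNull())
       .withColumn("event_date", F.to_date("timestamp"))
       .select("user_id", "event_date", "event_type")
)

# Load: persist the structured result for downstream consumers.
clean.write.mode("overwrite").parquet("curated/events/")
```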
There are a few things you need to understand about how Apache Spark fits into the roles that use it.
How does the Data Engineer role differ from that of a Data Analyst and a Data Scientist? Why do all these roles seem similar?
A data engineer builds the data pipelines, in most situations using Python programming, even when the company uses both open-source and paid tools in its Big Data infrastructure. In certain cases the data engineer also works on ETL to convert unstructured data into structured data, but most of the time this part is taken up by the data analysts. Hence data analysts and data scientists also need to know some Python or SQL to survive in the industry.
Data analysts take the unstructured data, read it using processing tools such as Spark through PySpark sessions, and convert it into structured data by performing ETL. Taking this structured data further, data analysts can write SQL queries to understand the patterns and then build dashboards, in the form of charts and graphs, using Power BI and other visualization tools.
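A rough sketch of that SQL step, continuing with the same hypothetical curated table and columns from the ETL sketch above; pulling the small aggregated result into pandas (assuming pandas is installed) is one simple way to hand it off to a BI tool:

```python
# Sketch: query structured data with Spark SQL and export for a dashboard.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("analyst-sql").getOrCreate()

# Load the curated (structured) data; path and columns are hypothetical.
events = spark.read.parquet("curated/events/")
events.createOrReplaceTempView("events")

# Plain SQL to surface a pattern, e.g. daily counts per event type.
daily = spark.sql("""
    SELECT event_date, event_type, COUNT(*) AS event_count
    FROM events
    GROUP BY event_date, event_type
    ORDER BY event_date
""")

# A small aggregate can be pulled into pandas and exported to CSV,
# which Power BI or another visualization tool can then consume.
daily.toPandas().to_csv("daily_events.csv", index=False)
```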
Data analysts can perform ETL again on the structured data they receive from data warehouses. This is why the roles of data analyst, data engineer, and data scientist often seem similar but are not the same: a data engineer might run ETL to convert unstructured data into structured data; a data analyst might run ETL to bring out a certain pattern in the data based on a business requirement from product managers; and a data scientist might run ETL on the structured data to derive further predictions from the probabilistic models the data is passed through.
This is one of the most common reasons people feel there is an overlap in the roles and responsibilities of data engineers, data analysts, and data scientists.
Freelancing data scientists and data analysts often end up converting unstructured data into structured data all by themselves, since no dedicated data engineers or data analysts are available to them.
At the moment the structure of this field is not yet clearly defined, and the borders between these roles are very thin. In the longer run these things will be more clearly defined and refined.
So, have you ever wondered what a Data Science team looks like in terms of process?
What is the difference between a Business Analyst and a Data Analyst?
A business analyst is an expert in a certain domain such as banking, manufacturing, insurance, or healthcare, and interacts mostly with the development team, the testing team, and clients.
A data analyst, on the other hand, analyzes data in huge volumes and produces meaningful comparisons that allow product managers to study and understand the product's behavior in the real world. A data analyst mostly interacts with higher management.
Hope this information helps. More later, still digging in!!
Thanks,
Swaroop