What is the use of SQL, Power BI, Snowflake, PySpark, Apache Spark, JupyterLab, Spark SQL, Python in Data Science?
Ganesha Swaroop B
|17+ yrs exp Software Testing|Author|Mentor|Staff SDET|Technical Writer|Technology Researcher|Java|Pytest|Python|Allure|ExtentReports|BDD|Jenkins|SME|Self-Taught Data Science and ML Engineer
Hi Everyone,
Data Science is a vast field that encompasses many things. Trying to cover everything at once would be difficult and confusing, so let's go step by step.
For simplicity, Data Science can be divided into two parts: teams that work directly with data, and teams that build ML and AI systems.
Let's go!!
Teams interacting with data include the Data Engineer, who builds data pipelines (and the CI/CD automation around them) so that online data collected from various sources is structured into a proper format, cleared of unwanted records using SQL queries, and made ready to study for repeating patterns. The role also involves monitoring serving systems (for example, TensorFlow Serving) to keep track of the data being streamed into them as end users interact with the front end of the business. Extensive use of SQL queries to flush out unwanted data and format the rest is a core part of this job.
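To make this concrete, here is a minimal sketch of how a Data Engineer might clean raw event data with PySpark and Spark SQL from a JupyterLab notebook. The input path, column names, and filter rules are made up purely for illustration.

```python
# A minimal sketch: cleaning raw event data with PySpark and Spark SQL.
# The paths, column names, and filter conditions are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clean-events").getOrCreate()

# Load semi-structured data collected from various online sources.
raw = spark.read.json("s3://example-bucket/raw/events/")

# Register it as a temporary view so Spark SQL can query it.
raw.createOrReplaceTempView("raw_events")

# Use SQL to flush out unwanted records and keep a consistent schema.
cleaned = spark.sql("""
    SELECT user_id,
           event_type,
           CAST(event_time AS TIMESTAMP) AS event_time
    FROM raw_events
    WHERE user_id IS NOT NULL
      AND event_type <> 'heartbeat'
""")

# Write the formatted data out for the Data Analysts to pick up.
cleaned.write.mode("overwrite").parquet("s3://example-bucket/clean/events/")
```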
The Data Analyst is the next person in the chain: they analyze the formatted data handed over by the Data Engineers, try to identify unique usage patterns and insights, and convert that data into visual form using tools like Power BI (a data visualization tool). Power BI supports a wide variety of file formats, from Excel to CSV, and it can also connect safely to an Online Transaction Processing (OLTP) database hosted on cloud platforms like AWS/GCP/Azure and interact with it directly.
Snowflake is a data warehousing platform hosted in the cloud: you can store huge amounts of data in its warehouses and then pull that data from Snowflake into Power BI to generate visual analyses of the formatted data provided by the Data Engineers.
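Power BI's own Snowflake connector is configured through its UI, but if you prefer to pull the same data programmatically, the snowflake-connector-python package can run the query for you. This is only a sketch; the account, credentials, and table names are placeholders, not values from the article.

```python
# A minimal sketch using the snowflake-connector-python package.
# Account, credentials, warehouse, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account_identifier",
    user="your_user",
    password="your_password",
    warehouse="ANALYTICS_WH",
    database="SALES_DB",
    schema="PUBLIC",
)

cur = conn.cursor()
try:
    # Pull the formatted data prepared by the Data Engineers.
    cur.execute("SELECT region, SUM(amount) AS total FROM orders GROUP BY region")
    for region, total in cur.fetchall():
        print(region, total)
finally:
    cur.close()
    conn.close()
```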
In smaller companies, a single Data Scientist may wear all of these hats at various points, provided they can handle switching between roles.
Another data visualization tool available these days is Tableau, which can integrate with the R language for advanced analytics against data published to Tableau Server. It is an alternative to the Power BI and Snowflake combination.
The Data Scientist is typically someone with deep training in statistics and mathematics (often at PhD level) who works on understanding the visual analyses provided by the Data Analyst and, with the help of predictive models, gives a probabilistic direction in which the product can move to win more business, increase profits, and support data-driven decisions for the future. These people are also knowledgeable in Deep Learning, which requires an understanding of neural networks. They study the various ways humans learn and try to encode similar learning approaches into models so that, when data is processed, the outcome is a predictive probability. Calculus and probability form a major part of their expertise.
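To make the "predictive probability" idea concrete, here is a toy sketch of the kind of model a Data Scientist might fit. It uses scikit-learn and synthetic data, which are my illustrative choices rather than anything prescribed above.

```python
# A toy sketch of a predictive model whose output is a probability.
# scikit-learn and the synthetic data are illustrative choices only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Pretend features: e.g. sessions per week and average order value.
X = rng.normal(size=(200, 2))
# Pretend label: did the customer buy again? (1 = yes, 0 = no)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# For a new customer the model returns a probability, which the business
# can then use to make a data-driven decision.
new_customer = np.array([[1.2, -0.3]])
print(model.predict_proba(new_customer)[0, 1])  # probability of buying again
```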
Teams that work on ML and AI fall into a different category: they do not interact with data as much; instead, they work on building AI tools like ChatGPT and DALL-E through programming and fine-tuning. This requires a deep understanding of different kinds of machine learning models, such as LLMs (large language models). One of the languages used to build such AI tools is Python, so advanced Python programming is needed here.
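The article does not name a specific library for this kind of work; as one common starting point, the Hugging Face transformers package lets you load and experiment with a small pre-trained language model. A sketch, with the library and the gpt2 checkpoint being my own assumptions:

```python
# A minimal sketch of experimenting with a pre-trained language model.
# The transformers library and the gpt2 checkpoint are illustrative
# choices, not something prescribed by the article.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Generate a short continuation; fine-tuning such models on your own
# data is the more advanced step described above.
result = generator("Data Science is", max_new_tokens=20, num_return_sequences=1)
print(result[0]["generated_text"])
```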
So technically, to get into Data Science you need the following: Python, SQL, Apache Spark, PySpark, Spark SQL, JupyterLab, Power BI, and Snowflake.
I hope this information helps.
Since Python is the underlying language, tools that support it are naturally preferred. Data Engineers further use Python libraries such as PyTorch, NumPy, SciPy, Plotly, and Matplotlib, along with PySpark, the Python API for the Apache Spark framework.
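As a small taste of these libraries, here is a tiny NumPy + Matplotlib sketch of the kind of quick exploratory plot you might make in a JupyterLab notebook; the numbers are synthetic and purely for illustration.

```python
# A tiny sketch: quick exploratory plot in a JupyterLab notebook.
# The data here is synthetic, purely for illustration.
import numpy as np
import matplotlib.pyplot as plt

months = np.arange(1, 13)
signups = 100 + 15 * months + np.random.default_rng(0).normal(scale=20, size=12)

plt.plot(months, signups, marker="o")
plt.xlabel("Month")
plt.ylabel("Sign-ups")
plt.title("Synthetic monthly sign-ups")
plt.show()
```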
I will talk about the others, like Pandas, PyTorch, NumPy, SciPy, Matplotlib, and Plotly, in the next article, since they are not mandatory when starting out with Apache Spark, PySpark, Python, JupyterLab, and Spark SQL.
Thanks,
Swaroop