What is the use of SQL, Power BI, Snowflake, PySpark, Apache Spark, JupyterLab, Spark SQL, Python in Data Science?
Ganesha Swaroop B
|17+ yrs exp Software Testing|Author|Mentor|Staff SDET|Technical Writer|Technology Researcher|Java|Pytest|Python|Allure|ExtentReports|BDD|Jenkins|SME|Self-Taught Data Science and ML Engineer
Hi Everyone,
Data Science is a vast field that encompasses many things. Trying to cover everything at once would be difficult and confusing, so let's go step by step.
For simplicity, Data Science can be divided into two parts: teams that work directly with data, and teams that build ML and AI systems.
Let's go!!
Teams interacting with data include the Data Engineer, who builds data pipelines (and the CI/CD automation around them) so that online data collected from various sources is structured into a proper format, cleared of unwanted records using SQL queries, and made ready to study for repeating patterns. The role also involves monitoring serving systems (for example, TensorFlow Serving) to keep track of the data being streamed into them as end users interact with the front end of the business. Extensive use of SQL queries to flush out unwanted data and format the rest is a core part of this job.
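To make this concrete, here is a minimal sketch of how a Data Engineer might clean raw event data with PySpark and Spark SQL from a JupyterLab notebook. The input path, column names, and filter rules are made up purely for illustration.

```python
# A minimal sketch: cleaning raw event data with PySpark and Spark SQL.
# The paths, column names, and filter conditions are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clean-events").getOrCreate()

# Load semi-structured data collected from various online sources.
raw = spark.read.json("s3://example-bucket/raw/events/")

# Register it as a temporary view so Spark SQL can query it.
raw.createOrReplaceTempView("raw_events")

# Use SQL to flush out unwanted records and keep a consistent schema.
cleaned = spark.sql("""
    SELECT user_id,
           event_type,
           CAST(event_time AS TIMESTAMP) AS event_time
    FROM raw_events
    WHERE user_id IS NOT NULL
      AND event_type <> 'heartbeat'
""")

# Write the formatted data out for the Data Analysts to pick up.
cleaned.write.mode("overwrite").parquet("s3://example-bucket/clean/events/")
```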
The Data Analyst is the next person in the chain: they analyze the formatted data handed over by the Data Engineers, try to identify unique usage patterns and insights, and convert that data into visual form using tools like Power BI (a data visualization tool). Power BI supports a wide variety of file formats, from Excel to CSV, and it can also connect safely to an Online Transaction Processing (OLTP) database hosted on cloud platforms like AWS/GCP/Azure and interact with it directly.
Snowflake is a data warehousing platform hosted in the cloud: you can store huge amounts of data in its warehouses and then pull that data from Snowflake into Power BI to generate visual analyses of the formatted data provided by the Data Engineers.
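Power BI's own Snowflake connector is configured through its UI, but if you prefer to pull the same data programmatically, the snowflake-connector-python package can run the query for you. This is only a sketch; the account, credentials, and table names are placeholders, not values from the article.

```python
# A minimal sketch using the snowflake-connector-python package.
# Account, credentials, warehouse, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account_identifier",
    user="your_user",
    password="your_password",
    warehouse="ANALYTICS_WH",
    database="SALES_DB",
    schema="PUBLIC",
)

cur = conn.cursor()
try:
    # Pull the formatted data prepared by the Data Engineers.
    cur.execute("SELECT region, SUM(amount) AS total FROM orders GROUP BY region")
    for region, total in cur.fetchall():
        print(region, total)
finally:
    cur.close()
    conn.close()
```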
In smaller companies, a single Data Scientist may wear all of these hats at various points, provided they can handle switching between roles.
Another data visualization tool available these days is Tableau, which can integrate with the R language for advanced analytics against data published to Tableau Server. It is an alternative to the Power BI and Snowflake combination.
The Data Scientist is typically someone with deep training in statistics and mathematics (often at PhD level) who works on understanding the visual analyses provided by the Data Analyst and, with the help of predictive models, gives a probabilistic direction in which the product can move to win more business, increase profits, and support data-driven decisions for the future. These people are also knowledgeable in Deep Learning, which requires an understanding of neural networks. They study the various ways humans learn and try to encode similar learning approaches into models so that, when data is processed, the outcome is a predictive probability. Calculus and probability form a major part of their expertise.
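To make the "predictive probability" idea concrete, here is a toy sketch of the kind of model a Data Scientist might fit. It uses scikit-learn and synthetic data, which are my illustrative choices rather than anything prescribed above.

```python
# A toy sketch of a predictive model whose output is a probability.
# scikit-learn and the synthetic data are illustrative choices only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Pretend features: e.g. sessions per week and average order value.
X = rng.normal(size=(200, 2))
# Pretend label: did the customer buy again? (1 = yes, 0 = no)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)

# For a new customer the model returns a probability, which the business
# can then use to make a data-driven decision.
new_customer = np.array([[1.2, -0.3]])
print(model.predict_proba(new_customer)[0, 1])  # probability of buying again
```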
Teams that work on ML and AI fall into a different category: they do not interact with data as much; instead, they work on building AI tools like ChatGPT and DALL-E through programming and fine-tuning. This requires a deep understanding of different kinds of machine learning models, such as LLMs (large language models). One of the languages used to build such AI tools is Python, so advanced Python programming is needed here.
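The article does not name a specific library for this kind of work; as one common starting point, the Hugging Face transformers package lets you load and experiment with a small pre-trained language model. A sketch, with the library and the gpt2 checkpoint being my own assumptions:

```python
# A minimal sketch of experimenting with a pre-trained language model.
# The transformers library and the gpt2 checkpoint are illustrative
# choices, not something prescribed by the article.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Generate a short continuation; fine-tuning such models on your own
# data is the more advanced step described above.
result = generator("Data Science is", max_new_tokens=20, num_return_sequences=1)
print(result[0]["generated_text"])
```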
So technically, to get into Data Science you need the following: Python, SQL, Apache Spark, PySpark, Spark SQL, JupyterLab, Power BI, and Snowflake.
I hope this information helps.
Since Python is the underlying language, tools that support it are naturally preferred. Data Engineers further use Python libraries such as PyTorch, NumPy, SciPy, Plotly, and Matplotlib, along with PySpark, the Python API for the Apache Spark framework.
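As a small taste of these libraries, here is a tiny NumPy + Matplotlib sketch of the kind of quick exploratory plot you might make in a JupyterLab notebook; the numbers are synthetic and purely for illustration.

```python
# A tiny sketch: quick exploratory plot in a JupyterLab notebook.
# The data here is synthetic, purely for illustration.
import numpy as np
import matplotlib.pyplot as plt

months = np.arange(1, 13)
signups = 100 + 15 * months + np.random.default_rng(0).normal(scale=20, size=12)

plt.plot(months, signups, marker="o")
plt.xlabel("Month")
plt.ylabel("Sign-ups")
plt.title("Synthetic monthly sign-ups")
plt.show()
```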
I will talk about the others, like Pandas, PyTorch, NumPy, SciPy, Matplotlib, and Plotly, in the next article, since they are not mandatory when starting out with Apache Spark, PySpark, Python, JupyterLab, and Spark SQL.
Thanks,
Swaroop