A non-technical guide to Data Science
Snigdha Kakkar
For beginners, the terms data science and machine learning can seem daunting. While most articles on data science and machine learning focus on their applications in diverse industries and are written primarily for developers and practitioners, this article aims to explain the core concepts of data science and machine learning for a non-technical reader who is interested in how businesses derive data-driven insights.
Data Science is basically a set of fundamental principles that guide the extraction of knowledge from data.
Now, your data source could be a file, a database, a website, forms or surveys, an API, or a device. Whenever we extract data from an API, an underlying request-response process takes place, as shown in the figure below: the client sends a request with certain additional information or conditions to the API, and the server responds.
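To make this concrete, here is a minimal sketch of such a request-response exchange in Python using the requests library; the endpoint URL and the query parameters are hypothetical.

```python
# A minimal sketch of a client-server API request using the "requests" library;
# the endpoint URL and its parameters are hypothetical.
import requests

# The client sends a request with additional information / conditions as query parameters
response = requests.get(
    "https://api.example.com/v1/sales",      # hypothetical API endpoint
    params={"region": "EU", "year": 2023},   # extra conditions sent to the API
    timeout=10,
)

# The server responds, typically with a status code and JSON data
if response.status_code == 200:
    data = response.json()
    print(f"Received {len(data)} records")
else:
    print(f"Request failed with status {response.status_code}")
```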
By common estimates, roughly 20% of a data scientist's time is spent gathering and collecting data sets, and around another 60% goes into cleaning and organizing that data. Exploring and processing data is therefore a critical stage in drawing meaningful insights.
This particular stage of exploring and processing data is further broken down into four stages -
Now, Exploratory Data Analysis is typically done in five ways (a short pandas sketch follows this list):
- Basic structure (Number of rows or observations, Number of columns / features, Column data types, Exploring the head / tail of data)
- Summary statistics (Numerical: Centrality measures, such as mean, median, mode of data sets; Dispersion measures, such as variance, standard deviation, range, percentiles. Categorical: Total count, Unique count, Category count and proportions, Per category statistics)
- Distributions (Univariate distributions: Histogram, Kernel Density Estimation plot; Bivariate distributions: Scatter plot)
- Grouping and aggregation of data using certain common conditions
- Using crosstabs or pivots to classify data related to multiple features
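As a rough illustration of the five methods above, here is a minimal pandas sketch; the file name sales.csv and its column names are hypothetical.

```python
# A minimal sketch of the five EDA steps using pandas;
# "sales.csv" and its columns ("category", "revenue", "price", "units", "region") are hypothetical.
import pandas as pd

df = pd.read_csv("sales.csv")

# 1. Basic structure
print(df.shape)        # number of rows and columns
print(df.dtypes)       # column data types
print(df.head())       # first few observations

# 2. Summary statistics
print(df.describe())                    # mean, std, percentiles for numerical columns
print(df["category"].value_counts())   # counts per category for a categorical column

# 3. Distributions (requires matplotlib for plotting)
df["revenue"].plot(kind="hist")                 # univariate: histogram
df.plot(kind="scatter", x="price", y="units")   # bivariate: scatter plot

# 4. Grouping and aggregation
print(df.groupby("region")["revenue"].mean())

# 5. Crosstabs / pivots across multiple features
print(pd.crosstab(df["region"], df["category"]))
```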
In my next article, I will dig deeper into each of these Exploratory Data Analysis methods with examples.
At a high level, exploratory data analysis sets the stage for further data cleansing and munging. Under data munging (or wrangling), we treat missing values in certain columns and also work with outliers. The reasons for such missing entries include an erroneous data entry process, non-availability from the source, or equipment error. Because missing items and outliers can skew our results, we either delete them or impute them with adequate measures (mean imputation, median imputation, mode imputation, forward/backward fill, or a predictive model) before further analysis. Outliers are quite easy to detect using a histogram, a box plot, or a scatter plot. Outliers are treated in the following ways (a short pandas sketch follows this list):
- Removal
- Transformation
- Binning
- Imputation
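Here is a minimal pandas sketch of how missing values and outliers might be treated; the file customers.csv, the columns age and income, and the IQR rule used to flag outliers are all illustrative assumptions.

```python
# A minimal sketch of missing-value imputation and outlier treatment in pandas;
# "customers.csv" and the columns "age" and "income" are hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")

# Missing-value imputation: fill gaps with a central measure or a neighbouring value
df["age"] = df["age"].fillna(df["age"].median())   # median imputation
df["income"] = df["income"].ffill()                # forward fill

# Outlier detection with a simple IQR rule (the same points would stand out on a box plot)
q1, q3 = df["income"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)

# Treatment option 1: removal
df_removed = df[df["income"] <= upper]

# Treatment option 2: capping (imputation at the boundary value)
df["income_capped"] = df["income"].clip(upper=upper)
```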
Next, we move towards Feature Engineering. It is the process of transforming raw data into more representative features in order to build better predictive models. Some data scientists say that feature engineering is an art; it requires domain as well as technical knowledge. The three key ways feature engineering is done are transformation of features, creation of features, and selection of features. Data scientists often use this stage to convert all the categorical features to numerical ones, since many predictive models in machine learning do not work with categorical features directly. For this purpose, we use categorical feature encoding. There are several methods for such encoding. One of the easiest is binary encoding, wherein a category such as Gender is broken down into Is_Male and Is_Female columns populated with binary values of 0 or 1. A second way is label encoding, wherein we encode categorical levels such as Low, Medium and High as 1, 2 and 3 respectively. If you use Python for building such predictive models, a convenient way to create such features is the One-Hot Encoding method.
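To illustrate, here is a minimal pandas sketch of label encoding and one-hot encoding; the example DataFrame and its columns are made up for demonstration.

```python
# A minimal sketch of categorical feature encoding with pandas;
# the example DataFrame and its columns are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "priority": ["Low", "High", "Medium", "Low"],
})

# Label encoding: map ordered categories to integers
df["priority_encoded"] = df["priority"].map({"Low": 1, "Medium": 2, "High": 3})

# One-hot encoding: one binary (0/1) column per category, e.g. is_Male / is_Female
df = pd.get_dummies(df, columns=["gender"], prefix="is")

print(df)
```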
Finally, the stage is set for building predictive models and advanced visualizations. One can start by building baseline models and then improve them through fine-tuning.
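As an example of what a baseline model might look like, here is a minimal scikit-learn sketch; the file prepared_data.csv and the target column churned are hypothetical placeholders for the output of the earlier stages.

```python
# A minimal sketch of a baseline predictive model with scikit-learn;
# "prepared_data.csv" and the "churned" target column are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("prepared_data.csv")   # assumed output of the earlier stages
X = df.drop(columns=["churned"])        # features
y = df["churned"]                       # binary target

# Hold out part of the data to evaluate the model on unseen examples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A simple baseline classifier; later iterations fine-tune or replace it
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

baseline_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```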
My next article will delve deeper into each of these stages, as there is so much to discover in data science and a lot more to experience. I hope you enjoy the journey, no matter when you start!