An Intro to The Industry of the 21st Century - Data Science

We live in a world full of data: roughly 2.5 quintillion bytes are produced daily (that’s 2.5 followed by a staggering 18 zeros!). This is why about 80% of firms across the globe are investing a large part of their earnings into building a skillful data analytics division. Yet businesses use only about 5% of the information available to them, because 80 to 90% of their data is unstructured, making it nearly impossible to organize and mine for insights. Data science is one of the only data disciplines that deals with unstructured data, making it one of the most valuable jobs in the industry.

So what is data science? The sexiest job of the 21st century.


What Is Data Science?


Data science is the field of study that combines mathematics, computer science, and specific domain knowledge to derive meaningful information from data. Data scientists use machine learning algorithms (programs that imitate human learning by finding patterns in data) to create predictive models (models that predict likely future outcomes from historical and existing data), helping extract valuable information from both structured and unstructured data so businesses can make better decisions. Getting valuable results from data is a lengthy process known as the data science lifecycle.



How Does Data Science Work??

1.) Identify and understand the specific problem

Creating a specific and clear problem statement is one of the first and most critical steps in any data science project. Many companies are too vague when defining data problems, such as:

  1. I want to increase the revenues of my company.
  2. I want to predict stock prices.
  3. I want to recommend personalized products to customers on my website.

So, it's the data scientist's job to communicate actively in meetings and ask the right questions to create a clear, goal-oriented problem statement.

Here’s an example of a specific and well-defined problem statement that we will be using throughout the article:

I want to predict seniors' falls before they happen.

Having a well-defined problem statement gives data scientists a clear direction on which sources to collect data from.


2.) Data Collection and Cleansing

Data collection is the process of gathering relevant information from a variety of sources. Depending on the problem being solved, data collection falls into two categories.

  1. Primary Data Collection: When you have a unique problem where no public data is available, new data must be collected through surveys and interviews.

(Image: Example of a dog image dataset labeled with the breed of the dog.)


  2. Secondary Data Collection: This is data from openly available sources such as GitHub and Kaggle.


For our problem statement on predicting senior falls, you can collect data from online sources such as PointClickCare and also interview seniors in senior homes.


After collecting data, one of the lengthiest and most tedious steps in the data science lifecycle comes into play: data cleansing. Data comes in a variety of formats and can be sorted into one of two categories: structured and unstructured. The skill of working with unstructured data is largely exclusive to data scientists, since it requires understanding both the topic of the data and how the data points relate to each other. When multiple data sources are combined, records can be incorrect, corrupted, incorrectly formatted, duplicated, or incomplete, which can lead to inaccurate models and insignificant variables being chosen for statistical analysis.
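A minimal sketch of what cleansing might look like for our senior-falls example, using Pandas. The column names and values here are invented for illustration, not from a real dataset:

```python
import pandas as pd

# Hypothetical records merged from two sources; names and values are invented.
df = pd.DataFrame({
    "resident_id": [1, 1, 2, 3, 3],
    "steps_per_day": [3200, 3200, None, 4100, 4100],
    "blood_pressure": ["120/80", "120/80", "135/90", None, None],
})

df = df.drop_duplicates()                 # remove exact duplicate rows
df = df.dropna(subset=["steps_per_day"])  # drop rows missing a key measurement
df["blood_pressure"] = df["blood_pressure"].fillna("unknown")  # flag missing values

print(df)
```

Real cleansing pipelines add many more checks (type coercion, range validation, unit normalization), but the pattern of deduplicating, dropping, and flagging shown here is the core of it.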


3.) Exploratory Data Analysis

After data collection and cleanup, we are finally able to perform data analysis and build familiarity with the data and see its potential. Data can be understood and analyzed through statistical and visualization methods which can be done through excellent open-source data science libraries.

Here are some examples:

NumPy - https://numpy.org/


  • Excellent tool for performing data analysis due to its strong and fast numerical computations with arrays and functions, e.g. computing the average or standard deviation of a set of values.
  • Contains multidimensional arrays that can hold several columns of data at once.
  • Ideal for machine learning due to its linear algebra capabilities.
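A quick sketch of those NumPy capabilities applied to our example; the step counts below are invented numbers, purely for illustration:

```python
import numpy as np

# Hypothetical daily step counts for five residents (invented values).
steps = np.array([3200, 4100, 2800, 5000, 3600])

print(steps.mean())  # average steps per day
print(steps.std())   # standard deviation

# A 2-D array can hold several measurements per resident at once,
# e.g. [steps_per_day, systolic_blood_pressure].
readings = np.array([[3200, 120], [4100, 135], [2800, 110]])
print(readings.mean(axis=0))  # column-wise averages
```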


Pandas - https://pandas.pydata.org/

Uses fast and flexible data structures (Series and DataFrames) that are designed to make working with structured, tabular data very easy.
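A small sketch of Pandas in action for our seniors example. The columns and values are hypothetical, just to show the DataFrame workflow:

```python
import pandas as pd

# A DataFrame is a labeled, tabular structure; these rows are invented.
seniors = pd.DataFrame({
    "age": [72, 85, 78, 91],
    "steps_per_day": [4100, 2200, 3600, 1500],
})

print(seniors.describe())            # summary statistics per column
print(seniors[seniors["age"] > 80])  # filter rows with a boolean mask
```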


Matplotlib - https://matplotlib.org/


  • Extensively used for data visualization due to the graphs and plots it produces.
  • Applications: correlation analysis of variables, outlier detection using a scatter plot.
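A minimal sketch of the scatter-plot use case mentioned above. The age and fall counts here are invented data points, and the `Agg` backend is used so the figure renders to a file without needing a display:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display required
import matplotlib.pyplot as plt

# Hypothetical data: age vs. number of recorded falls per year.
ages = [65, 70, 75, 80, 85, 90]
falls = [0, 1, 1, 2, 4, 5]

fig, ax = plt.subplots()
ax.scatter(ages, falls)
ax.set_xlabel("Age")
ax.set_ylabel("Falls per year")
ax.set_title("Age vs. fall frequency (illustrative data)")
fig.savefig("age_vs_falls.png")
```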

In our example of predicting seniors' falls, we can explore important factors such as the number of steps taken per day, blood pressure, medications taken, gait, and so on.


Overall, exploratory data analysis is an important step since it helps us understand the data better so we can make a better model selection.


4.) Data Modeling

Data modeling is the process of producing a descriptive diagram of the relationships between the key data points used and stored to reach a solution. Probability and inferential statistics are used to establish relationships between variables in the data.


For the seniors' data, if we propose that there’s a relationship between age and the risk of falling, we could model the data in a graph that might look like this:

(Image: Risk of falls & age relationship.)
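One simple way to quantify such a relationship is to fit a linear trend with NumPy. The observations below are invented for illustration; a real model would need far more data and likely a more suitable technique (e.g. logistic regression for fall risk):

```python
import numpy as np

# Hypothetical observations: age and yearly fall count (invented values).
ages = np.array([65, 70, 75, 80, 85, 90])
falls = np.array([0, 1, 1, 2, 4, 5])

# Fit a simple linear trend: falls ≈ slope * age + intercept.
slope, intercept = np.polyfit(ages, falls, 1)
print(f"estimated extra falls per year of age: {slope:.2f}")
```

A positive slope would support the proposed relationship that fall risk rises with age.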


5.) Data Communication

This is the final step, where results from the analysis are presented to stakeholders. Findings are usually presented to a non-technical audience such as the marketing team or business executives, so you must explain how you reached a specific conclusion. Results need to be communicated simply; graphs and presentations are used to convey them, and this is where the Python libraries above come into play.

  1. Know your audience and speak their language.
  2. Focus on values and outcomes.
  3. Communicate assumptions and limitations.


Key Takeaways:

  1. Data science is a very important field whose demand will increase as companies continue to produce more data and become more data-driven in their decision-making.
  2. When implementing data science in a company, we go through a process in which, at a high level, we choose a specific problem statement, gather data, clean it, analyze it, and model it.
  3. There are lots of resources out there that make it easy to model and analyze data, such as Matplotlib, NumPy, and Pandas!


Thanks so much for taking the time to read my article! If you have any questions comment down below or email me at [email protected] .

Until next time...
