An Intro to the Industry of the 21st Century - Data Science
We live in a world full of data, where 2.5 quintillion bytes are produced daily (that’s 25 followed by a staggering 17 zeros!). This is why about 80% of firms across the globe are investing a large part of their earnings in building a skilled data analytics division. Yet businesses can use only about 5% of the information available to them, because 80 to 90% of their data is unstructured, which makes it very hard to organize and mine for the insights you need. Data science is one of the few data roles that deals with unstructured data, which makes it one of the most valuable jobs in the industry.
So what is data science, the so-called sexiest job of the 21st century?
What Is Data Science?
Data science is the field of study that combines mathematics, computer science, and specific domain knowledge to derive meaningful information from data. Data scientists use machine learning algorithms (programs that learn patterns from data rather than following hand-written rules) to create predictive models (models that estimate likely future outcomes from historical and existing data). These models help extract valuable information from both structured and unstructured data so businesses can make better decisions. Getting valuable results from data is a lengthy, multi-step process known as the data science lifecycle.
How Does Data Science Work?
1.) Identify and understand the specific problem
Creating a specific, clear problem statement is one of the first and most critical steps in any data science project. Many companies are too vague when defining data problems, such as:
So it's the data scientist's job to communicate actively in meetings and ask the right questions to arrive at a clear, goal-oriented problem statement.
Here’s an example of a specific and well-defined problem statement that we will be using throughout the article:
I want to predict seniors' falls before they happen.
Having a well-defined problem statement gives data scientists a clear direction on which sources to collect data from.
2.) Data Collection and Cleansing
Data collection is the process of gathering relevant information from a variety of sources. Depending on the problem being solved, data collection methods fall into two categories.
For our problem statement on predicting senior falls, you can collect data from online sources such as PointClickCare and also by interviewing residents of senior homes.
After collecting data, one of the lengthiest and most tedious steps in the data science lifecycle comes into play: data cleansing. Data comes in a variety of formats and can be sorted into one of two categories: structured and unstructured. Working with unstructured data is a distinguishing skill of data scientists, because it requires understanding both the subject matter and how the pieces of data relate to each other. When multiple data sources are combined, records can be incorrect, corrupted, badly formatted, duplicated, or incomplete, which can lead to inaccurate models and to insignificant variables being chosen for statistical analysis.
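As a rough illustration of the cleansing problems described above (duplicates, inconsistent formatting, and incomplete values), here is a minimal pandas sketch. The column names, the sample records, and the `clean_records` helper are all hypothetical and invented for this example; real sources will need their own rules.

```python
import pandas as pd

# Hypothetical raw records merged from two sources; every value is invented.
raw = pd.DataFrame(
    {
        "resident_id": [101, 102, 102, 103, 104],
        "age": ["82", "79", "79", "n/a", "91"],
        "steps_per_day": [3200, 4100, 4100, None, 1500],
        "fell_last_year": ["yes", "No", "No", "YES", "no"],
    }
)


def clean_records(df: pd.DataFrame) -> pd.DataFrame:
    """Return a de-duplicated, consistently typed copy of the raw records."""
    out = df.drop_duplicates().copy()
    # Turn unparseable ages such as "n/a" into missing values (NaN).
    out["age"] = pd.to_numeric(out["age"], errors="coerce")
    # Normalize inconsistent yes/no spellings into real booleans.
    out["fell_last_year"] = out["fell_last_year"].str.strip().str.lower().eq("yes")
    # Drop rows that are still missing a value needed for the analysis.
    out = out.dropna(subset=["age", "steps_per_day"])
    return out.reset_index(drop=True)


clean = clean_records(raw)
print(clean)
```

Dropping incomplete rows is only one option. Depending on the project, it may be better to fill in missing values or go back to the source, since every dropped row is lost information.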
3.) Exploratory Data Analysis
After data collection and cleanup, we can finally analyze the data, build familiarity with it, and see its potential. Data can be understood and analyzed through statistical and visualization methods, many of which are available in excellent open-source data science libraries.
Here are some examples:
NumPy - https://numpy.org/
Pandas - https://pandas.pydata.org/
Provides fast, flexible data structures designed to make working with structured data easy.
Matplotlib - https://matplotlib.org/
In our example of predicting seniors' falls, we can examine important factors such as the number of steps taken per day, blood pressure, medications taken, and gait.
Overall, exploratory data analysis is an important step because it helps us understand the data better, so we can select a better model.
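To make this step more concrete, here is a small sketch of what a first pass over the seniors' data might look like with pandas and NumPy. The table is hypothetical (eight invented residents with made-up column names), so the numbers only illustrate the workflow and say nothing about real fall risk.

```python
import numpy as np
import pandas as pd

# Hypothetical, invented measurements for eight residents.
residents = pd.DataFrame(
    {
        "age": [72, 75, 79, 81, 84, 86, 90, 93],
        "steps_per_day": [5200, 4800, 4100, 3900, 2500, 2700, 1600, 1200],
        "systolic_bp": [128, 135, 131, 142, 150, 138, 155, 149],
        "fell_last_year": ["no", "no", "no", "yes", "yes", "no", "yes", "yes"],
    }
)

# Summary statistics (count, mean, spread, quartiles) for each numeric column.
print(residents[["age", "steps_per_day", "systolic_bp"]].describe())

# Compare average values between residents who fell and those who did not.
group_means = residents.groupby("fell_last_year")[["age", "steps_per_day"]].mean()
print(group_means)

# Pearson correlation between age and daily steps, computed with NumPy.
age_steps_corr = np.corrcoef(residents["age"], residents["steps_per_day"])[0, 1]
print(f"correlation(age, steps_per_day) = {age_steps_corr:.2f}")
```

Summaries like these help decide which variables look worth carrying into a model and which ones need a closer look first.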
4.) Data Modeling
Data modeling is the process of producing a descriptive diagram of the relationships between the key data points used and stored in the database to reach the solution. To build relationships between variables in the data, probability and inferential statistics are used.
For the seniors' data, we can ask whether there is a relationship between age and the risk of falling, and then model the data as a graph that could look like this:
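One way to explore a relationship like this in code is a simple logistic regression, which estimates the probability of a fall as a function of age. The article does not name a specific model, so this is only one possible choice, sketched here with NumPy and plain gradient descent. The observations and the helper names (`fit_logistic`, `predict_fall_probability`) are hypothetical and invented for illustration.

```python
import numpy as np

# Hypothetical, invented observations: age and whether the resident fell (1) or not (0).
ages = np.array([72, 75, 79, 81, 84, 86, 90, 93], dtype=float)
fell = np.array([0, 0, 0, 1, 1, 0, 1, 1], dtype=float)


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(x, y, learning_rate=0.1, n_iter=5000):
    """Fit P(y=1) = sigmoid(w * x_std + b) by gradient descent on the log-loss."""
    mean, std = x.mean(), x.std()
    x_std = (x - mean) / std  # standardize so gradient descent behaves well
    w, b = 0.0, 0.0
    for _ in range(n_iter):
        error = sigmoid(w * x_std + b) - y  # predicted probability minus label
        w -= learning_rate * np.mean(error * x_std)
        b -= learning_rate * np.mean(error)
    return w, b, mean, std


def predict_fall_probability(age, model):
    """Estimate the fall probability for a given age using the fitted model."""
    w, b, mean, std = model
    return sigmoid(w * (age - mean) / std + b)


model = fit_logistic(ages, fell)
for age in (70, 80, 90):
    probability = predict_fall_probability(age, model)
    print(f"age {age}: estimated fall probability {probability:.2f}")
```

With only eight invented data points, the output is purely illustrative. A real project would use far more data, more variables than age alone, and a held-out test set to check the model.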
5.) Data Communication
This is the final step, where the results of the analysis are presented to stakeholders. Findings are usually presented to a non-technical audience, such as the marketing team or business executives, so you must explain your findings and how you reached each conclusion in simple terms. Graphs and presentations are used to convey results, and this is where the Python libraries mentioned above come into play.
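As an example of the kind of graph a non-technical audience can read at a glance, here is a minimal Matplotlib sketch of a labeled bar chart. The age bands, the percentages, and the output file name are hypothetical placeholders, not results from real data.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt

# Hypothetical, invented summary: share of residents in each age band who fell.
age_bands = ["70-79", "80-89", "90+"]
fall_rate_percent = [10, 25, 45]

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(age_bands, fall_rate_percent, color="steelblue")

# Write the value above each bar so nobody has to read it off the axis.
for position, rate in enumerate(fall_rate_percent):
    ax.text(position, rate + 1, f"{rate}%", ha="center")

ax.set_xlabel("Age band (years)")
ax.set_ylabel("Residents who fell (%)")
ax.set_title("Falls by age band (illustrative data)")
ax.set_ylim(0, 60)
fig.tight_layout()

# Save the chart as an image that can be dropped into a slide deck.
output_path = Path("fall_rate_by_age.png")
fig.savefig(output_path, dpi=150)
plt.close(fig)
```

A plain title, labeled axes, and values printed on the bars usually do more for a business audience than a statistically richer but denser plot.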
Key Takeaways:
Thanks so much for taking the time to read my article! If you have any questions, comment down below or email me at [email protected].
Until next time...