The Data Science Process
Daniel Wanjala
Machine Learning Engineer | Data Intelligence | FinTech | Predictive analysis | Deep Learning | Robotics | Candidate for a Master's degree.
Introduction
Just as it's important to understand the kinds of problems that can be solved by data science, it's also important to have a sense of the process used to conduct data science. In this lesson, we'll outline the lifecycle of a typical data science project - from business understanding through data visualization.
Objectives
You will be able to:
· Describe the full data science process
The Data Science Process
There is much more to data science than just selecting, applying, and tuning Machine Learning algorithms. A data science project will often include the following stages:
· Business understanding / domain knowledge
· Data mining
· Data cleaning
· Data exploration
· Feature engineering
· Predictive modeling
· Data visualization
In this section, you will go through each of these stages and see what is involved.
Business Understanding / Domain Knowledge
Before trying to solve a data-related problem, it is important that a Data Scientist/Analyst has a clear understanding of the problem domain and the kinds of question(s) that need to be answered by their analysis. Some of the questions that the Data Scientist might be asked include:
· How much or how many? E.g., identifying the number of new customers likely to join your company in the next quarter. (Regression analysis)
· Which category? E.g., assigning a document to a given category for a document management system. (Classification analysis)
· Which group? E.g., creating a number of groups (segments) of your customers based on their monetary value. (Clustering)
· Is this weird? E.g., detecting suspicious customer activity at a credit card company to identify potential fraud. (Anomaly detection)
· Which items would a user prefer? E.g., recommending new products (such as movies, books, or music) to existing customers. (Recommendation systems)
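As a quick mnemonic, this mapping from business question to technique family can be sketched as a lookup table (a toy illustration, not a formal taxonomy):

```python
# A toy lookup table pairing each business question with its technique
# family, mirroring the list above. Purely illustrative.
question_to_technique = {
    "How much or how many?": "regression",
    "Which category?": "classification",
    "Which group?": "clustering",
    "Is this weird?": "anomaly detection",
    "Which items would a user prefer?": "recommendation systems",
}

print(question_to_technique["Is this weird?"])  # → anomaly detection
```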
Data Mining
After identifying the objective for your analysis and agreeing on the analytical question(s) that need to be answered, the next step is to identify and gather the required data.
Data mining is a process of identifying and collecting data of interest from different sources - databases, text files, APIs, the Internet, and even printed documents. Some of the questions that you may ask yourself at this stage are:
· What data do I need in order to answer my analytical question?
· Where can I find this data?
· How can I obtain the data from the data source?
· How do I sample from this data?
· Are there any privacy/legal issues that I must consider prior to using this data?
Data Cleaning
Data cleaning is usually the most time-consuming stage of the Data Science process. This stage may consume 50-80% of a Data Scientist's time, as there are a vast number of possible problems that make the data "dirty" and unsuitable for analysis. Some of the problems you may see in the data are:
· Inconsistencies in data
· Misspelled text data
· Outliers
· Imbalanced data
· Invalid/outdated data
· Missing data
This stage requires the development of a careful strategy on how to deal with these issues. Such a strategy may vary substantially between different analyses depending on the nature of the problems being solved.
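To make such a strategy concrete, here is a minimal sketch of a few cleaning steps in plain Python. The records, the misspelling correction, the plausible-age range, and the median imputation rule are all invented for illustration:

```python
from statistics import median

# Hypothetical raw records with typical "dirty data" problems:
# inconsistent casing, a misspelling, an implausible outlier, and a missing value.
raw = [
    {"city": "Nairobi", "age": 34},
    {"city": "nairobi", "age": None},   # missing age
    {"city": "Nairobbi", "age": 29},    # misspelled city
    {"city": "Mombasa", "age": 410},    # invalid/outlier age
]

# 1. Fix inconsistencies: normalize casing, then correct known misspellings.
corrections = {"Nairobbi": "Nairobi"}
for rec in raw:
    city = rec["city"].title()
    rec["city"] = corrections.get(city, city)

# 2. Treat implausible values as missing.
for rec in raw:
    if rec["age"] is not None and not 0 < rec["age"] < 120:
        rec["age"] = None

# 3. Impute missing ages with the median of the remaining valid values.
valid_ages = [r["age"] for r in raw if r["age"] is not None]
fill = median(valid_ages)
for rec in raw:
    if rec["age"] is None:
        rec["age"] = fill

print(raw)
```

In practice this kind of work is usually done with a library such as pandas, but the logic (normalize, invalidate, impute) is the same.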
Data Exploration
Data exploration or Exploratory Data Analysis (EDA) helps highlight the patterns and relations in data. Exploratory analysis may involve the following activities:
· Calculating basic descriptive statistics such as the mean, the median, and the mode
· Creating a range of plots, including histograms, scatter plots, and distribution curves, to identify trends in the data
· Building interactive visualizations to focus on a specific segment of data
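The descriptive-statistics step can be sketched with Python's standard library (the sales figures below are invented):

```python
from statistics import mean, median, mode

# Hypothetical daily sales counts for a quick exploratory summary.
sales = [12, 15, 15, 18, 22, 15, 30, 27, 18, 15]

print("mean:  ", mean(sales))    # average value
print("median:", median(sales))  # middle value, robust to outliers
print("mode:  ", mode(sales))    # most frequent value
```

For the plotting activities listed above, libraries such as matplotlib or seaborn are the usual tools.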
Feature Engineering
A "feature" is a measurable attribute of the phenomenon being observed - the number of bedrooms in a house or the weight of a vehicle. Based on the nature of the analytical question asked in the first step, a Data Scientist may have to engineer additional features not found in the original dataset. Feature engineering is the process of using expert knowledge to transform raw data into meaningful features that directly address the problem you are trying to solve. For example, you might combine weight and height to calculate the Body Mass Index (BMI) of the individuals in the dataset. This stage will substantially influence the accuracy of the predictive model you construct in the next stage.
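The BMI example can be sketched in a few lines (the records are hypothetical):

```python
# Hypothetical records containing only the raw attributes.
people = [
    {"name": "A", "weight_kg": 70.0, "height_m": 1.75},
    {"name": "B", "weight_kg": 92.0, "height_m": 1.80},
]

# Engineer a new feature: BMI = weight (kg) / height (m) squared.
for p in people:
    p["bmi"] = round(p["weight_kg"] / p["height_m"] ** 2, 1)

print([p["bmi"] for p in people])  # → [22.9, 28.4]
```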
Predictive Modeling
Modeling is the stage where you use mathematical and/or statistical approaches to answer your analytical question. Predictive Modeling refers to the process of using probabilistic statistical methods to try to predict the outcome of an event. For example, based on employee data, an organization can develop a predictive model to identify employee attrition rates in order to develop better retention strategies.
Choosing the "right" model is often a challenging decision as there is never a single right answer. Selecting a model involves balancing the accuracy and computational cost of the analysis process. For example, some recent approaches in predictive modeling such as deep learning have been shown to offer vastly improved accuracy of results, but with a very high computational cost.
Data Visualization
After deriving the required results from a statistical model, visualizations are normally used to summarize and present the findings of the analysis process in a form that is easily understandable by non-technical decision-makers.
Data visualization could be thought of as an evolution of visual communication techniques as it deals with the visual representation of data. There is a wide range of different data visualization techniques, from bar graphs, line graphs, and scatter plots to alluvial diagrams and spatiotemporal visualizations, each of which will work better for presenting certain types of information.
Summary
In this lesson, we looked at the end-to-end Data Science process to give a sense of the activities that Data Scientists engage with.
Problems Data Science Can Solve
Introduction
In this lesson, we will look at what data science is and the different kinds of problems that it can be used to solve. By the end of the lesson, you should be able to answer which technique you would use as a professional data scientist for a particular business problem.
Objectives
You will be able to:
· Describe the problems data science can solve
What Problems Can Data Science Solve?
Congratulations on deciding to become a data scientist! Before we dig into the details of the tools and techniques that you'll need to learn, it's important to take a little time to understand what you'll be able to do once you graduate. Here is a list of some of the common types of business problems data scientists are expected to solve.
1. Regression: How much or how many?
Regression analysis is used to predict a continuous value - such as the number of staff you'll need for a busy shift or the likely sale price of a house.
Example: Sales or Market Forecasts
Traditional trend analysis only looks at how one business measure changes with respect to another. Regression analysis can provide insight into how an outcome will change when several other variables are modified.
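As a minimal sketch of regression, here is a one-predictor least-squares fit using the closed-form slope and intercept formulas; the data is made up:

```python
# Simple linear regression via the closed-form least-squares formulas.
# Hypothetical data: advertising spend (x) vs. sales (y).
xs = [1, 2, 3, 4, 5]
ys = [30, 35, 41, 44, 50]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# slope = covariance(x, y) / variance(x)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar

def predict(x):
    return intercept + slope * x

print(round(predict(6), 1))  # forecast for an unseen spend level
```

A real sales forecast with several predictors would use multiple regression (e.g. scikit-learn's LinearRegression), but the principle of fitting a line that minimizes squared error is the same.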
2. Classification: Which category?
Classification is used to predict which category something will fall into. If you're trying to figure out whether a client is likely to default on a loan (i.e., default or no default) or which of your products a customer is likely to prefer, you're dealing with a classification problem.
Example: Credit Rating
Credit card companies receive hundreds of thousands of applications for new credit cards every week. These applications contain detailed information on the social, economic, and personal attributes of applicants. Classification analysis allows companies to categorize these applications according to the applicants' creditworthiness.
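A toy illustration of classification, assuming just two invented applicant features (income in $1000s and debt ratio) and a nearest-centroid decision rule; real credit models are far more involved:

```python
# Nearest-centroid classification on two hypothetical features.
train = [
    ((65, 0.20), "good"),
    ((80, 0.15), "good"),
    ((30, 0.60), "poor"),
    ((25, 0.55), "poor"),
]

def centroid(points):
    """Mean point of a list of 2-D feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

centroids = {
    label: centroid([x for x, y in train if y == label])
    for label in {"good", "poor"}
}

def classify(applicant):
    # Pick the label whose class centroid is closest (squared distance).
    return min(
        centroids,
        key=lambda lbl: sum((a - c) ** 2 for a, c in zip(applicant, centroids[lbl])),
    )

print(classify((70, 0.18)))
```

Note that the two features sit on very different scales, so income dominates the distance here; in practice features are standardized before training.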
3. Anomaly detection: Is this weird?
Anomaly detection is a common data science technique used to find unusual patterns that do not conform to expected behavior. It has applications across various industries from intrusion detection (identifying strange patterns in network traffic that could signal a hack) to fraud detection in credit card transactions to fault detection in operating environments.
Example: Identifying Fraud
This approach focuses on finding outliers in the data that appear to have unusual patterns, which serve as a first indication of fraudulent activity. Such approaches are also frequently applied by large social networks like Facebook and Twitter.
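A minimal anomaly-detection sketch: flag transactions whose z-score exceeds a threshold. The amounts and the 2-sigma cutoff are arbitrary:

```python
from statistics import mean, stdev

# Hypothetical daily transaction amounts; one is suspiciously large.
amounts = [42, 38, 45, 41, 39, 44, 40, 980]

mu = mean(amounts)
sigma = stdev(amounts)

# Flag anything more than 2 standard deviations from the mean.
outliers = [a for a in amounts if abs(a - mu) / sigma > 2]
print(outliers)  # → [980]
```

A single extreme value inflates the mean and standard deviation, so robust statistics (median, MAD) or dedicated methods such as isolation forests are preferred in practice.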
4. Recommender systems: Which item would a user prefer?
Recommender systems are one of the most popular applications of data science today. They are used to predict user preferences for a product or service. Almost every major tech company (Amazon, Netflix, Google, Facebook) has applied them in some form or another. You might have noticed phrases like "If you like this product, you may also like ...", "Users who bought this item also bought ...", and "Based on your preferences, we recommend the following products ...". You got it, these are all recommender systems in action.
Recommender systems can help a business retain customers by providing tailored suggestions specific to their needs. They can help increase sales and create brand loyalty through relevant personalization. When customers feel understood by your brand, they are more likely to stay loyal and continue purchasing through your site. According to a widely cited McKinsey study, up to 75% of what consumers watch on Netflix comes from the company's recommender system, and Amazon credits recommender systems with 35% of its revenue. Best Buy decided to focus on its online sales, and in the second quarter of 2016 it reported a 23.7% increase, thanks in part to its recommender system.
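"Users who bought this item also bought ..." can be sketched as simple co-purchase counting (the baskets below are invented):

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories, one set of items per customer.
baskets = [
    {"book", "lamp"},
    {"book", "lamp", "desk"},
    {"book", "desk"},
    {"lamp", "mug"},
]

# Count how often each pair of items is bought together.
co_counts = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[(a, b)] += 1

def also_bought(item):
    """Items most often purchased alongside `item`, best first."""
    scores = Counter()
    for (a, b), n in co_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common()]

print(also_bought("book"))
```

Production recommenders use techniques such as matrix factorization, but co-occurrence counting is the intuition behind "also bought" lists.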
Summary
While you're going to learn to use a wide range of tools and techniques throughout this course, most of them will be used to predict a continuous value, decide the most likely category for a value, identify anomalies, or provide recommendations.
Data Privacy and Data Ethics
Introduction
Data ethics and data privacy are integral to any data project. There are obvious cases, such as protecting the privacy of individual health records under HIPAA. There are also many gray areas surrounding what constitutes personally identifiable information (PII), which occur throughout many industries including advertising, finance, and consumer goods. You may have noticed that, starting around the summer of 2018, many websites began showing privacy policy notices asking you to accept the use of cookies. This was a result of Europe's GDPR legislation. You are also probably aware of the Cambridge Analytica debacle surrounding the 2016 United States presidential election. As a data practitioner, it is your responsibility to uphold data ethics in a fast-changing environment.
Objectives
You will be able to:
· Determine whether or not a data science procedure meets an ethics standard
Examples
Data Breaches
If the data you are handling is valuable, then security should be a primary concern. Data breaches are all too common, and often such leaks of sensitive information could have been avoided if businesses and organizations had followed standard security protocols. While there are thousands of such cases, two of the biggest breaches to catch the public's attention were Cambridge Analytica's misuse of Facebook data to influence political elections, and Equifax's leak of roughly 147 million individuals' Social Security numbers and credit data.
Identifying PII
PII stands for personally identifiable information. While some data, such as Social Security numbers and medical records, is clearly PII, other pieces of data may or may not qualify as PII depending on the jurisdiction. In the United States, for example, there are two federal regulations: the Health Insurance Portability and Accountability Act (HIPAA) and the Privacy Act of 1974. While these acts aim, in theory, to protect the use, collection, and maintenance of personal data, the scope of what constitutes PII and the regulations surrounding the handling and use of such data are generally antiquated. For example, a user's IP address has been categorized as non-PII by several U.S. courts, despite it being a unique identifier for most individuals' home internet connections. Protections were further eroded by the FCC's rollback of net neutrality rules under Chairman Ajit Pai, which took effect in mid-2018. Aside from federal jurisdiction, several states, most notably California, have their own data protection laws for the benefit and protection of users and consumers.
GDPR
GDPR stands for the General Data Protection Regulation. It was adopted by the European Union on April 14th, 2016 and went into effect on May 25th, 2018. GDPR protects the data rights of all European citizens and is an example of how legislation will have to change and adapt to the online digital era of the 21st century. GDPR has implemented more widespread regulations surrounding what constitutes PII and allows fines of up to 4% of a company's annual global revenue.
Data Best Practices
There are two primary practices that you should follow when dealing with PII and other sensitive data. The first is to encrypt sensitive data: when in doubt, encrypt. The second is to ask yourself what level of information you really need. Large organizations often have data cleaning teams that scrub sensitive data such as names and addresses before passing the data to analysts and others to mine. Ultimately, any well-thought-out strategy will include multiple layers, safeguards, and other measures to ensure data is safe and secure.
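As one concrete flavor of "scrub before you share", here is a sketch that replaces a direct identifier with a salted one-way hash and drops the fields the analysis does not need. The record and salt are made up; note that hashing is pseudonymization rather than encryption, and real systems add proper key management on top:

```python
import hashlib

# A hypothetical record containing direct identifiers.
record = {"name": "Jane Doe", "email": "jane@example.com", "purchase": 42.50}

SALT = b"replace-with-a-secret-salt"  # keep real salts out of source control

def pseudonymize(value: str) -> str:
    """Salted one-way hash: lets analysts join on identity without seeing it."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:12]

# Keep only what the analysis needs; replace the identifier with a pseudonym.
scrubbed = {
    "user_id": pseudonymize(record["email"]),
    "purchase": record["purchase"],
}
print(scrubbed)
```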
Data Collection Processes
When collecting data, it is important to ensure you are not gathering it in a manner that will generate bias. For example, if Data Scientists are not careful in the way they phrase survey questions, they can generate misleading results. If a poll contained the question "How poorly has Politician X performed when it comes to the economy?", it would add a negative connotation to the question. That phrasing might make people say Politician X performed worse than if they had merely been asked "How has Politician X performed when it comes to the economy?"
In some cases, choosing which variables to collect and how to define them can also contain bias. You’ll notice that in some of the datasets we use, gender is represented as a binary value, and race is referenced in an insensitive manner. This is an artifact of the societal conditions at the time the data was collected. As soon-to-be Data Scientists, it will be your responsibility to ensure that data collection is done in an inclusive manner.
Algorithm Bias
People often trust algorithms and their output based on measurements such as "this algorithm has 99.9% accuracy". However, while algorithms such as linear regression are mathematically sound and powerful tools, the resulting models are simply reflections of the data that is fed in. Logistic regression and other algorithms are used to inform a wide range of decisions, including whether to provide someone with a loan, the severity of criminal sentencing, and whether to hire an individual for a job. (Do a quick search online for "algorithm bias".) In all of these scenarios, it is important to remember that the algorithm is simply reflective of the underlying data itself. If an algorithm is trained on a dataset in which African Americans have faced disproportionate criminal prosecution, the algorithm will continue to perpetuate those racial injustices. Similarly, algorithms trained on data reflecting a gender pay gap will continue to promote that bias. As such, substantial thought and analysis regarding the problem setup and the resulting model are incredibly important.
Gray Areas and Forward Thinking
Aside from practices that are overtly illegal under current legislation, data privacy and ethics raise a myriad of open questions. For example, should IP addresses or cookies be considered PII? How should security camera footage be handled? What about vehicles such as Google Street View cars, which capture video and pictures of public places? Some companies are now even taking pictures of license plates to track car movements. Should they be allowed to maintain massive databases of such information? What regulations should be put on these and other potentially sensitive datasets?
All of these examples question where and when limits should be put on data. Dystopian stories such as 1984 are much more accurate than one might expect. Moreover, injustices and questionable practices still abound. For example, despite public outcry at debacles like Cambridge Analytica, many companies with nearly identical practices still exist, such as Applecart in New York City, which collects and sells user data to the Republican party, amongst others.
To stay current, you should identify some news sources that track tech and privacy trends. One great resource is the Electronic Frontier Foundation (EFF). EFF put together an article called "Fix It Already", outlining fixable mishaps by technology companies that continue to be ignored. Take a look at the article and get involved to put pressure on these organizations and your representatives to shape up. Here's a quick preview of their list:
· Android should let users deny and revoke apps' Internet permissions
· Apple should let users encrypt their iCloud backups
· Facebook should leave your phone number where you put it
· Slack should give free workspace administrators control over data retention
· Twitter should end-to-end encrypt direct messages
· Venmo should let users hide their friend lists
· WhatsApp should get your consent before adding you to a group
· Windows 10 should let users keep their disk encryption keys to themselves
Disclaimer
As a final note, be aware that online data can include offensive or inappropriate material. For example, when acquiring data from an API such as Twitter's, there is potential to encounter lewd or offensive content. While many of these services will eventually screen out and remove particularly egregious cases, plenty of trolls still exist.
Additional Resources
There's a multitude of resources to get involved with data privacy and ethics, but here are a few to get you started.
· GDPR
· HIPAA
Summary
In this lesson, you got a preview of some of the many issues regarding data privacy and ethics. From GDPR to being aware of your own data aura, there's plenty to keep you busy and on your toes regarding this fascinating perspective on the data industry.