Data Science in Industry: Monetize it.

CRISP-DM Method

What is data science?

It is not a new area, apart from the shiny buzzword; the roots of data science are old, going back even to the 1800s. Data science is everywhere data is. The CRISP-DM method is a good schematic to help us understand the life cycle of a data science project.

Business Understanding:

First, business understanding of the problem is key: understanding what problem we are trying to solve, what advantages solving it will bring to the table, and what the disadvantages are if the problem goes unsolved. A car without a destination cannot do much, even if it is a Mercedes, BMW, or Tesla. So first things first: we need a destination.

Data Understanding:

After the destination is set, we need to understand the data we have in hand. If we do not have enough data for the problem, making the right plan to collect the right data is the key step here. Domain knowledge comes in extremely handy: if you are an expert in finance, for instance, and the problem is a finance problem, understanding the data will be much easier for you. If you are not a subject matter expert, taking small steps toward that goal along the way will be highly beneficial.

Understanding the types of data is key. We have two main data types: categorical and numerical. Categorical variables can be divided into two: nominal (blue, green, yellow, etc.) and ordinal (ordered categories such as small, medium, large). Numerical variables can also be divided into two: interval and ratio.
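As a sketch of the distinction, here is how the two categorical flavors can be represented in pandas; the column names and values are made up for illustration:

```python
import pandas as pd

# Hypothetical data: "color" is nominal, "size" is ordinal,
# "temperature_c" is interval, "salary" is ratio.
df = pd.DataFrame({
    "color": ["blue", "green", "yellow"],
    "size": ["small", "large", "medium"],
    "temperature_c": [20.5, 25.0, 18.0],
    "salary": [50000, 62000, 48000],
})

# Nominal: categories with no inherent order
df["color"] = pd.Categorical(df["color"])

# Ordinal: categories with a meaningful order
df["size"] = pd.Categorical(
    df["size"], categories=["small", "medium", "large"], ordered=True
)

# Order-aware operations now work on the ordinal column
print(df["size"].min())  # small
```

Marking the order explicitly matters: without it, pandas (and many models) would treat "small" and "large" as unrelated labels.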

Data Preparation (commonly where the sweat and tears are; 80% of the project):

Once you have all the data you need for the problem, here comes the fun part (you had better love it, as it is about 80% of data science): data preparation. These days it is all too common to jump straight into machine learning algorithms without paying enough attention to the tedious data pre-processing piece. That is a textbook garbage-in, garbage-out scenario.

Some of the main things to check: is my data missing values? (The answer is yes 98.7% of the time!) If so, the question becomes how we fill them in, or whether we drop those records instead. After this step, we look at the statistical properties of the data. Imagine a dataset with ages and salaries: if 2 people out of 100 make 500K/year while the average salary is 50K/year, we need to treat those outliers properly. Or picture a drone counting dinosaurs versus ants from the top down: because of the size difference, we will have trouble counting the ants unless we can view them at a fair scale. Similarly, if one variable has very large values and another very small ones, we apply normalization or standardization techniques, depending on the data, to prepare them for the model.
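The three steps above (filling missing values, treating outliers, and scaling) could look roughly like this in pandas and scikit-learn; the dataset, the median fill, and the IQR capping rule are illustrative choices, not the only options:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two missing salaries and one extreme outlier (500K)
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "salary": [48000, 52000, np.nan, 500000, 50000, np.nan],
})

# 1. Fill missing values; the median is robust to the 500K outlier
df["salary"] = df["salary"].fillna(df["salary"].median())

# 2. Cap outliers with the IQR rule (1.5 * IQR above the third quartile)
q1, q3 = df["salary"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
df["salary"] = df["salary"].clip(upper=upper)

# 3. Standardize so "age" and "salary" live on comparable scales
scaled = StandardScaler().fit_transform(df[["age", "salary"]])
print(scaled.mean(axis=0))  # each column is now centered near 0
```

Whether you fill, drop, or cap depends on the data and the business question; this sketch just shows the mechanics.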

Another example of the pre-processing piece is turning categorical data (like blue, green, yellow) into numerical data (1, 2, 3). We may need to do this because some models only accept numerical input. For this, we have methods like one-hot encoding and label encoding.
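A minimal sketch of both encodings using pandas; the `color` column is a made-up example:

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"color": ["blue", "green", "yellow", "blue"]})

# One-hot encoding: one binary column per category (no false order implied)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Label encoding: map each category to an integer. This implies an order,
# so it usually suits ordinal data or tree-based models better.
df["color_code"] = df["color"].astype("category").cat.codes
print(df["color_code"].tolist())  # [0, 1, 2, 0] (categories sorted alphabetically)
```

One-hot encoding is the safer default for nominal data like colors, since label codes can trick distance-based models into thinking "yellow" is twice "green".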

Modeling:

Let us start by saying that even if you have never used machine learning before, I guarantee you that you are a machine learning model. Congratulations! Well… I hear you asking, how so? Imagine you are in line at a fast-food restaurant waiting to order, and just ahead of you is a 6-year-old kid. What is the chance that he or she will order from the kids' menu? Almost 98%, unless the kid has not been fed all day and is starving. See what you did there: based on past experience, which was your training data, you built a model and tested it on this 6-year-old kid. Similarly, data scientists split the data into training and test sets (a 70/30 split is common), build the model using only the training data, and then use the test data to evaluate it. Experience and knowledge are highly valuable here in deciding which model to use, depending on the data and the business goal.
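The train/test workflow can be sketched with scikit-learn; the synthetic dataset and the logistic regression model are stand-ins for whatever your real data and chosen algorithm would be:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset standing in for real business data
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Common 70/30 split: the model only ever sees the training portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Accuracy on held-out data, like quizzing the kid you never trained on
print(model.score(X_test, y_test))
```

Holding out the test set is the whole point: evaluating on data the model has already seen would be like asking the kid what he just ordered.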

Evaluation:

We built our model, and now what? We need to test whether the model does a good job, using the test data. Remember the 6-year-old kid above: did he or she really order from the kids' menu?

Depending on the machine learning models we have used, and on any statistical bias in the data, our evaluation metrics will differ. For example, in a device pass/fail problem where only 3% of the records are failures, a model that predicts "pass" for everything can still score a high accuracy. In such cases we turn to evaluation methods better suited to imbalanced data (such as the AUROC score) to judge the model's success.
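A small illustration of why accuracy misleads on imbalanced data, using made-up pass/fail labels with a roughly 3% failure rate:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced labels: about 3% fail (label 1), the rest pass
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.03).astype(int)

# A useless model that always predicts "pass" with zero failure score
y_pred = np.zeros(1000, dtype=int)
y_score = np.zeros(1000)

# Accuracy looks high despite the model having learned nothing
print(accuracy_score(y_true, y_pred))
# AUROC exposes it: 0.5 is no better than random ranking
print(roc_auc_score(y_true, y_score))  # 0.5
```

The accuracy lands around 0.97 here purely because passes dominate, which is exactly the trap the AUROC avoids.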

Deployment: all this work was for one sole purpose: Monetize == 1?

Suppose we found a successful model for our problem. For example, we have detected the customers who have a high probability of abandoning their contracts, and the goal is to keep them happy so they stay. The action plan might be offering them discounts or calling them for a customer survey.
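As a sketch of turning predictions into action, assuming a hypothetical `churn_prob` column produced by the trained model:

```python
import pandas as pd

# Hypothetical scored customers; churn_prob would come from the model
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "churn_prob": [0.82, 0.10, 0.67, 0.05],
})

# Act on predictions: flag high-risk customers for a retention offer
at_risk = customers[customers["churn_prob"] >= 0.5]
print(at_risk["customer_id"].tolist())  # [101, 103]
```

The 0.5 threshold is an arbitrary illustration; in practice it would be tuned against the cost of a discount versus the cost of losing the customer.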

CRISP-DM is nice, but there is so much to learn – where do I even start?

  1. Database knowledge: Good to know the basics of SQL.
  2. Programming knowledge: Python, R
  3. Statistical knowledge: Detection of statistical bias, probability concepts.
  4. Data Science concepts: Data pre-processing (including handling missing values, turning categorical into numerical data if needed, feature scaling, etc.).
  5. Machine Learning concepts: Prediction and classification models and evaluation.
  6. Deployment Experience: Take business actions based on data, such as reaching out to customers who are likely to leave your company and offering them discounts.

Where to learn?

Education is the key; many universities now offer undergraduate degrees in data science. Online classes are available on Coursera, edX, Udemy, and Udacity. Open-source classes are available on YouTube, and online tutorials are available free of charge.

Network:

Find a meetup or start one.

Networking at university and at work.

Future of Data Science:

Everything is a piece of data science today: smart watches, navigation devices, and so on. In the near future it will be even more than this; data science will be able to recommend products to you based on your childhood data. Everything is transitioning to a digital format.

Think of drones, or real estate scouted by drones: how much will that cost 20 years from now?

In the future we will not need a license or an insurance card; our fingerprints will serve instead.

Going abroad?

Travel, blend into different cultures, and work with people from other countries; it will not only help you learn data science but also expand your life knowledge.

