Categories of data scientists – where do you want to be?
I lay out three broad choices for any aspiring data scientist seeking advice: do you want to be a slave of a chief scientist, do you want to be among the masses, or do you want to go deep into the foundations?
The first category covers mostly those who are limited self-learners and may only have been exposed to an online course. They know the jargon of machine learning and the inputs and outputs of a handful of techniques. Some moved from traditional internal IT departments to newly formed data science divisions, hoping for a more rewarding and cooler career. Their managers use them mainly for the time-consuming and laborious process of preparing data, tell them which packages and specific techniques to learn, and instruct them in setting the appropriate parameters. They often come with a good background in generating reports, mostly by querying existing data sources. They will not be able to explain what a p-value is, even for a simple regression model, but will know how to discard irrelevant attributes based on its value. They will also know what a neural network or a decision tree looks like. For these professionals, deep linguistic processing still means searching for words in text.
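To make the p-value remark concrete, here is a minimal sketch of what the number means for a one-attribute regression. It is an illustration only: it uses a permutation test instead of the t-test that most regression packages report, and the function names, the toy data, and the 0.05 cut-off are assumptions chosen for the example.

```python
import random

def slope(xs, ys):
    """Least-squares slope of y on x for a one-attribute regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Estimate how often a slope at least as extreme as the observed one
    appears when the link between x and y is broken by shuffling y.
    (Hypothetical helper; a swapped-in technique, not the usual t-test.)"""
    rng = random.Random(seed)
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(slope(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)

xs = list(range(10))
related = [2 * x + 1 for x in xs]                     # y clearly depends on x
unrelated = [1 if i % 2 == 0 else -1 for i in xs]     # y ignores x

p_related = permutation_p_value(xs, related)
p_unrelated = permutation_p_value(xs, unrelated)

# Assumed rule of thumb: keep the attribute only if p < 0.05.
for name, p in [("related", p_related), ("unrelated", p_unrelated)]:
    print(name, round(p, 4), "keep" if p < 0.05 else "discard")
```

A small p-value says that a slope this large would rarely arise by chance if the attribute had no relationship with the target. That is the reasoning behind discarding attributes whose p-value is large.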
The second category constitutes most of the so-called professional data scientists. They come from a variety of disciplines in which they started their careers, and subsequently obtained a master’s degree from one of the mushrooming university departments that have started offering degrees in data science. They have studied algorithms and know the parameters that affect algorithmic performance. They usually get goal-oriented assignments from their managers, meaning the problem statement is provided but the laborious process of collecting and preparing data is still left to them. They can run predictive algorithms in R and Python using various packages, vary the values of the parameters knowing their effect at a very high level, and then select the combination that gives the best performance. They think of analytics as a “bag of tricks”, meaning they adopt whatever techniques solve the current problem. But they still think that tensors are just blocks of data, that AI is all about machine and deep learning, and that “Bayesian” is just a buzzword. They will know eight or so popular machine learning techniques, but it is unlikely that they will have any knowledge of gradient-boosting-type algorithms. They will know well how to code deep learning in Keras/TensorFlow, on AWS, and on other similar platforms, but will have difficulty explaining dropout, vanishing gradients, and the like, or the appropriateness of different activation functions under different circumstances.
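The "vary the parameters and pick the best combination" workflow described above can be sketched as a plain grid search. This is a minimal illustration under stated assumptions: the tiny k-nearest-neighbour regressor, the parameter names `k` and `weighted`, and the toy data are all hypothetical, and a real project would typically lean on a library and cross-validation instead.

```python
from itertools import product

def knn_predict(train, x, k, weighted):
    """Predict y at x from the k nearest training points (1-D inputs)."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    if weighted:
        # Closer neighbours count more; the small constant avoids division by zero.
        weights = [1.0 / (abs(px - x) + 1e-9) for px, _ in neighbours]
    else:
        weights = [1.0] * len(neighbours)
    return sum(w * py for w, (_, py) in zip(weights, neighbours)) / sum(weights)

def validation_mse(train, valid, k, weighted):
    """Mean squared error on a held-out validation set."""
    errors = [(knn_predict(train, x, k, weighted) - y) ** 2 for x, y in valid]
    return sum(errors) / len(errors)

def grid_search(train, valid, grid):
    """Score every parameter combination and return the best one."""
    scores = {}
    for k, weighted in product(grid["k"], grid["weighted"]):
        scores[(k, weighted)] = validation_mse(train, valid, k, weighted)
    best = min(scores, key=scores.get)
    return best, scores

# Toy data (assumed for illustration): y = x^2 sampled at different points.
train = [(x, x * x) for x in range(0, 20, 2)]
valid = [(x, x * x) for x in range(1, 19, 4)]
grid = {"k": [1, 2, 3, 5], "weighted": [False, True]}

best_params, scores = grid_search(train, valid, grid)
print("best (k, weighted):", best_params)
```

Nothing in the loop requires knowing why a given setting works, which is the point being made: the search finds a good combination, but explaining it takes the deeper background of the third category.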
Fewer than two percent of all professionals are in the third category, and its share will continue to increase as the data science field matures. Professionals in the third category are those who have degrees in the foundational disciplines, such as mathematics, probability and statistics, linear algebra, and, more broadly, the theory of computer science and artificial intelligence. Most not only know some of the algorithms well but also the foundational mathematics behind them, such as the cost function formulation and techniques for convex optimization, along with their geometric interpretations. Many come with strong publication backgrounds and tend to solve everything with only the handful of techniques they have mastered. They are therefore highly biased in their approach to solving problems. But they are likely to be aware of all the latest and greatest in the field. Many in this category lack a sense of practical usability and fail to explain their results to naïve users. We are not yet at the stage where an analytics system configures and adapts itself; hence the value of professionals from this category in building the most efficient models.
The purpose of this broad, subjective categorization is not to rank the usefulness of professionals in one category against another, but rather to help you assess the strengths and shortfalls of your existing team against your needs. In fact, you need a mixture of all three to run a data science unit successfully. You cannot make someone without a proper mathematical background do the job of the third category. Conversely, someone with a deep algorithmic background who is interested in building the best model cannot be asked to spend all their time on routine modeling and data preparation.
Now the question is – in which category do you belong?