The Data Science Dilemma - B
Rajesh CNB
Changing perceptions through content creation and storytelling | Lead Magnets, Whitepapers, E-Books, Slide Decks | Content Strategy | Content creation
It is astounding how little attention people pay to their passion when choosing a career.
I mean, is it not logical to see if you can earn money while doing something that you really love, rather than doing something that you don't really like and then struggling with Work-Life Balance?
I find that term itself a bit misleading. It allows people to think that work is different from life and that the two are different worlds. To me, work is a part of life, and it needs to be balanced with all the other parts of life, like family and social relationships. Well, this is not an article about Work-Life Balance, but about a career in data science. Let us get to the point!
So, once you have decided that you love data and you are willing to go the whole nine yards into the data science field, then the next step is to look into what skills you should begin to acquire for the job. In this part, I will dig into that.
What Knowledge and Skills should a data scientist have?
Data Science, as we know it today, is an amalgam of three different streams of study: (A) Statistics, (B) Programming and (C) Business/Domain Knowledge. Let me elaborate from last to first.
(C) Business/Domain Knowledge: Data/information on its own has no real value. It gains value only when it finds an application. News that Donald Trump is not really favorable to Indians working in the USA is not relevant, and has no value, to a construction worker doing his job at a site in India. But it begins to have value if his son or daughter is in college and has won a scholarship to go to the USA, but may now be unable to go because of the new policy! In any domain (more so in business), one needs to know how to use data to arrive at meaningful insights that can drive the direction of the business. Collecting, processing, analyzing and interpreting data, and generating insights from it, can be effective and fruitful only when one is clear about how the data will be useful to the given domain. So! What is the problem?
Today's problem is the problem of excess. And the solution is the ability to Curate.
To curate is to select, organize, distill, clarify, contextualise, filter, connect, elevate, contrast, juxtapose, narrate and a little more. We have an excess of data; a large volume of it exists. From this excess we need to curate the data we need and then proceed to analyse it. That is the biggest challenge, and without knowing something about the given domain, curation is not possible.
(B) Programming: A lot of beginners have asked "Can I learn data science without learning programming?" And my answer is Not Really! There are two challenges that make the use of computers (and hence by logical extension, programming!) a necessity.
The first is the amount of data one can generate from the web. I mean, look at the amount of data that our social networks generate every single minute. The more we go online, the more data we generate and this data can be pretty useful in different domains (one of which is obviously, Business!). To believe that such volume can be handled manually is to believe that you can drain out a Tsunami flood with a tea-spoon! It can be done, no doubt, but...
The second is that the data is not available to you in a form that is amenable to analysis. For example, product reviews on Amazon, or the number of job changes a person has made in the last five years on LinkedIn, are all available online, but they are not directly amenable to analysis. We need to programmatically collect the data, process it and fine-tune it (in data science we call this process "Data Wrangling") to get it into a form that makes analysis possible. And without computer programs and software, this is not possible.
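To give a feel for what data wrangling looks like, here is a minimal Python sketch. The review lines are made up by me for illustration (they are not real Amazon data), and the names RAW_REVIEWS, RATING_PATTERN and wrangle are hypothetical. It simply turns messy text into clean records that can then be analysed.

```python
import re
from statistics import mean

# Hypothetical raw review snippets, invented for illustration only.
RAW_REVIEWS = [
    "  5 stars | Great phone, battery lasts two days  ",
    "2 STARS | Screen cracked within a week",
    "4 stars|Good value for money",
    "no rating | Delivery was late",
]

# One digit, the word "star(s)" in any case, a pipe, then the review text.
RATING_PATTERN = re.compile(r"^\s*(\d)\s*stars?\s*\|\s*(.+?)\s*$", re.IGNORECASE)


def wrangle(raw_lines):
    """Turn messy text lines into clean records; skip lines without a rating."""
    records = []
    for line in raw_lines:
        match = RATING_PATTERN.match(line)
        if match is None:
            continue  # no usable rating, so the line is dropped
        records.append({"rating": int(match.group(1)), "text": match.group(2)})
    return records


reviews = wrangle(RAW_REVIEWS)
average_rating = mean(r["rating"] for r in reviews)

for review in reviews:
    print(review)
print("Average rating:", round(average_rating, 2))
```

Real-world wrangling is usually far messier than this, but the idea is the same: collect, clean, structure, and only then analyse.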
Hey! But then, this is not a show-stopper.
Many of the programming languages used in data science, such as R and Python, are high-level languages. They use English-like statements and are simple to understand.
Not that you needn't sweat at all. But if you spend three months or so learning one, you can ace it. Further, if you need some complicated stuff done, you can always hire someone geeky from youngistan who knows how to program! But then, you should be able to guide him/her adequately and solve any problems that come in the way!
(A) Statistics: I need not emphasise the use of statistical modelling. It is needless to say that a sound knowledge of statistics is required to make scientific inferences that will allow us to conclusively say something.
A data scientist is expected to develop statistical models that would process the data and give information that will further lead to insights into the contextual domain.
Basic techniques such as regression and correlation, along with knowledge of probability distributions and descriptive statistics, are the most essential requirements. Advanced data science techniques like Machine Learning and Natural Language Processing have specific statistical tools, which can be learnt based on interest as well as opportunity for exposure.
Let me add a disclaimer here! I am not saying that one must be an expert in all three areas. But one must be an expert in at least one area and have a thorough knowledge of the basics of the other two. The difficulty with learning lies primarily in our education setup.
Most learning in data science is self-driven and online as it is hard to find established institutions that offer authentic certification courses.
IITs and IIMs have embarked on executive education and have introduced basic and fundamental courses in data science in their regular curriculum, but we are yet to see a large number of institutions embracing this emerging field in their curriculum. Given the laid-back approach of our education system, this may take some time. But eventually, we shall get there!
In the next, concluding post, let us look at two questions: Where do I begin? And where will I eventually land up?