Data is the New Language in the Information Age
Pushker Ravindra
Data Science | Engineering | Analytics | Computational Biology
Which is the most widely spoken language in the world? Did you say English? We focus so much on English in India that we assume it is the de facto language of the world, without realizing that there are many developed countries, like Spain, France, and Japan, where English is hardly spoken in daily life. What about Chinese? By the way, there is no language called Chinese; in China, people speak Mandarin. Isn’t it obvious that Mandarin would be the most widely spoken language in the world? Yes, it is. But by the same logic, the second place should go to some Indian language. Now, this is interesting. India is one of the few countries that doesn’t have a national language. The second slot is taken by something unexpected, i.e. Spanish. English comes in at number three, followed by Hindi with around 300 million speakers. Sometimes not having a national language has a disadvantage too. Had we had one, it would have been at least the second most spoken language in the world. For those who may start arguing that Hindi is our national language, it’s the right time to correct your facts.
Now, why do we need languages? Can we live without languages? When I come to work in an office cab, I see three of the guys in the cab talking in sign language. Two of them can't hear or speak, and the third, being their friend, has learned sign language. My company is one of the top employers of differently abled workers and the only agricultural company to earn this status. Every time I see them talking through signs, I feel pity and wish that they could speak like us and hear each other. But when I see them so engaged and happy, I start feeling pity for the rest of us, who can speak but never speak with one another. By the way, sign language is also a language. Sometimes I wonder whether it is enough to know one or two languages to communicate effectively.
Recently, when we introduced Data Science to students of UAS (GKVK), Bangalore, I asked them, ‘Which is the highest-ranked agricultural university in India?’ Not surprisingly, the unanimous answer was UAS, Bangalore. When I showed them the ranking from Careers360 magazine (Figure 1), most of the students were shocked.
Figure 1: Ranking of Indian agricultural universities (Source: Careers360).
UAS, Bangalore was far behind many agricultural universities in India. This showed that a lot of the things we assume are not backed by solid data. Now you may argue that this is not the only ranking. You are right, but
“Without data you’re just a person with an opinion” - W. Edwards Deming
One student, who was quite smart, pointed out that UAS, Bangalore is ranked second in terms of creating an impact. I liked the fact that some of the students slowly started thinking about data. They started analyzing it. That was my objective. Together we looked at multiple factors to understand why UAS, Bangalore is not the top agricultural university. If it can create such a high impact, what is lacking? The data suggested that its publication output and productivity are quite low, yet its impact is high. That means whatever it publishes is of high quality; however, the amount of research done is still small. We also discussed how research activity could be increased. Another surprising thing was that its contribution to IP (Intellectual Property) was zero. I remember that when I was at IGIB, one of the striking things was its strong IP record. Prof. Sameer K Brahmachari (SKB) was always promoting innovation and IP. It was interesting to see the students slowly start building their opinions based on data.
“Without an opinion, you're just another person with data” - Milo Jones and Philippe Silberzahn
It was heartwarming to see so much excitement and eagerness from a group of more than 40 students from an agriculture background who wanted to learn data science.
Data is getting more popular every day
Today, can you think of a company that is big and famous and doesn’t work with data? I love the way Google scans through emails containing flight or train tickets and tells us about our travel plans. Or when you are inside a mall, Google Maps knows that you are in a mall and asks whether you would like to provide a review. Google and Facebook both analyze data to display contextual advertisements. Amazon uses our shopping history and the data of similar users to recommend products. Twitter data (tweets) are analyzed to find the latest trend or buzz in the industry. Uber sets surge prices based on peak-hour demand. In the recent past, every such company has used the power of data. I remember that at one data analytics conference, a speaker gave the example of The Climate Corp, a weather data company (sold for $1 billion) that helps farmers improve profitability through weather monitoring, agronomic modeling, and weather simulations. What does Climate have? A lot of quality data related to weather. These are a few examples showing that data is becoming more important every day and is slowly becoming the new language of the information age.
Figure 2: (A) Top Programming Languages (IEEE Spectrum Survey 2016). (B) Top Analytics/Data Science Tools (KDNuggets Survey 2016).
Ironically, to understand this language (i.e. data) we need to learn a few more languages, namely programming languages. As per the IEEE Spectrum survey 2016, C is the most preferred programming language in the world (Figure 2A). One of the reasons could be that it is still the most preferred language for embedded systems. But if you talk about Data Science, R and Python are the most popular and widely used languages (KDnuggets survey, 2016; Figure 2B). The IEEE Spectrum survey covers the most popular languages for embedded systems, web, mobile, and enterprise, but that doesn't mean that other languages can't be used in those areas: Python can be used for mobile (Kivy) and R for the web (Shiny). We could keep arguing about whether R or Python is better, but there is no denying that you need to know one of them to say that you work in Data Science. A wise person would exploit the power of both languages. If you want a very good comparison between the two, read more on DataCamp. But to make you happier, most of the studies suggest that R pays more than Python. I have programmed in many languages, from C and C++ to Java, Perl, PHP, and R, and I feel that R would be the simplest to start with. Still, in data science, programming is only one component; you need statistics and domain knowledge too.
How to acquire Data Science skills?
There are multiple MOOCs available for learning Data Science skills. I prefer Coursera and edX. But before you start, the most important question is why you are doing it. What questions do you want to ask of your data? Once you are sure about that, you can go for the following courses.
Coursera: Data Science Specialization by JHU
edX:
For R Programming I also like O’Reilly’s Code School and Bootcamp. They provide interactive sessions where you can try R without any installation.
Figure 3: (A) Data Science (Source: Drew Conway). (B) Data Cleaning in Progress (pun intended).
MOOCs are not enough
If you think you will become a Data Scientist just by doing these courses, you may be wrong. You will need practical exposure to real data and hands-on work with it. Most importantly, you need a passion for data. Do not forget that Data Science has three components and one of them is “Domain Expertise” (Figure 3A). Therefore, you can’t ignore domain knowledge at all. It takes years to become a domain expert.
"When domain experts start coding, they do miracles" - Pushker Ravindra
One of my colleagues always asks me: if predictive analytics looks so simple, where is the challenge? After all, you just need to partition your data into training and test sets, fit a model on the training set, validate it on the test set, and measure the accuracy. If the accuracy is high, your prediction model is working. In R or Python, you can do this in a few lines of code. But understanding your data is very important, and deciding which features to choose for modeling comes only with experience. If you try to do it in a high-throughput way, i.e. take all the features and run your model, you will mostly add noise. Research suggests that data scientists spend more than 80% of their time on data cleaning, and data cleaning requires an understanding of the data.
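To make the "few lines of code" point concrete, here is a minimal sketch of the partition, fit, validate workflow in plain Python. Everything in it is an assumption made for illustration: the synthetic two-class data, the helper names (make_toy_data, train_test_split, fit_centroids, predict, accuracy) and the choice of a simple nearest-centroid model. In practice you would more likely reach for a library in R or Python, and real data would not be this clean.

```python
import random

def make_toy_data(n_per_class=100, seed=0):
    """Generate a synthetic two-class dataset (hypothetical, illustration only)."""
    rng = random.Random(seed)
    rows = []
    for label, center in ((0, (0.0, 0.0)), (1, (3.0, 3.0))):
        for _ in range(n_per_class):
            features = [rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)]
            rows.append((features, label))
    return rows

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and partition them into (training, test) sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_centroids(train):
    """Fit a nearest-centroid model: the mean feature vector of each class."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Predict the class whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))

def accuracy(centroids, test):
    """Fraction of held-out rows whose predicted label matches the true label."""
    correct = sum(1 for features, label in test if predict(centroids, features) == label)
    return correct / len(test)

data = make_toy_data()
train, test = train_test_split(data)   # partition into training and test sets
model = fit_centroids(train)           # fit on the training set only
test_accuracy = accuracy(model, test)  # validate on the held-out test set
print(f"train={len(train)} test={len(test)} accuracy={test_accuracy:.2f}")
```

The mechanics really are short. What the sketch leaves out is exactly the hard part: the toy data arrives clean and with the right two features already chosen, which is rarely the case with real data.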
You can stay away from data, but for how long? If your organization hasn't utilized the power of data science yet, you are missing something very big. If not today, then certainly one day you will have to learn this new language.
Disclaimer: Opinions expressed are solely my own and do not express the views or opinions of my current or previous employers.
Please feel free to share if you like.