4th Story – Lies, Damned Lies and Statistics
Jack Raifer Baruch
Founder | Head of Data Science at ADA Intelligence. Award winner: Safe and Trustworthy AI #GESAwards. Data | Psychometrics | SocialEmotional Skills
Back in 1907, author Mark Twain wrote:
“Figures often beguile me, particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: ‘There are three kinds of lies: lies, damned lies, and statistics.’”
Although we do not know who originally wrote that last phrase, Twain made it popular, and with a particular purpose in mind: to help people understand how numbers, and hence statistics, can be manipulated into some of the most terrible lies of all.
And what happens in a discipline like Data Science, which has statistics at its core? Well, we can also lie with our models, and it can be quite simple to do so, especially when we create complex and hard-to-explain models. But this story is not about the lies, the ethics or the incredible ease with which people can be manipulated with statistics. It is about the opposite: understanding the incredible power that this mathematical discipline brings to our lives, why it is the center of Data Science and why you should learn it in all its glory.
The power of statistics can be summed up in one phrase: it allows us to understand an underlying truth with limited information. Let us unpack that understated statement. Even in today’s exabytes-per-day world, we are still not even close to having all the information in the world, far from it. This means that, when you work with data, you are only working with a subset of the real population, every single time. So, to analyze and create models with only a sample of the population, we need ways to extrapolate from those limited pieces, infer conclusions and calculate the probability (another subject for another story) that we are, to a certain level of certainty, correct, or at least close.
This is where statistics comes in: it gives us a set of tools and mechanics that allow us to come up with amazing insights and predictions from only partial data.
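To make the idea of working with "limited information" concrete, here is a minimal Python sketch, not a definitive recipe. It assumes a made-up population of simulated heights (the numbers, the sample size of 500 and every variable name are illustrative assumptions, none of it is real data) and shows how a small sample can estimate the population mean, together with a rough 95% confidence interval.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical "population": 100,000 simulated heights in cm (made-up data).
population = [random.gauss(170, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)

# In real life we never see the whole population, only a sample of it.
sample = random.sample(population, 500)
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation

# Standard error of the mean and an approximate 95% confidence interval
# (1.96 is the usual normal-approximation multiplier).
std_error = sample_sd / len(sample) ** 0.5
ci_low = sample_mean - 1.96 * std_error
ci_high = sample_mean + 1.96 * std_error

print(f"population mean (normally unknown): {true_mean:.2f}")
print(f"sample mean: {sample_mean:.2f}")
print(f"approx. 95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

Try changing the sample size: the standard error shrinks with the square root of the number of observations, so larger samples give narrower intervals.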
And that is it, that is the whole reason statistics is at the heart of data science, and why it is vital to learn all you can about it: from the Central Limit Theorem, Z-scores and t-scores, standard deviations and quantiles, to the difference between mean, median and mode and when and how to use each of them. If you have ever heard of the Six Sigma model (which offers green, yellow, and black belt certifications as if it were karate), its name refers to sigma, the standard deviation: in a normal distribution, about 99.7% of the population is within 3 standard deviations of the mean. If you understood that, great, you already comprehend some statistics; if you did not, please, please, please, start studying it right now.
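If the 99.7% figure feels like magic, it can be checked in a few lines. The sketch below uses only Python's standard library; the helper name `share_within` and the small salary list are hypothetical, made up for illustration. It computes the share of a normal distribution within 1, 2 and 3 standard deviations of the mean, a single Z-score, and shows how one extreme value pulls the mean away from the median and the mode.

```python
from statistics import NormalDist, fmean, median, mode

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")

# A Z-score says how many standard deviations a value is from the mean.
x, mu, sigma = 185, 170, 10  # illustrative numbers
z = (x - mu) / sigma
print(f"z-score of {x}: {z}")

# Made-up salaries (in thousands) with one extreme value.
salaries = [30, 32, 32, 35, 38, 40, 250]
print(f"mean: {fmean(salaries):.2f}")   # pulled up by the outlier
print(f"median: {median(salaries)}")    # middle value, robust to the outlier
print(f"mode: {mode(salaries)}")        # most frequent value
```

With skewed data like these salaries, the median is usually the more honest summary of a "typical" value, which is one reason to know when to use each of the three.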
Then there is the other particularly important part of statistics: most data visualizations come from this discipline, from Gaussian distribution (the fancy name for normal distribution) plots to histograms, box plots and my all-time favorite, violin plots (check these out: one image, TONS of information). These visualizations are the bread and butter of your work as a data scientist, from EDA (exploratory data analysis) to communicating your insights, your ideas and how your models work.
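To see how much statistics hides inside a single plot, here is a rough sketch of the numbers a box plot draws: the quartiles, the interquartile range and the points flagged as outliers. It assumes the common 1.5 × IQR rule for outliers and the default quantile method of Python's `statistics.quantiles` (plotting libraries may compute quartiles slightly differently), and the data list is made up for illustration.

```python
import statistics

# Made-up observations with one suspiciously large value.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 30]

# Quartiles split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # interquartile range: the height of the "box"

# Common convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(f"min: {min(data)}, Q1: {q1}, median: {q2}, Q3: {q3}, max: {max(data)}")
print(f"IQR: {iqr}, fences: ({lower_fence}, {upper_fence})")
print(f"outliers: {outliers}")
```

A violin plot adds the estimated shape of the distribution on top of a summary like this, which is why it packs so much information into one image.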
And here is where we go back to those lies, damned lies and statistics. Understanding how easy it is to lie with numbers, to manipulate them and to mislead others, is what showed me how important it is for every one of us to rise above, be bold and brave, learn to be clear, communicate openly and get others to benefit from all the great things we can discover.
Yes, statistics can be misused, just like anything else, but they are also the path to creating understanding and to building amazing products and services which can profoundly change the world. For me, this is the idea: to use data science and statistics to reverse the manipulation found in many systems today, to combat bias in both human and machine, and to build through machine learning the tools that can take humanity into a new age of abundance.
I hope my ramblings today did not scare you but, on the contrary, inspired you to learn more about statistics and use them for the good of data science and the community.
Coursera, Khan Academy, Udemy, Udacity and YouTube are just some of the resources where you can learn statistics, for free or paid. And I do highly recommend you read some of Alberto Cairo’s work on information design.
Hope we cross paths through our Journeys…
Jack Raifer Baruch
Follow me on Twitter: @JackRaifer
Follow me on LinkedIn: jackraifer
Next Story: Pandas, Pandas, Pandas… Libraries are your friends.
About the Road to Data Science Series
Today, I am working on the first steps of remarkably interesting projects for human development based on Data Science and Machine Learning.
But not that long ago (really, not long at all) I knew extraordinarily little about data science and much less what it all meant (and I am still learning more and more about it every day). In my quest to reinvent myself from Psychologist working in Behavioral Economics to Data Scientist, I went through an incredibly interesting journey and learned a lot. This series is mostly a letter to my past self, to help anyone like me take this amazing road and, with luck, avoid some of the mistakes I made on the way due to lack of knowledge or perspective.
Hope you enjoy my ramblings as much as I found joy on my Road to Data Science.
Need Help on your Journey?
This can be a difficult path alone, so feel free to reach out to me through LinkedIn or Twitter. I started this series because of the #66DaysOfData initiative by Ken Jee. It is a great way to connect and get support, so just check out Ken on Twitter @KenJee_DS and join the #66DaysOfData challenge.
Learning Resources I have Used:
A LOT of content, some free, most paid. Check out coupon sites where you can usually find free coupons for courses on Python, R, data science, machine learning and much more.
Interesting place to learn, they have some free courses and then paid content. Very hands on coding exercises, few videos, mostly reading.
My favorite place to learn. Thousands of courses, a lot of content on programming, Data Science and Machine Learning. The University of Michigan has many courses here for python programming from the very basics to complex things. All courses are free to audit, you only pay if you want to earn a certificate.
The top free place to learn to code. Hundreds of hours of free videos on almost any language. They now also have certifications, also for free.
The place to learn anything. All of it is free, it might take a while to get to the content you want and enjoy.
Top site for data science, also run many competitions. They have many free courses, but the programming part is scarce, some basic ones and all focused on Data Science and Machine Learning.
Similar to Codecademy, with many paths and courses. Some free content, the rest is paid. Very focused on Data Science.
My favorite place to practice code, challenges for every level from beginners to advanced. This is a good place to challenge yourself and check your progress.