How to educate yourself in Data Science
Rishi Sapra ACA, MCT, Microsoft MVP
Group Manager at Avanade | Microsoft Most Valuable Professional (MVP) | Fast Track Recognised Solution Architect (FTRSA) | Chartered Accountant (ACA) | Microsoft Certified Trainer (MCT) | Quantic Executive MBA (Hons)
I recently took part in a panel discussion/series of short presentations on ‘How to Educate yourself in data science’ at a meetup event organised by London Business Analytics Group (LBAG) and hosted at the Microsoft Reactor in London.
We had some great feedback from the event and it sparked a lot of thought-provoking discussion afterwards! For those who were unable to make it, or for those who would benefit from reflecting on this themselves, I’ve provided below a summary of the key points covered in the presentations and subsequent Q&A. The slides and a transcript of Bart’s session are available here.
First up in the lightning talk order, Aimee started off with the challenge of trying to define what data science is and settled on “Data Science is the proactive use of data and advanced analytics to drive better decision making.”
That’s clearly quite a broad definition, one that isn’t specific to any methodologies and tools/technologies, but that’s the point - data science is an incredibly broad area that requires a breadth of skills including those that aren’t always traditionally in the job spec of a data scientist. Aimee highlighted 6 key skills in particular:
· Communication
· Visualisation
· Modelling
· Programming
· Data Wrangling
· Technology
It’s pretty much impossible to find someone who’s an expert in all these areas; we specialise in the areas which we’re naturally good at (those which play to our strengths) and those which we enjoy the most. The key takeaway is that not everyone who’s a data scientist is a coder – Aimee sees her strengths more on the communication side (and from the quality of her presentation, I think we’d all agree!).
Bart also talked about the breadth of skills in Data science, recognising that the ‘sexiest job of the 21st century’ covers many different fields, data sizes and usages for that data.
So instead of starting from the beginning, start from the job you want and look at what is being asked for. You probably won’t have all the skills, so start mapping out your journey and find jobs that act as stepping stones to your dream job.
To help with this process, Bart split data science into four dimensions which he calls the ABCD of data science. Starting in (reverse!) order:
· D – Development: Development in data science is like no other discipline out there. If you are looking for a quiet life where you learn, apply and regurgitate, you will come out disappointed. Instead you have to learn how to learn and stay informed! There are some great TED talks on how to improve your learning, and it is worth signing up to a few of the data science newsletters and meetups. A few that Bart is currently subscribed to include Data Elixir, KDnuggets, Data Science Weekly and Wired (to keep abreast of technology trends generally).
· C – Capabilities: Without getting dragged into the debate over whether R, Python, Matlab (Bart’s go-to tool) or any of the other dozens of languages is the best, Bart highlighted that whilst you do need to learn at least one analytic programming language well, the skill is not so much in knowing the language but rather in being able to express yourself in it. In the same way as having a paintbrush doesn’t make you a Rembrandt, knowing a language doesn’t mean you can use it. This skill of expressing your thoughts and ideas via the medium of the programming language is what is crucial and, luckily, a transferable skill across multiple languages! In terms of which one to start with, Python has become one of the most popular and is a good bet because it has a huge range of libraries for analytics and is widely supported for integration with other platforms. It is also a simple language with only a few basic rules, making it easy to learn. Don’t underestimate the time commitment required (learning a skill is expensive, with some claiming you need 10,000 hours to get good at it!) or the value of mastering the fundamentals; Bart highlighted that one of the first skills which has helped him in his career is touch-typing! Whether you are coding, writing presentations, creating dashboards or manipulating data, the keyboard is going to be your instrument, so learn how to play it!
· B – Beliefs: Belief may sound a little out there, but it stands for intuition, the underlying knowledge, the theories, maths, information theory, system theory, learning theory, even theory of evolution. There is a vast universe of knowledge out there that helps with organising and structuring information and that can provide inspiration. Here again you can leverage your existing knowledge and use it in increasingly creative ways. As a foundation, however, you cannot do without some knowledge of linear algebra and statistics and some learning methods. What has also become more and more clear is that for data scientists to excel, they require domain knowledge, so having worked in a specific sector can really help you. It helps with understanding the data you have, how to interpret it and the mechanisms behind the scenes that generate the data. All valuable parts in trying to use data.
· A – Attitude: The ability to communicate what you know in a pertinent way for the business, being able to simplify it enough so you don’t need a PhD just to understand what was done and carefully listening to what is asked (and then answering with what they want to know) will make the biggest difference in your career progression in most companies. The attitude of the scientist in ‘data scientist’ makes a huge difference in the analytics part too. Science has progressed as much as it has because it starts with not knowing and being curious to find out. Only when you don’t know something are you open to new discoveries and insights. Being open and curious allows for the best way of ‘letting the data talk’ instead of imposing your own beliefs on the data.
Rishi Sapra (Senior BI Consultant, Altius)
My advice on how to educate yourself in data science is actually to not start with educating yourself in data science, but to start instead with educating yourself in the more fundamental skill sets of data analysis and reporting. This is not because these are more interesting topics, or because they have better career prospects - data science probably trumps them on both fronts – but rather because in my view every company, and correspondingly every data analyst through their career, needs to respect the Analytics Maturity Curve, an example of which is here:
What this shows is the type of data analysis you start with performing in a company, and then, as the organisation develops a more data-driven culture, as data becomes more embedded within every team and the data becomes good quality and trusted, firms start to be able to perform more complex types of analysis and as a result move further along the maturity curve.
At the most advanced end of the curve, on the far right, we see things like machine learning and decision automation. This is the bread and butter of companies like Microsoft, Amazon and Google on one hand and tech focused start-ups on the other; there are a lot of businesses where their entire business model is around harnessing cognitive data analytics.
And if you read the tech media, you’d be forgiven for thinking that that’s what every company in today’s age is doing, but the reality is that many of the traditional bricks and mortar companies – including the vast majority of the companies which make up the FTSE 100 today – aren’t doing this kind of thing on any real scale just yet. They’re dipping their toes in with small isolated projects that they then try and make mainstream (and at Altius we have a thriving data science team to help them do this!), but for the majority of my clients, data science isn’t yet an integral part of their business model.
And I think there are 3 key reasons for this:
1. The data within the organisations isn’t yet centralised; it sits in pockets across the company, usually in some combination of a vast number of corporate databases on one hand, and spreadsheets on people’s computers on the other.
2. The data isn’t always in the format or of the quality required for this kind of analytics. The sources which capture this data aren’t typically sophisticated enough to capture the rich set of attributes that would be needed to put it through a machine learning algorithm. The truth with cognitive analytics is that it’s only ever as good as the data feeding it – no matter how clever the logic is, it’s not exempt from the universal rule of ‘Garbage in, Garbage out’.
3. It’s still ‘Dark data’. This is a term coined by Professor Whitehorn, a popular previous speaker at LBAG, who uses it to describe data that is invisible in the organisation. It may exist in transactional systems, but because it hasn’t yet been used by the business for basic analytics and reporting, it hasn’t yet earnt the trust of the stakeholders. The value in this data hasn’t yet been proven for simple use cases, so to try and use it in more complex ones such as machine learning and decision automation is too risky; no-one’s going to trust a decision automation tool which uses data that hasn’t at least been presented to management for human decision making first!
So if those scenarios exist in the organisation you work for, and you want to help the company overcome those issues and become more mature from a data analytics perspective, where do you start?
Well, the truth is, you start as low down the curve as you need to. And there’s usually work to be done at every stage of this curve. Start in one particular team, getting hold of a sample of this dark data, a few CSV extracts from the transactional database maybe, and put it into a reporting and analytics tool – a data visualisation tool like Power BI, Tableau or QlikView works well for this. The data can be cleaned, transformed and modelled either using the data visualisation tool itself or a third-party ETL/data wrangling tool like Alteryx. Once you have clean data ripe for insights, it’s also very quick using these tools to create business insights – the second point on the first element. You can do historical analysis and highlight the distributions of the data by various different attributes – over time, or by region or product category; whatever type of descriptive attributes you have in your data.
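To make that first step concrete, here is a rough sketch in plain Python of what "take a CSV extract, clean it, and break it down by a descriptive attribute" can look like before it ever reaches a BI tool. The column names and figures are invented for illustration, and the helper names `load_clean_rows` and `total_by` are hypothetical; a tool like Power BI or Alteryx would do the equivalent through its own interface.

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV extract; the columns and values are invented for illustration.
SAMPLE_EXTRACT = """region,product_category,amount
North,Widgets,120.50
 north ,Gadgets,80
South,Widgets,
South,Gadgets,200.25
"""


def load_clean_rows(text):
    """Parse the CSV text, tidy the labels, and drop rows with no usable amount."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        amount = (raw["amount"] or "").strip()
        if not amount:
            continue  # skip rows where the amount is missing
        rows.append({
            "region": raw["region"].strip().title(),
            "product_category": raw["product_category"].strip().title(),
            "amount": float(amount),
        })
    return rows


def total_by(rows, attribute):
    """Sum the amount for each distinct value of a descriptive attribute."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[attribute]] += row["amount"]
    return dict(totals)


rows = load_clean_rows(SAMPLE_EXTRACT)
print(total_by(rows, "region"))
print(total_by(rows, "product_category"))
```

Even on four rows, the cleaning step matters: the inconsistently typed " north " is folded into "North", and the row with no amount is left out rather than silently counted as zero.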
All the while you do this, you try and embed a better understanding of the data amongst the business stakeholders – you establish them as the owners of the data, you provide them data dictionaries and give them access to use the data models you build in tools such as Excel with which they’re most familiar. Most importantly you ensure you have a single version of the truth and you restrict who can see the data by deploying proper Governance.
Once you’ve started to embed self-serve analytics in the organisation, and users start to not just understand but also trust the data, then you can look at things like benchmarking and more advanced reporting. People start to place reliance on the data, and use these tools to alert them automatically when a KPI crosses a certain threshold – that’s something I know you can set up in Power BI and probably in other tools too. And at this stage you’ve productionised the data feeds; they sit in a data platform which is maintained by IT under SLAs.
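The logic behind a threshold alert is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not how Power BI implements its alerts; the KPI names, values and thresholds are made up, and `breached_kpis` is a hypothetical helper.

```python
def breached_kpis(latest_values, thresholds):
    """Return the names of KPIs whose latest value is above its threshold."""
    return sorted(
        name
        for name, value in latest_values.items()
        if name in thresholds and value > thresholds[name]
    )


# Invented example figures, for illustration only.
latest_values = {"churn_rate_pct": 6.2, "avg_delivery_days": 2.8}
thresholds = {"churn_rate_pct": 5.0, "avg_delivery_days": 3.0}

for name in breached_kpis(latest_values, thresholds):
    print(f"ALERT: {name} is above its threshold")
```

The point is less the code than the precondition: an alert like this is only useful once people trust the number it is watching.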
Then when the data is being used for business insights, and is understood and trusted by the business, you can start to perform some more advanced statistical analytics on it. Again, using Power BI as an example, you can start to embed bits of R code into the data prep or the visualisation level to get (for example) predictive analytics on it, and you can supplement your report with a tool like Power Apps to do scenario planning.
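As a flavour of what a small embedded predictive script might do, the sketch below fits a straight-line trend to a short series and projects it one period ahead. It is written in Python rather than R, uses ordinary least squares as a stand-in for whatever model you would actually choose, and the sales figures and function names are invented for illustration.

```python
def fit_linear_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; needs at least 2 points."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast_next(values):
    """Project the fitted trend one period beyond the end of the series."""
    slope, intercept = fit_linear_trend(values)
    return intercept + slope * len(values)


# Invented monthly sales figures, for illustration only.
monthly_sales = [100.0, 110.0, 120.0, 130.0]
print(forecast_next(monthly_sales))
```

A real forecast would need to account for seasonality and uncertainty, but a simple trend line like this is often the first "predictive" visual a business sees.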
Finally, you can start to get serious with the cool data science stuff. You can productionise your R or Python scripts, or write them from scratch in a UI, using a tool like Azure Machine Learning. ML tools like this give you the ability to run scripts on your data as it’s feeding through the data platform and obtain the results by exposing the logic as an API, which you can even call from Excel. Now you’re firmly in the realm of changing the fundamental business model of the organisation using cognitive analytics.
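"Exposing the logic as an API" boils down to a function that accepts a request, applies the model and returns a response. The sketch below shows that shape in plain Python with JSON in and JSON out. It does not use the Azure Machine Learning SDK; the `MODEL` coefficients, the `period` field and the `handle_score_request` name are all assumptions made up for the example.

```python
import json

# Hypothetical model: the coefficients are invented for illustration.
MODEL = {"intercept": 100.0, "slope": 10.0}


def handle_score_request(body):
    """Take a JSON request body, apply the model, and return a JSON response."""
    payload = json.loads(body)
    prediction = MODEL["intercept"] + MODEL["slope"] * payload["period"]
    return json.dumps({"prediction": prediction})


print(handle_score_request('{"period": 4}'))
```

A hosted ML service wraps a function like this in a web endpoint with authentication and scaling, which is what lets a spreadsheet or report call it.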
To summarise, if you’re lucky enough to work for a company and in a team that is already at the far-right end of the data analytics maturity curve then that’s great; get stuck in. If the company you work for, or the clients you work with as a consultant, need to get some of the fundamentals right first then go on that journey with them. The skills that you’ll hone going through this curve are valuable in their own right for your career as a data scientist or data analyst. Moreover, getting some of these fundamentals right first will make the job of data science much easier because you’ll have good quality, trusted data to work with.
So you don’t need to have the skills to do all this cool, complex stuff straight away; an awareness of what’s involved is helpful to keep your eye on the path of where you’re going to, but building on some of the softer skills around business analysis, data governance, data prep and data modelling are going to help you along the way.
Noor’s presentation was full of practical advice on how to learn the fundamental skills of data science, including getting to grips with the often dreaded statistical elements.
He started with a Venn diagram showing data science at the intersection of Maths/Stats, ‘Hacking’ skills and domain expertise. As this illustrates, building coding experience in data science without having the fundamental stats knowledge puts you in the danger zone!
Noor also emphasised that learning data science is a continuous journey. He’s currently completing a structured classroom learning course at Birkbeck University (good luck with your exams Noor!) and talked fondly of not just the skills he’s learnt but the whole experience of building a network, making friends and being able to learn from others’ experiences of practical applications of data science. Despite the ease of access and low cost of online courses (many excellent ones are available for free), I agree with Noor that there is still value in traditional classroom-based learning!
Massive Open Online Courses (MOOCs) are freely available and should be used to supplement learning too, though. Noor pointed out some of the key (Microsoft-orientated) data science ones on edX including ‘Principles of Machine Learning’, ‘Introduction to Python: fundamentals’ and ‘Analysing Big Data with Microsoft R’.
DataCamp is a very popular learning platform for data science. There are some basic introductory courses available for free and you can have access to the full range for a small monthly subscription fee, which is well worth it! The advantage of DataCamp from what I’ve seen is the ability to be hands-on with coding using an online code editor, and the availability of courses which combine industry knowledge with data science fundamentals (e.g. ‘Applied Finance with R’). As Bart highlighted, just knowing how to code in R or Python isn’t enough; you need the domain knowledge to know what you need to achieve through the coding too!
Apart from that, the depth of content available through books is valuable, at least as a reference if not for a full sit-down read. Books are particularly good for picking up the stats/maths knowledge (rather than just watching someone present on the topic) and Noor pointed out a great one that’s available for free as a PDF on the internet here.
For printed books, any of the ones from O’Reilly Media work well, including ‘Python Data Science Handbook’, ‘Data Science from Scratch’ and ‘Data Algorithms’.
Lastly, Noor shared his top secret tip: find a mentor/guru to learn from. I wouldn’t say I have one myself, but my experience of reaching out to people (particularly senior folk) is that they’re flattered to be asked to share their experiences informally through regular catch-ups and want to give back to the community they learnt from themselves. So don’t be afraid to reach out to people (e.g. on LinkedIn) who have gone down a similar path to the one you’re on.