From Analysts to Data Scientists


Statement of Purpose

The goal of this article is to provide some clarity around the expectations of a data scientist, and highlight the skills, tools, and techniques that are essential to Doing The Job.

To that end, this document contains some detailed thoughts about what data scientists do, and how one can learn to do those things (with handy links included).

As always, this is a living document: it can, and should, be updated often.

Overall Areas of Focus

In the first year of data science development one might expect to touch upon a reasonable chunk of the following:

  • Data Science languages: Python/R
  • Relational databases: MySQL, Postgres, Aurora, and others
  • Non-relational databases: MongoDB, Hadoop, Snowflake, and others
  • Machine learning models: e.g. Regression, Boosted Trees, SVM, NNs
  • Graph: Neo4j, GraphX
  • Distributed computing: Hadoop, Spark
  • Cloud: GCP/AWS/Azure
  • API Interaction: OAuth, REST
  • Data Visualisation and Webapps: D3, R Shiny
  • Specialist fields: NLP, OCR and Computer Vision
  • Software Engineering: Docker, Writing Modular and Reusable Code, Version Control, Testing, Logging

Plan of Action:

How to transition from Analyst to Data Scientist

First of all, start with the right philosophy. Ten years ago it may have been acceptable to wait weeks to be sent on a data software course. Those days are long gone. There are fantastic materials everywhere. Make learning constant. Practice skills continually.

Data Science is a mix of practical skills (like software development) and intangible dark arts (like knowing how to ask a good question). Below are some of the steps one can take to prepare for and grow into a data science role.

  1. Learn to Ask (Meaningful) Questions: This is part of the “science” element! Note that “meaningful” doesn’t mean “brilliant”; it means relevant and sharp. What is a sharp question? Here’s a version provided by Microsoft, adapted from Brandon Rohrer:

A vague question doesn’t have to be answered with a name or a number. A sharp question must.

Imagine you found a magic lamp with a genie who will truthfully answer any question you ask. But it’s a mischievous genie, who will try to make their answer as vague and confusing as they can get away with. You want to pin them down with a question so airtight that they can’t help but tell you what you want to know.

If you were to ask a vague question, like “What’s going to happen with my stock?”, the genie might answer, “The price will change”. That’s a truthful answer, but it’s not very helpful.

But if you were to ask a sharp question, like “What will my stock’s sale price be next week?”, the genie can’t help but give you a specific answer and predict a sale price.

So, the basic idea is that your questions should have explicit answers: numbers or labels that specifically capture the thing you want to know. The hard part is learning how to do this all the time, but it’s really about practice.

2. Learn a Programming Language: Programming is a necessity in Data Science, and Python and/or R is where most people start. There are many great free training courses on sites such as Coursera and Udemy.

Programming Languages:

Udemy Python Free Courses

Udemy R Free Courses

On the numeric side:

Andrew Ng machine learning course

Stanford neural network course

AWS Machine Learning Courses: https://aws.amazon.com/training/learning-paths/machine-learning/

3. Develop your statistical skills: Of course, it’s not just about programming; there are some core statistical and mathematical principles one should learn. These include probability, data distributions, and linear algebra. The combination of programming skills and statistical knowledge is paramount to being able to explore and understand the data you have (and what you can do with it).

Statistical Principles & Concepts:

  1. 5 Basic Statistics Concepts for Data Scientists
  2. A Gentle Introduction to Statistical Data Distributions
  3. Probability Distributions
  4. Bayes Theorem Explained
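
As a small illustration of why these principles matter in practice, here is a minimal sketch of Bayes' theorem in Python. The scenario and all the numbers (a 1% base rate, a test with 95% sensitivity and a 5% false-positive rate) are made up for illustration, and the function name `posterior` is just a label chosen here.

```python
def posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """Return P(condition | positive test) using Bayes' theorem."""
    # P(positive) = P(pos | cond) * P(cond) + P(pos | no cond) * P(no cond)
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence


# Hypothetical numbers: 1% base rate, 95% sensitivity, 5% false-positive rate.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(condition | positive test) = {p:.3f}")  # prints 0.161
```

Even with a fairly accurate test, a positive result here implies only about a 16% chance of having the condition, because the condition is rare to begin with. This kind of base-rate reasoning is easy to get wrong without the underlying statistics.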


4. Use effective ways of working: Once the foundations are laid, start improving your workflow by using version control (Git, hosted on a platform such as GitHub) for deploying and maintaining code. Consider Docker for containerisation.

  1. Understanding Docker and performing Selenium automation
  2. Get started with GitHub


5. Utilize software development best practices: These will make your workflow more efficient through standardized processes; they will also help you understand how to implement your work and foresee implementation issues.

  1. Meeshkan Machine Learning and GitHub API
  2. Programming Best Practices
  3. Best Python 3 Programming Basics
  4. R Courses on Udemy
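
To make "best practices" a little more concrete, the sketch below shows one possible shape for a small, reusable data-cleaning function with logging and a test. The function name `clean_ages`, the 0 to 120 range, and the sample values are all hypothetical and chosen only for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def clean_ages(raw_ages: list) -> list[float]:
    """Keep values that parse as a plausible age (0-120); log whatever is dropped."""
    cleaned = []
    for value in raw_ages:
        try:
            age = float(value)
        except (TypeError, ValueError):
            logger.warning("Dropping non-numeric age: %r", value)
            continue
        if 0 <= age <= 120:
            cleaned.append(age)
        else:
            logger.warning("Dropping out-of-range age: %r", value)
    return cleaned


def test_clean_ages() -> None:
    # A small test that documents the expected behaviour.
    assert clean_ages(["34", 51, "n/a", -3, None, 200]) == [34.0, 51.0]


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    test_clean_ages()
    logger.info("All checks passed")
```

Keeping the logic in one small function makes it easy to reuse across notebooks and scripts, the log messages leave a record of what was discarded, and the test guards against accidental changes in behaviour.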


6. Learn to build Machine Learning models: Ideally, this would come up organically in your work, but because there are no guarantees, you may need to look elsewhere for practice building, testing, and evaluating models. Enter Kaggle. Kaggle tasks are already scoped and cleaned, but there are few better ways to improve model-building skills than to work on a challenging problem alongside a few thousand other people. Don’t worry about rank. Start with a playground competition and work up from there. Everyone starts somewhere, so just give it a try.

Lots of people in Kaggle competitions share their code, as well, so it’s a great place to be introduced to new code, techniques, methods, or styles of programming.
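
For a feel of the build, test, and evaluate loop before entering a competition, here is a minimal sketch using only the Python standard library. The data is synthetic (generated from a made-up line plus noise), and the helper names `fit_line` and `rmse` are chosen here for illustration; in practice one would more likely reach for a library such as scikit-learn.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return statistics.fmean([(a - p) ** 2 for a, p in zip(actual, predicted)]) ** 0.5


rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic data (made up): y = 3x + 2 plus Gaussian noise.
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3 * x + 2 + rng.gauss(0, 1) for x in xs]

# Hold out 25% of the rows for evaluation.
indices = list(range(len(xs)))
rng.shuffle(indices)
split = int(0.75 * len(indices))
train_idx, test_idx = indices[:split], indices[split:]

# Fit on the training rows only.
slope, intercept = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])

# Compare the model with a naive baseline that always predicts the training mean.
test_y = [ys[i] for i in test_idx]
model_pred = [slope * xs[i] + intercept for i in test_idx]
baseline_pred = [statistics.fmean([ys[i] for i in train_idx])] * len(test_idx)

model_rmse = rmse(test_y, model_pred)
baseline_rmse = rmse(test_y, baseline_pred)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"model RMSE: {model_rmse:.2f}  baseline RMSE: {baseline_rmse:.2f}")
```

The two habits worth carrying into any competition are visible here: score the model on data it never saw during fitting, and compare it against a simple baseline so you know whether it is actually adding anything.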

7. Follow leaders in the field: Some like to refer to the “rock stars of data science”. This group makes fascinating contributions that are well worth your time. Keep an eye out for the likes of Geoffrey Hinton, Andrew Ng, Yann LeCun, Rachel Thomas, and Jeremy Howard.

Follow some great DS influencers, and share what you find interesting!

Attend a conference

8. Learn how to Communicate effectively: We need to be able to sell our work. In reality, a Data Science project is often only “finished” when someone else has seen it! People love a shiny demo, so always work towards something you can show at the key presentation.

Tips for Presenting to an Audience:

  1. Data Scientist’s Guide to Communicating Results
  2. Communicating data science: A guide to presenting your work

Tools for showing off your findings:

  1. R Shiny
  2. Python Dash/Plotly
  3. D3 JS
  4. HTML + CSS
  5. Dashboards in Looker, Tableau, or similar tools

This also includes documentation: judicious use of README.md files in git repos, or wiki entries discussing plans, results, and/or next steps.

Bringing it all together to solve a problem

Identify and solve a real problem at Jellyvision, working alongside business experts and data engineers. We are looking for projects that fulfill these components:

Identify a real problem and formulate a question or series of questions that can be solved with data science.

Scope this problem and identify a solution

  • Determine and obtain the necessary data
  • Solution should involve a machine learning model

Ability to iteratively work through the solution

Ability to implement and present the solution

The solution is actually used by the stakeholders and/or used to solve the problem

The solution is presented to the organization (with the impact)

Qualities of a Data Scientist*

*According to my experience

Curiosity: Naturally curious people are the best scientists. These are the folks who, upon learning something new, seek to understand why and how that thing is true. People like this tend to generate more questions than answers, which can be a good thing…

Organization: … but they also need to know when to break those questions down into smaller pieces, and when to stop asking questions altogether. The downside of curiosity is that it can be infinite, so learning when to ask another question, and when to shelve it for another day, is invaluable.

Creativity: This is the other side of the curiosity coin. Creativity is necessary because Data Scientists are often asked to do things that haven’t been done before, or they need to adapt a technique from one field to a very different one (e.g., using techniques for predicting earthquakes to predict OTHER rare events). Often, this requires thinking through problems deeply and laterally; the answer is not usually clear, so creativity is needed to get through it.

Technical Prowess: Obviously, you need to be able to build the things you need to build. You don’t need to have an engineering background, but you need the technical know-how to prototype and test. While most discussion within data science focuses on Python or R knowledge, this isn’t about any one skill, really; it’s about being adaptable to new technologies, and using the ones you know efficiently.

Healthy Skepticism: I usually say that in data science, if it looks good, it’s probably wrong. This is the annoying toddler form of curiosity, where you never stop asking why. By maintaining a healthy skepticism, you don’t just accept an answer; you unpack it and try to understand it fundamentally, so that you can provide as-concrete-as-you-can conclusions.

Communication Skills: The unsung hero of Data Science. Being able to communicate a result effectively is a supremely useful, and strangely overlooked, skill for a data scientist. One ultimately should strive to be able to present a result or model to a random person on the street, and have them understand what you’re saying. This can be the difference between what you’ve done being widely accepted, or put on a shelf to collect dust.


You may see a pattern forming here: curiosity and organization help us to frame the question; creativity and technical prowess are required to get the answer; healthy skepticism helps you determine whether your answer is ‘right’; communication gets the thing in front of the folks who need to know about it!

It’s important to note that these are not replacements for the Data Team Competencies; they are supplemental, and are meant to provide insight into what we believe makes for a successful data scientist.
