From Analysts to Data Scientists
Statement of Purpose
The goal of this article is to provide some clarity around the expectations of a data scientist, and highlight the skills, tools, and techniques that are essential to Doing The Job.
To that end, this document contains some detailed thoughts about what data scientists do, and how one can learn to do those things (with handy links included).
As always, this is a living document; it can, and should, be updated often.
Overall Areas of Focus
In the first year of data science development one might expect to touch upon a reasonable chunk of the following:
- Data Science languages: Python/R
- Relational databases: MySQL, Postgres, Aurora, and others
- Non-relational databases: MongoDB, Hadoop, Snowflake, and others
- Machine learning models: e.g. Regression, Boosted Trees, SVMs, NNs
- Graph: Neo4j, GraphX
- Distributed computing: Hadoop, Spark
- Cloud: GCP/AWS/Azure
- API Interaction: OAuth, REST
- Data Visualisation and Webapps: D3, R Shiny
- Specialist fields: NLP, OCR and Computer Vision
- Software Engineering: Docker, Writing Modular and Reusable Code, Version Control, Testing, Logging
Plan of Action:
How to transition from Analyst to Data Scientist
First of all, start with the right philosophy. Ten years ago it may have been acceptable to wait weeks to be sent on a data software course. Those days are long gone. There are fantastic materials everywhere. Make learning constant. Practice skills continually.
Data Science is a mix of practical skills (like software development) and intangible dark arts (like knowing how to ask a good question). Below are some of the steps one can take to prepare for and grow into a data science role.
1. Learn to Ask (Meaningful) Questions: This is part of the “science” element! Note that “meaningful” doesn’t mean “brilliant”; it means relevant and sharp. What is a sharp question? Here’s a version provided by Microsoft, adapted from Brandon Rohrer:
A vague question doesn’t have to be answered with a name or a number. A sharp question must.
Imagine you found a magic lamp with a genie who will truthfully answer any question you ask. But it’s a mischievous genie, who will try to make their answer as vague and confusing as they can get away with. You want to pin them down with a question so airtight that they can’t help but tell you what you want to know.
If you were to ask a vague question, like “What’s going to happen with my stock?”, the genie might answer, “The price will change”. That’s a truthful answer, but it’s not very helpful.
But if you were to ask a sharp question, like “What will my stock’s sale price be next week?”, the genie can’t help but give you a specific answer and predict a sale price.
So, the basic idea is that your questions should have explicit answers: numbers or labels that specifically capture the thing you want to know. The hard part is learning how to do this all the time, but it’s really about practice.
2. Learn a Programming Language: Programming is a necessity in Data Science, and Python and/or R is where most people start. There are many great free training offerings on sites such as Coursera and Udemy.
Programming Languages:
On the numeric side:
AWS Machine Learning Courses: https://aws.amazon.com/training/learning-paths/machine-learning/
3. Develop your statistical skills: Of course, it’s not just about programming; there are some core statistical and mathematical principles one should learn. These include things like probability, understanding data distributions, and linear algebra. The combination of programming skills and statistical knowledge is paramount to being able to explore and understand the data you have (and what you can do with it).
Statistical Principles & Concepts:
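As a small illustration of the probability and distribution ideas above, here is a minimal sketch using only Python's standard library. The data is simulated and the "order value" framing is a made-up assumption for the example; the point is the workflow of summarising a sample, fitting a simple distribution, and checking it against what was observed.

```python
import random
import statistics

# Hypothetical sample: 1,000 simulated order values (not real data).
random.seed(42)
orders = [random.gauss(100, 15) for _ in range(1000)]

mean = statistics.mean(orders)
stdev = statistics.stdev(orders)

# Model the sample as a normal distribution and ask a sharp question:
# "What is the probability that an order exceeds 130?"
model = statistics.NormalDist(mu=mean, sigma=stdev)
p_over_130 = 1 - model.cdf(130)

# Compare the model's answer with the share actually observed in the sample.
observed = sum(value > 130 for value in orders) / len(orders)

print(f"mean={mean:.1f}, stdev={stdev:.1f}")
print(f"P(order > 130): model={p_over_130:.3f}, observed={observed:.3f}")
```

If the modelled and observed probabilities disagree badly, that is a hint the normal assumption does not fit the data, which is exactly the kind of thing statistical training helps you notice.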
4. Use effective ways of working: Once the foundations are laid, start improving your workflow by using a version control system such as Git (hosted on a platform like GitHub) for maintaining and deploying code. Consider Docker for containerisation.
5. Utilize software development best practices: These will make your workflow more efficient through standardized processes; they will also help you understand how to implement your work and foresee implementation issues.
6. Learn to build Machine Learning models: This would, ideally, come up organically in your work, but because there are no guarantees, you may need to look elsewhere for practice building, testing, and evaluating models. Enter Kaggle. Kaggle tasks are scoped and cleaned, but there is no better way to improve model building skills than to do a challenging problem alongside a few thousand other people. Don’t worry about rank. Start with a playground competition and work up from there. Everyone starts somewhere, so just give it a try.
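To make the practices from the focus list (modular and reusable code, testing, logging) a little more concrete, here is a minimal sketch in Python. The function name `clean_ages` and the 0 to 120 valid range are hypothetical choices for illustration, not anything prescribed by this article.

```python
import logging

logger = logging.getLogger(__name__)


def clean_ages(raw_ages):
    """Return valid ages as ints, skipping values that cannot be used."""
    cleaned = []
    for value in raw_ages:
        try:
            age = int(value)
        except (TypeError, ValueError):
            # Log instead of failing silently, so bad inputs can be traced later.
            logger.warning("Skipping non-numeric age: %r", value)
            continue
        if 0 <= age <= 120:  # assumed plausible range for this example
            cleaned.append(age)
        else:
            logger.warning("Skipping out-of-range age: %r", age)
    return cleaned


def test_clean_ages():
    # A small unit test; a runner such as pytest would discover it by name.
    assert clean_ages(["34", 51, "n/a", None, -3, 200]) == [34, 51]


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    test_clean_ages()
    print(clean_ages(["18", "abc", 42]))
```

Keeping the cleaning logic in one small function means it can be reused across notebooks and pipelines, and the test documents the behaviour you expect when the input data gets messy.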
Lots of people in Kaggle competitions share their code, as well, so it’s a great place to be introduced to new code, techniques, methods, or styles of programming.
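For a feel of the build, test, and evaluate loop before entering a competition, here is a minimal sketch using only the Python standard library. The dataset is simulated, and the single-feature least-squares line stands in for whatever model you would actually use (in practice most people reach for a library such as scikit-learn). The habit worth copying is holding out data and comparing against a naive baseline.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: y depends linearly on x, plus noise (not a real dataset).
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 3.0 * x + 5.0 + random.gauss(0, 2)) for x in xs]

# Hold out 25% of the rows so the model is scored on data it has not seen.
random.shuffle(data)
split = int(len(data) * 0.75)
train, test = data[:split], data[split:]


def fit_line(rows):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_mean = statistics.mean(x for x, _ in rows)
    y_mean = statistics.mean(y for _, y in rows)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in rows)
    variance = sum((x - x_mean) ** 2 for x, _ in rows)
    slope = covariance / variance
    return slope, y_mean - slope * x_mean


def mean_absolute_error(rows, predict):
    """Average absolute gap between actual and predicted values."""
    return statistics.mean(abs(y - predict(x)) for x, y in rows)


slope, intercept = fit_line(train)
train_y_mean = statistics.mean(y for _, y in train)

# Score the fitted line and a "always predict the training mean" baseline.
model_mae = mean_absolute_error(test, lambda x: slope * x + intercept)
baseline_mae = mean_absolute_error(test, lambda x: train_y_mean)

print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
print(f"test MAE: model={model_mae:.2f}, baseline={baseline_mae:.2f}")
```

A model that cannot beat the baseline on held-out data is not yet worth presenting, however good it looks on the training set.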
7. Follow leaders in the field: Some like to refer to the “rock stars of data science”. This group makes fascinating contributions that are well worth your time. Keep an eye out for the likes of Geoffrey Hinton, Andrew Ng, Yann LeCun, Rachel Thomas, and Jeremy Howard.
Follow some great DS influencers, and share what you find interesting!
Attend a conference
8. Learn how to Communicate Effectively: We need to be able to sell our work. In reality, a Data Science project is often only “finished” when someone else has seen it! People love a shiny demo, so always work towards something you can show at the key presentation.
Tips for Presenting to an Audience:
Tools for showing off your findings:
- R Shiny
- Python Dash/Plotly
- D3.js
- HTML + CSS
- Looker / Tableau / etc. dashboards
This also includes documentation: judicious use of README.md files in git repos or wiki entries discussing plans, results, and/or next steps.
Bringing it all together to solve a problem
Identify and solve a real problem at Jellyvision, working alongside business experts and data engineers. We are looking for projects that fulfill these components:
Identify a real problem and formulate a question or series of questions that can be solved with data science.
Scope this problem and identify a solution
- Determine and obtain the necessary data
- Solution should involve a machine learning model
Ability to iteratively work through the solution
Ability to implement and present the solution
The solution is actually used by the stakeholders and/or used to solve the problem
The solution is presented to the organization (with the impact)
Qualities of a Data Scientist*
*According to my experience
Curiosity: Naturally curious people are the best scientists. These are the folks who, upon learning something new, seek to understand the why and how that thing is true. People like this tend to generate more questions than answers, which can be a good thing…
Organization: … but they also need to know when to break those questions down into smaller pieces, and when to stop asking questions altogether. The downside of curiosity is that it can be infinite, so learning when to ask another question, and when to shelve it for another day, is invaluable.
Creativity: This is the other side of the curiosity coin. Creativity is necessary because Data Scientists are often asked to do things that haven’t been done before, or they need to adapt a technique from one field to a very different one (e.g., using techniques for predicting earthquakes to predict OTHER rare events). Often, this requires thinking through problems deeply and laterally. The answer is not usually clear, so creativity is needed to get through it.
Technical Prowess: Obviously, you need to be able to build the things you need to build. You don’t need to have an engineering background, but you need the technical know-how to prototype and test. While most discussion within data science focuses on Python or R knowledge, this isn’t about any one skill, really; it’s about being adaptable to new technologies, and using the ones you know efficiently.
Healthy Skepticism: I usually say that in data science, if it looks good, it’s probably wrong. This is the annoying toddler form of curiosity, where you never stop asking why. By maintaining a healthy skepticism, you don’t just accept an answer; you unpack it and try to understand it fundamentally, so that you can provide as-concrete-as-you-can conclusions.
Communication Skills: The unsung hero of Data Science. Being able to communicate a result effectively is a supremely useful, and strangely overlooked, skill for a data scientist. One should ultimately strive to be able to present a result or model to a random person on the street, and have them understand what you’re saying. This can be the difference between what you’ve done being widely accepted, or put on a shelf to collect dust.
You may see a pattern forming here: curiosity and organization help us to frame the question; creativity and technical prowess are required to get the answer; healthy skepticism helps you determine whether your answer is ‘right’; communication gets the thing in front of the folks who need to know about it!
It’s important to note that these are not replacements for the Data Team Competencies. They are supplemental, and are meant to provide insight into what we believe makes for successful data scientists.