So you want to be a Data Scientist?
Data Science continues to be the hottest career trend for technically minded people. With the rise of machine learning and businesses continuing to develop data-driven, in-house analytics, it is no surprise that data science is receiving so much attention.
So you want to be a Data Scientist? I will assume that you have some STEM technical background that can help you to make the transition from your current career focus. As I have done with many early career Data Scientists, I want to discuss some of the potential pitfalls before you make the leap and offer a little advice.
The first is that Data Science as a profession is still so new that it lacks formal criteria or definition and, worse, it spans a wide spectrum of sub-domains. So the more precise question is probably: which kind of data scientist do you want to be?
Another aspect worth considering is that Data Science requires a supporting cast before a company can even “do data science.” That cast covers management and transport of the data, cleaning and preparing the data, and understanding the domain in which we are trying to solve problems. All of these combined create an effective Data Scientist or Data Science program at a company.
The data science support system is a necessary but not sufficient condition for productivity. Companies rely on talented Data Scientists for the analytics skills required to extract value from that system. When looking at a career in data science, it is important to match your own skill set against the support capacity of your prospective company. For instance, a startup will not have the resources for multiple supporting roles, whereas a Fortune 1000 company might, depending upon the size of its investment. Therefore, depending upon your own skills, it might be more appropriate to start your career (or your learning) in different areas.
My advice is that you need to be realistic about your skill set. If you have a technical background in which you did a lot of statistics, that’s a great start to being a data scientist in a larger team. Combining that with good software engineering skills and data pipelining, on the other hand, could net you a position as the sole data scientist at an early stage startup. If there is a mismatch, then both you and the company will be disappointed in the results.
What kinds of skills are required for the full-stack data scientist?
- Statistics: At its core all data science requires an excellent grounding in statistics. Regardless of the type of data science role, our analytics are measured in statistical terms based upon accuracy, precision, recall etc. [A]
- Software: I do not mean, “I know MATLAB”. That is just programming. Software involves things like teams, structures, version control, tests, containers, and deployment. The kind of code you wrote in a PhD program for your dissertation is not what a company is looking for. My general advice: first become an excellent software engineer, then become a data scientist. [B]
- Domain: You will not necessarily know the domain going into a company. But you have to be willing and able to understand the problem domain you are working in. If you lack the ability to have an intelligent conversation with the people that might benefit from your analytics, then you will either fail or you will need a support person to assist you.
- Big Data (Optional): Understand how to handle data and processing that do not fit nicely onto your workstation. Pipelines, Map-Reduce, and the like are the tools of the trade for doing data science at scale. Your differentiator here is the ability to “deploy” your methods and models into these systems without DevOps support. The team will thank you. [C]
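As a rough illustration of the Map-Reduce pattern named in the Big Data bullet, here is a minimal single-machine sketch in Python. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample records are hypothetical, chosen only for illustration. A real framework would distribute these same phases across many machines, which this sketch does not attempt.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle step: group emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input: two short text records.
records = ["data science at scale", "Data pipelines at scale"]
counts = word_count(records)
print(counts)
```

The point of the structure is that map_phase and reduce_phase never share state, which is what lets a cluster run them in parallel on separate chunks of data.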
A subset of these skills is enough for a position in a larger team, in which multiple people will be able to assist and overlap. But in order to “do data science” at all, a company needs all of the above skills covered before anything productive is going to happen. You have to decide how you fit into that system.
There are other elements that are much more specific to the type of data science you are performing. For instance, if you are doing Natural Language Processing (NLP), you should know the common methods and practices of that subfield. The same goes for image processing or time-series analysis, each of which has its own methodologies that you should be familiar with before applying to that kind of position.
What about machine learning? In trying to keep this advice generic, I believe that machine learning fits nicely into data science but not vice versa. I am describing someone who is capable of doing advanced analytics on data. That said, do not ignore machine learning. [D]
Feeling overwhelmed yet?
Not surprising. Make no mistake: data science is a highly technical and highly skilled position. An “entry level data science” role is really a misnomer. Unfortunately, the shiny veneer that companies use to portray data science does not reflect the depth and complexity of what they do behind the scenes.
Data science continues to be the hottest field and rewards Data Scientists with incredible career prospects. But as you can see the requirements and knowledge are extensive and not to be taken lightly.
Good luck!
COMMENTARY
[A] And by statistics I do not mean probability alone. I mean understanding confusion matrices, probability distribution functions, Bayesian methods, etc. Oddly, I have found that many of the “core” statistics lessons taught in academia are largely irrelevant in a machine learning world… ANOVA, t-tests, etc. Basically, in the real world nothing worth investigating is normally distributed, and you never have enough data or time for significance tests (unless you are in healthcare… in which case good luck).
[B] I know exactly the kind of code I wrote for my PhD… disgusting. Pick Python or Java as your data science language. Prefer Python. Feel free to scream at me in the comments. I can take it.
[C] Not every data science job requires this. It depends on the scale and scope of the problem the company is trying to tackle. Certainly not a bad thing to have in your tool belt.
[D] This is likely to change going forward. Machine learning tools are becoming so easy to use that not using them will be harder than using them. Just look at scikit-learn. Neglect it at your career’s peril.
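Note [A] mentions confusion matrices, and the Statistics bullet names accuracy, precision, and recall. As a minimal sketch of how those three numbers fall out of a binary confusion matrix, here is a plain-Python version that deliberately avoids any library. The function names (confusion_counts, summarize) and the label lists are made up for illustration, and the choice to return 0.0 when a denominator is zero is an assumption of this sketch, not a standard.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif guess == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def summarize(tp, fp, fn, tn):
    """Derive accuracy, precision, and recall from the four counts."""
    total = tp + fp + fn + tn
    return {
        # Assumption: report 0.0 rather than raise when a denominator is 0.
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels: 4 actual positives and 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
scores = summarize(tp, fp, fn, tn)
print(tp, fp, fn, tn, scores)
```

In this made-up case the accuracy is 0.7 while the recall is only 0.5, which is the kind of gap that a single headline number hides when the classes are imbalanced.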
When I first got into Data Science I thought it would be a natural transition from my background in statistics and engineering. I didn't put too much weight on the coding and computer science requirement. I thought I could just learn enough to get by. But now I know the truth. Knowing the ins and outs of code and being able to maneuver in different environments is pivotal to being productive (in my opinion). Especially when it comes to reading code and figuring out what others have done via all the great open source help on GitHub, Kaggle, and the like. Plus I have really enjoyed learning it, which is a surprise to myself. Great post Ben!