Want to become a Data Scientist?
Stephen Gatchell
Sr. Director, Data Advisory at BigID | Know Your Data. Control Your Data. | Security | Compliance | Privacy | AI Data Management
The big data landscape is difficult to navigate, and the number of available tools, which grows every day, is well on the way to becoming big data itself. FirstMark put together a great visualization (see below) that categorizes each tool. DataCamp developed a visualization of 8 easy steps (I say hard steps) to become a Data Scientist, and it motivated me to write this post. Things can get confusing, so I tried to simplify each step and provide some focus for it. Just keep in mind that the development approach, process and tool set depend on the use case. There is no single solution.
- Get good at stats, math and machine learning. Statistics and machine learning are key, and the good news is that there are tools to help you do the calculations. Your job is to select the appropriate algorithm and interpret the results. Check out Coursera and edX for courses on statistics and machine learning.
- Learn to code. R does a great job of providing a framework (RStudio), has good documentation, can generate visualizations with one line of code and is open source (which means you should download it now). Of course, the real work is getting the data prepped for the visualization, but that is true of any language.
- Understand databases. PostgreSQL with pgAdmin III (open source) is a good place to start, which is good news for BI-trained analysts since they have most likely worked with SQL in the past.
- Master data munging, visualization and reporting. I would have split these into two separate categories, with data munging in one and visualization and reporting in the other.
- Data munging can be done in R and/or SQL if you are a data scientist and want to code. If you prefer not to code, there are tools such as Alteryx (commercial product) that provide an end-user interface built on R, and you don’t have to know R to use the tool.
- Some visualizations are created for analysis, and others for presenting and telling a story. The analysis visualizations can be created using R, for example, but if you are presenting to a customer or Sr. Executive, then D3 (an open source JavaScript library) or Tableau (commercial product) are good tools.
- Level up with Big Data. A combination of structured and unstructured data seems to be optimal. For example, structured data can be stored in Greenplum with unstructured data stored in HDFS (Hadoop Distributed File System). Using both types of storage allows you to analyze data in HDFS, develop results, and move those results into a structured Greenplum environment for reporting. Both are open source.
- Get experience, practice and meet fellow data scientists. Meetup is a great app for finding events in your local area to discuss topics of interest. Examples include Analytics Club, Spark Networkers, and Techbreakfast. There are also public data sources, such as Quandl, to practice with and learn from.
- Internship, boot camp or get a job. My perspective on this is from a second career path. There are many Data Analysts, SQL programmers, Business Intelligence professionals, etc. who are now moving into Data Science. The key to changing career paths is to remember that data science is not business intelligence. Reporting on historical data is not the same as developing predictive analytics. Combining structured and unstructured data from internal and external sources comes with different expectations for data quality, data size, and data frequency. This is not your star schema SQL transactional data. This is in-your-face, textual, large-scale data that you may not control, and you may not have access to the source systems to fix data quality issues. Volunteer for a project to gain some insight from other Data Scientists and begin understanding where you may need to focus and develop specific skills.
- Follow and engage with the community. On Twitter, follow @alteryx, @storywithdata, @tableau, @MIT_CSAIL, @hmason, @analyticbridge, @cloudfoundry, @schmarzo. LinkedIn groups to follow include “Big Data and Analytics” and “Data Mining, Statistics, Big Data, Data Visualization and Data Science”, and there are tool-specific groups as well. Sign up to solve a Kaggle problem, or post on dataisbeautiful.
- Have the personality and skill set (I added this one; it is not part of DataCamp’s viz). Curiosity, persistence, collaboration, and the ability to listen and to continuously learn are key attributes.
- You have to question things that others may not even see.
- You have to keep knocking down barriers that get in the way (access to data, quality of data, resource availability).
- You can’t do data science on your own. You need someone to keep you on track, such as a program manager (this is due to the curiosity in you and the innate ability to follow the shiny light). You need a subject matter expert to help you understand the data and provide input on the results of the models. Do they make sense? You need a visualization person to help tell the story.
- Listen to your stakeholders, get frequent input from them, and adjust the plan accordingly. If you plan on going off on a tangent because the data led you there, that is ok; just be sure to keep the stakeholders updated so that expectations are managed. If the data leads to something really cool that was out of scope and took the entire project’s budget and schedule, you may have just lost your sponsor for future projects.
- Every day a new product is developed, a new Data Science use case is developed, new discoveries lead to more questions, and a Data Scientist must keep up with all of the changes. The tool set has grown exponentially over the past year or two and will continue to grow. Internet of Things will open up new use cases and we are literally at the tip of the iceberg.
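As a small illustration of the data munging step in the list above, here is a minimal sketch of cleaning data with SQL. It makes a few assumptions that are mine and not part of the original steps: it uses Python's built-in SQLite instead of PostgreSQL so that it runs without a database server, and the table name, column names and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical raw rows (customer, region, amount) with typical quality problems.
RAW_ROWS = [
    ("  Alice ", "east", "120.50"),
    ("Bob", "WEST", "80"),
    ("alice", "East", "120.50"),  # duplicate of the first row once cleaned
    ("Carol", None, "42.10"),     # missing region
    ("Dave", "west", "n/a"),      # non-numeric amount
]


def clean_sales(rows):
    """Load raw rows into an in-memory SQLite table and return cleaned rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE raw_sales (customer TEXT, region TEXT, amount TEXT)"
        )
        conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)
        cleaned = conn.execute(
            """
            SELECT DISTINCT
                   LOWER(TRIM(customer)) AS customer,  -- strip spaces, unify case
                   LOWER(TRIM(region))   AS region,
                   CAST(amount AS REAL)  AS amount     -- text to number
            FROM raw_sales
            WHERE region IS NOT NULL                   -- drop rows missing a region
              AND amount GLOB '[0-9]*'                 -- keep amounts starting with a digit
            ORDER BY customer
            """
        ).fetchall()
    finally:
        conn.close()
    return cleaned


if __name__ == "__main__":
    for row in clean_sales(RAW_ROWS):
        print(row)
```

The same ideas (trim, standardize case, drop incomplete rows, remove duplicates) carry over to R or to a tool with an end-user interface; only the syntax changes.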
The bottom line to the whole Data Science thing is to have fun, learn a lot and deliver value. Don’t get caught up in all of the tools, processes, and hype; just solve a problem. Focus on your strengths and use a tool or ask for help to address your weaknesses.
Good luck on your journey!