Five tips for getting started in data science programming

Keith McNulty

Leader in Technology, Science and Analytics | Mathematician, Statistician and Psychometrician | Author and Teacher | Coder, Engineer, Architect

Published: August 21, 2019

If you want to be a genuine data scientist, you need to be able to code. There’s no getting around it. Some people don’t like this idea, and a number of companies are already tapping into that discomfort by offering ‘automated data science’ products — we do the coding so you don’t have to. If these are your only toolkit, you are not a data scientist.

The litmus test of a strong data scientist is that they are not scared of any data set or any problem. They might not know how to handle it straight away, and in fact it’s quite common that they don’t. But they know they can find out how, and they can eventually produce neat, efficient, reproducible code to handle the problem if it comes their way again. If you want to be a great data scientist, that’s the mindset you need to aim for.

So much of the inner confidence and quiet competence of a strong data scientist comes from how they learned to code in the first place. If you are just starting out, how you go about those early weeks and months of learning is critical to whether or not you will flourish further down the line. If you take the lazy approach (the how but not the why), you’ll develop habits that will make you less confident and efficient later. If you put the work in early and understand the how AND the why, you’ll gradually start to feel that confidence build and your capability expand faster and faster as the months go by.

Here are five tips to help you make a great start as you embark on your learning.

1. Choose the right learning sources

People learn in different ways. For example, I am not great at video learning. I need a detailed written narrative that I can carefully analyze and understand at a pace that I am happy with.

Avoid sources that are too practical — this means that they show you what to do but don’t explain why it works. If you are copy-pasting a method to solve a coding problem, and you have no idea why the method worked, then you haven’t really learned anything because you will have no idea how to apply that method later if a similar problem pops up again.

Good learning sources invest time in breaking down the underlying logic of a method. The best ones actually encourage you to code the method yourself through nudges and tips, rather than give you the entire thing ready-made. Thoughtful educators will provide follow-up questions that require you to take what you’ve learned and apply it to another context in order to establish that you have learned it well.

It’s hard to find all of this in an online module, so I would recommend that you have a written resource for in-depth learning in your language of choice. If you ask friends, colleagues or classmates what they use, make sure that they have a similar philosophy to learning before accepting their recommendation.

2. Get skin in the game

Just as a sports team will try harder if there is a prize at stake, you will learn better if you have an incentive. Incentives are not credits that you can put on your resume if you have completed an online module. Incentives are real achievements that have made your current or future work better and stronger — where you and others can visibly see how things have improved because of the work you’ve put in.

As an example, when I first learned to code, I set myself a challenge on one of my own datasets. It was several hundred thousand lines of data which my colleagues had processed annually via Excel. It was a highly manual effort and was taking longer and longer every year because Excel was struggling with the increasing size of the dataset.

As I learned the basics, I also spent time applying my new learning to this dataset. It wasn’t easy. I made lots of errors and spent long hours trying to work out what was wrong and how I could fix it. But this trial and error process was important because it forced me to engage with the inner workings of the language I was learning and get a deep understanding of how it worked under the hood.

Several weeks of work led to a fully automated script that could handle these larger and larger datasets with ease — something both I and my colleagues were excited and awed by. The tangible benefits of my learnings were clear, and it gave me the incentive and confidence to continue at pace.

Working on your own dataset which you have a strong familiarity with is one of the most effective ways to put early learning into practice. Avoid random datasets from the internet where you may not understand what the variables represent or what kinds of manipulations are sensible and relevant. It’s much better to have skin in the game.

3. Errors are your friend, not your enemy

When you first learn you will make a LOT of errors. But that is a really good thing if you respond to them in the right way.

Whatever your language of choice, error messages can appear terse or unhelpful to the untrained eye, but spend a little more time on them and, nine times out of ten, you’ll get a decent understanding of exactly why your command didn’t work. This is important because if you understand why it didn’t work this time, you’ll know how to make it work next time.

Too many times I see friends and colleagues completely ignoring the text of error messages and coming straight to me or others asking for help. Since I have learned to treat the error message as my best data science friend, often I can take one look at the error and tell them straight away what the problem is.

When you see an error message, pursue it as the primary route to solving your problem. Often it will mention another function or operation and you’ll need to dive into that too to understand what went wrong. All of this is such an important part of gaining a deep understanding of the environment you are operating in.
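As a rough illustration in Python, the sketch below pulls out the three parts of an error that usually tell you what went wrong: the error type, its message, and the function in which it was raised. The names `mean_score` and `describe_failure` are made up for this example and are not from any library.

```python
import traceback


def mean_score(scores):
    """Return the average of a list of numeric scores."""
    return sum(scores) / len(scores)


def describe_failure(func, *args):
    """Call func(*args); on failure return (error type, message, function that raised)."""
    try:
        func(*args)
    except Exception as exc:
        # The last traceback frame is the innermost Python function that failed.
        frames = traceback.extract_tb(exc.__traceback__)
        return type(exc).__name__, str(exc), frames[-1].name
    return None


# Scores stored as text: the error type and message point at a data type problem.
print(describe_failure(mean_score, ["90", "85"]))

# An empty list: a different error type, and so a different fix.
print(describe_failure(mean_score, []))
```

The same function fails in two different ways here, and in each case the error type and message name the cause directly (text where numbers were expected, then a division by zero), which is usually enough to decide on the fix.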

4. Learn your base language before add-ons

Languages like R and Python benefit from a rich ecosystem of add-ons and packages to help import functionality needed for certain common tasks or problems. But be careful not to jump into these too quickly. These packages depend on their base language and could not operate without it. You will make life more difficult for yourself if you become too dependent on these without having a decent understanding of your base language.

If you don’t learn about how data types and data structures work in your base language, or if you don’t thoroughly understand how your system prioritizes between base functionality and imported functionality, you could end up in all sorts of twists later down the line that you don’t understand how to get out of. Errors will pop up and you will have no idea what they mean. Functions may produce a completely unexpected output that you have no understanding of.
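To make the point about base versus imported functionality concrete, here is a small Python sketch using only the standard library. It assumes nothing beyond the built-in `pow` and `math.pow`, which share a name but behave differently; the two wrapper functions are hypothetical and exist only to keep the scopes separate.

```python
import builtins


def with_base_only():
    """Use the built-in pow: integers stay integers, and a modulus is allowed."""
    return pow(2, 5), pow(2, 5, 7)


def with_math_import():
    """Import a same-named function; inside this scope it replaces the built-in."""
    from math import pow  # shadows the built-in pow within this function

    result = pow(2, 5)  # now a float, because math.pow always returns a float
    try:
        pow(2, 5, 7)  # math.pow takes exactly two arguments
        error_type = None
    except TypeError as exc:
        error_type = type(exc).__name__
    return result, error_type


print(with_base_only())
print(with_math_import())

# The base function is still reachable explicitly through the builtins module.
print(builtins.pow(2, 5, 7))
```

If you did not know that the import had replaced the base function, the float result and the `TypeError` would both look like unexplained behavior from `pow`.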

Early on, I set myself the challenge of completing a task in the base language before I then attempted it using add-on packages. At the beginning of my learning journey, when my manipulations were relatively straightforward, this was very beneficial to my understanding of my base language. I recommend this approach to anyone in the early stages of learning.
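As a sketch of what that exercise can look like in Python, the example below totals a value per group using only base data structures. The records and the helper `total_by_key` are invented for illustration; an add-on such as pandas offers the same operation through `groupby`, but writing it by hand first shows what a grouped sum actually does.

```python
# Made-up records standing in for a small dataset.
rows = [
    {"region": "north", "sales": 120},
    {"region": "south", "sales": 80},
    {"region": "north", "sales": 45},
]


def total_by_key(records, key, value):
    """Sum the `value` field of each record, grouped by its `key` field."""
    totals = {}
    for record in records:
        group = record[key]
        # dict.get supplies 0 the first time a group is seen.
        totals[group] = totals.get(group, 0) + record[value]
    return totals


totals = total_by_key(rows, "region", "sales")
print(totals)  # {'north': 165, 'south': 80}
```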

5. Embrace the community

One of the main reasons I love working in open source data science is its community. Whatever the problem you are facing, there’s an extremely strong chance someone has faced it before and can give you advice to help you learn. No single textbook can hope to cover all the questions you might have as you learn, so the community will gradually become a key resource for you as you advance your learning.

Newbies can be scared of the community, but there really is no reason to be. The biggest reticence is often intellectual. Is this a stupid question? Will I get an embarrassing slap-down? A little bit of thought and care on your part can help ease your concerns here.

First, choose your community carefully. If you are a beginner, don’t post questions to a Twitter hashtag that will push them to experienced programmers. Find online groups and hashtags that match your level of development and direct your questions to those folks.

If you are using a more wide-ranging resource like StackOverflow, learn its rules and follow them. If you are a beginner, it’s very likely that the question you asked has already been answered, so search for it carefully before you consider posting it as a new question. If you do post it as a new question, ensure you are really specific and provide a minimal reproducible example of your code. If you post a generic question with no example, you are certain to get smacked down, and you probably deserve it!
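For a sense of scale, a minimal reproducible example can be as short as the Python sketch below: a few made-up values pasted inline, the single operation in question, and the result you got next to the result you expected. The data and the problem are invented for illustration.

```python
# Made-up data, small enough to paste straight into a question.
values = ["10", "9", "100"]

# The one operation being asked about.
actual = sorted(values)

# What the asker expected to get instead.
expected = ["9", "10", "100"]

print("actual:  ", actual)
print("expected:", expected)
```

Anyone can run this as it stands and see the mismatch. In this invented case the values are text, so they sort alphabetically, and `sorted(values, key=int)` gives the expected order.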

If you do get a response to your question and you think it is too brief — for example someone just posts the code you need without an explanation — don’t be afraid to ask them to explain why it works. Most respondents want to help, and they want to build their reputation on the platform, and so they will usually be willing to expand on their response.

These are just a few things that I recommend if you are at the start of your data science learning journey and you aspire to be a great data scientist in the future. Good luck on this exciting journey!

I lead McKinsey's internal People Analytics and Measurement function. Originally I was a Pure Mathematician, then I became a Psychometrician. I am passionate about applying the rigor of both those disciplines to complex people questions. I'm also a coding geek and a massive fan of Japanese RPGs.

All opinions expressed are my own and not to be associated with my employer or any other organization I am connected with.

Josh Nelson

Co-founder; Former Managing Partner, Mashmetrics, Inc

4 years ago

Keith, the "why" rigor you advise, while probably not what most want to hear, is refreshing. Thank you. Mayo Racek, MBA

Li Zhong (钟莉), MBA

Leading Digital Enterprise Specialist -Azure Teams @ Microsoft | Empower AI Transformation for Enterprise Customers across Western Europe

5 years ago

Thank you Keith McNulty for sharing your valuable insights for many here, especially on the subjects of (1) how to learn - both “how” AND “why” when all kinds of add-ons are available (2) REPRODUCIBLE code for next use (3) COMMUNITY - open source mentality! Thank you!

Lionel Voirol

PhD Candidate Statistics | Research & Teaching Assistant at Université de Genève

5 years ago

Thibault Pierotti a great article on a topic we already discussed

Andrew P. P.

HR Analytics Specialist

5 years ago

Yes, undoing my bad / lazy habits later on has been a somewhat painful experience. Live and learn!
