
Breaking BERT: How to break into Machine Learning

Pascal Biese

Daily AI highlights for 60k+ experts. AI/ML Engineer

Published: June 3, 2019

In my time as a Data Science novice, I saw a lot of posts and articles giving advice to potential talent and newcomers to the field. In fact, there has been a plethora of such articles during the last two years. Some of them were helpful to me, others were not, and quite a few started well but ultimately struggled to make things easier for the reader. Don’t get me wrong, I learned a lot from these articles, but I had to read a dozen of them and combine their helpful aspects to eventually get a clearer picture.

So let’s start things off: It is impossible to learn every single skill that is commonly filed under “Data Science”. Data Science is a vast, poorly defined field with huge gaps between business and academia, and between different industries and ways of thinking. Think of a bunch of separate camps that together form one big settlement. As such, I think it’s impossible to write a single article that covers every important piece of advice. I still hope that this article at least makes it clearer to readers what the next steps in their journey should be, even if that just means rethinking their career path and refining the roles they aim to fill. More concretely, in my opinion, the most common hindrances to effective guidance within the Data Science and AI field are:

1. The range of potentially required skills is tremendously wide if you count them all, and authors easily get carried away by this, even when the purpose of their guidance was to mention only the “most important skills”.

2. Some authors try to appeal to the whole field of Data Science, which is a vast and poorly defined area. But what Data Science is (at least for companies: what they want and need) can vary from area to area, from field to field and often from person to person. You will find that this guide is strongly biased towards Deep Learning. It’s also somewhat limited to the area I live in. You may find that companies in your area do things completely differently, and I will not be able to cover every aspect in this article.

Data Science Venn Diagram v2.0 by Steven Geringer Raleigh (see https://www.kdnuggets.com/2016/10/battle-data-science-venn-diagrams.html for a review of its somewhat controversial history).

Now that we got this out of the way, let’s start for real!

Above you can see one of the popular attempts to describe Data Science with the help of Venn diagrams, which is a more heatedly debated topic than you might think. I included the picture only as a first visual guide for the reader. What I will be focusing on is the area between Computer Science and Mathematics (Statistics). This area is commonly called “Machine Learning”, but you will often find job postings that are labeled “Data Scientist”. If the job description is written in a useful way, it should nevertheless become clear whether they are looking for a Data Analyst, Data Engineer or Machine Learning Engineer. Some companies are looking for everything at the same time, without knowing the differences, and that’s usually not a good sign.

Great, so we have settled on the Machine Learning Engineer, and that’s what we want to become. Some people will tell you to start with the basics, to (re-)learn Linear Algebra, Calculus, Statistics, Machine Learning 101 and so on. Some will tell you that you will have to choose between R and Python. Maybe you will even be told to learn various database languages and how to manage Big Data “at scale”.

While most of those skills are important, how much you need each of them really depends on the job. The most important skill, and one you cannot simply learn, is the habit of constantly learning, improving and rethinking huge parts of what you are doing. Especially in the Deep Learning area, knowledge is expanding at an immense pace, and if it’s not in your nature to keep up with things closely, you will not be happy with your career in the long run. On the other hand, if it does lie in your nature to constantly learn and improve, you can always catch up with requirements. A lot of people will disagree with me when I say this: You don’t need to understand the Math behind Deep Learning to get started. You don’t need to be an expert Statistician to build reasonable Machine Learning systems. And the list goes on. To this day, I have never used the SQL basics I had learned in preparation. In other jobs (actually, almost all of them if you are working in finance or similar industries) SQL is mandatory. However, the time it takes to learn SQL or rekindle High School Math is worlds apart from the time it takes to learn “programming” or get an intuition for Machine Learning. At some point, you should be capable of most of the things I listed above and much more. But if you ask me, you should get your priorities right and start practicing as early as possible.

Learn Python. Just do it.

If you want to work with Deep Learning, learn Python. I won’t even discuss this. Personally, I started with R and I still like the language, but it’s frustrating to be in an environment where virtually everyone else uses a different language by default. With Python you can’t go wrong, and I haven’t talked to a company yet that demanded R for Machine Learning or was amazed by my knowledge of it (Note: If you want to work in the Data Analyst role I mentioned earlier, this might change in some cases. But I will not cover that here). At the beginning, learning to read and write common Machine Learning code in Python should be your highest priority. I would recommend taking one of the great Machine Learning and/or Deep Learning courses out there and not(!) rushing through it. Take your time, let it sink in and make sure to code, code, code. This is where I erred myself: I went through courses too quickly and coded too little. Python will be your way to talk to your machine. Imagine not being able to communicate fluently with your co-workers and you will realize why it becomes a real hindrance really quickly if you can’t express yourself. As with every language, it takes time to become fluent, and this should not be underestimated. There is no real minimum criterion, but a number often floating around is “1 year of programming experience”. The actual pace depends on a lot of things: for some people it will take 6 months, for others 2 years or more to reach a fluency where you can really work on things effectively.

In 2019, “TensorFlow or PyTorch?” is the new R versus Python

For quite a while, TensorFlow dominated the scene as the default Deep Learning library you had to be fluent in. There were other libraries, some of them before TensorFlow was even a thing, but if you had to pick one for popularity, TensorFlow was clearly the winner of the past few years. Recently, this dominance has been shaken by a comparatively new competitor: PyTorch (which is basically Torch for Python). I won’t go into the details here, and both frameworks are perfectly fine. What you have to realize when breaking into the field and looking for code, articles and courses is that these two distinct libraries exist and that you might want to stick to one at a time as a beginner. Especially with courses, it can be quite annoying when you have just become somewhat fluent in one of the frameworks and then start a follow-up course that uses the other one. If you’re having trouble deciding, here’s a very inaccurate primer on the situation in early 2019: TensorFlow is still more widespread overall, especially among practitioners, but PyTorch has been catching up quickly, mainly because it has become more popular in research (outside of Google). In addition, you can look at job postings in your area and compare the mentions of TensorFlow and PyTorch to help you make a more informed decision. Personally, I went with the course(s) that I found most appealing when I started out and used whatever they used. But in hindsight it was quite annoying to have to switch between frameworks when I wasn’t fluent in either of them to begin with.
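The job posting comparison mentioned above can be done by hand, but it is also a nice first Python exercise. What follows is only a minimal sketch, assuming you have already collected the posting texts yourself as plain strings. The function name `count_framework_mentions` and the sample postings are made up for illustration and do not come from any real job board.

```python
import re

# Search patterns for the two frameworks (case-insensitive, whole words).
FRAMEWORK_PATTERNS = {
    "TensorFlow": re.compile(r"\btensorflow\b", re.IGNORECASE),
    "PyTorch": re.compile(r"\bpytorch\b", re.IGNORECASE),
}


def count_framework_mentions(postings):
    """Count how many postings mention each framework at least once."""
    counts = {name: 0 for name in FRAMEWORK_PATTERNS}
    for text in postings:
        for name, pattern in FRAMEWORK_PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


# Hypothetical postings, invented purely for this example.
sample_postings = [
    "ML Engineer: experience with TensorFlow and Keras required.",
    "Research Engineer: PyTorch preferred, TensorFlow a plus.",
    "Data Analyst: SQL and Tableau.",
]

mention_counts = count_framework_mentions(sample_postings)
print(mention_counts)
```

Counting postings rather than raw word occurrences keeps a single posting that repeats a framework name many times from skewing the comparison.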

If you’re taking one of those courses, you will also learn everything you need to know about Machine Learning to get started. Some instructors will focus more on the Math and Statistics behind it, some less. If you’re taking a good course at a reasonable pace and keep practicing Python together with TensorFlow and/or PyTorch, you should have learned the basics you need for the next step in your career: Specialization.

Don’t be a jack of all trades — unless that’s what you want

There are Data Science and Machine Learning generalists, but in most cases I would highly recommend specializing in an area of industry and/or a type of data. For data types, the most popular Deep Learning fields are Computer Vision (CV) and Natural Language Processing (NLP), but in a lot of industries and job positions, you will mostly work with tabular data. Those are also the jobs where classical Machine Learning knowledge becomes more important. Popular areas to specialize in are Finance, Life Sciences (Biology, Chemistry, Medicine) and “Industry 4.0”, which includes things such as process optimization and error detection. The skills needed to excel in each of these industries can vary widely. In Finance, for example, you will much more often work with tabular data and classical Machine Learning methods and (probably) do a lot of Data Visualization.

If you already know that you want to work in a certain area, I would highly recommend looking at relevant job postings, articles and other informative resources in order to find out what companies want and need from their applicants. I’d also recommend talking to people in the respective fields, be it online or in your local area, to get a better feeling for the requirements. Most importantly, you will start to feel which ideas resonate with you and which don’t. By now you should know the basics of Machine Learning, Statistics and Deep Learning, and what you want to specialize in. Keep in mind that you can always decide to change direction later in your career, but in order to become useful as a Machine Learning Engineer, you’ll need to become reasonably good in at least one of the disciplines.

For the next step, I will only cover the Natural Language Processing side of things in detail, but I will try to keep the advice general. Now that you have chosen a specialization, you should get up to date in the field. For me, that meant taking another course that was specifically tailored to Natural Language Processing. You don’t have to rely on courses; if you don’t like them, you can do other things. The important part is that you keep practicing expressing yourself in relevant code (analyzing data relevant to your field, building up-to-date models and so on). I started to be active on Kaggle, read a lot of articles and tried out different Deep Learning architectures. If you haven’t started to use GitHub, I would advise you to start now. Save every task or project you do, no matter how small, and upload it to your GitHub. This is another thing that I started doing very late. I rarely saved anything from the courses I had taken and the experiments I had run. When I then started to apply for jobs, I had no code to show despite having coded for over two years in R and Python combined. You don’t have to save everything, but you should start building a portfolio early. And don’t be afraid, nobody will laugh at your “bad code”. If Python and Machine Learning basics are the most important things to learn and Deep Learning knowledge the second, then I would rank portfolios third. Don’t search for excuses not to get practical; don’t be me when I was in your shoes. While I got invited to job interviews without a portfolio, relevant code to show would have made things a lot easier.

Are we there yet? Be patient.

So, keep this in mind: No matter what you want to do in the end, you need to be able to express yourself in code. By coding and experimenting with models and data, you will also gain an intuition for Machine Learning and/or Deep Learning. If you keep doing that, upload (small) projects to your portfolio and back it up with theoretical knowledge from courses, articles and research papers, you will, with time, become a Data Scientist. In our case here, a Machine Learning Engineer specialized in a high-demand area. Sometimes you won’t be able to feel your own progress; it may even seem like you’re barely improving at all. But don’t stress yourself, Data Science is not a sprint. Forming intuition takes time, and so does letting knowledge sink in and build upon itself. If you keep doing what I explained in this article, at some point you will become capable in Machine Learning, and faster than most people in a comparable position.

The reason? You stuck to your plan. You set your priorities instead of getting distracted by less important things; you practiced coding when they were lost in choices; you finished courses slowly but steadily when they were starting five different ones just to drop them a few weeks later; you got in touch with people when they were silently brooding about their next hypothetical steps. They are like me when I was you. But you, just like I did, can take the chance to become the me I am now and beyond. Motivational speeches aside: It’s all about practice. Get out there and do it.

Additional resources:

Another great guide that is more general and covers some things I didn’t talk about in detail, e.g. networking, pitching: (https://towardsdatascience.com/breaking-into-data-science-in-2019-889111e5c34f)

Link to poor BERT that I used for my headline: https://arxiv.org/abs/1810.04805 (Note: Originally, I wanted to go into more detail on state-of-the-art Natural Language Processing but decided against it since I wanted this article to be useful for a broad audience)

Selection of online courses that I either participated in myself and/or rate highly:

Deeplearning.ai — I can highly recommend their Deep Learning Specialization

fast.ai — Probably the best course out there if you’ve already got a bit of experience in Python

https://web.stanford.edu/class/cs224n/ — One of the best NLP classes worldwide (if you like traditional University style lectures)

In general, I can recommend most if not all of the Stanford classes on Machine Learning and Deep Learning — the same goes for Carnegie Mellon University (CMU). Their lectures (and most of the assignments) are publicly available and highly up-to-date.

Mind you: There’s an abundance of great courses in 2019, and a lot of them are available to audit for free, which is a luxury I didn’t have back when I started. Make use of it. Additional examples of platforms to look at are edX, Coursera, Udemy and Udacity, but there are many more.
