Best resources to get started with machine learning and AI
A sample from the MNIST dataset, a classic toy dataset used in image recognition

Introduction

Purpose and motivation

I often get asked, at work and in class at IE, what the best resources are to get started with data science (DS), machine learning (ML) and AI. The question usually comes from people in adjacent fields or with only occasional, limited exposure to the field. This article is a humble attempt to answer it by listing the tools, books and other resources that I have found most useful.

Scope

Mapping ML/AI or, more broadly, data science is of course a Napoleonic goal, and one subject to fashions and changes in nomenclature (e.g. "big data" seems to have fallen into oblivion, even if data is "bigger" than ever). This article presents a high-level collection of resources and tools I find essential for beginners, rather than an academic taxonomy or a philosophical effort to organize the field.

For this article, data science loosely refers to the discipline that answers questions requiring (1) statistics (i.e. non-deterministic reasoning) and (2) computers (i.e. going beyond pen-and-paper solutions). ML and AI are subsets of data science. I focus on ML, both to avoid the breadth of data science and because AI is an extension of ML.

Programming languages

Choosing a language

In order to learn data science, one needs to code. The reader should be as suspicious of courses that promise "AI expertise" without coding as they should be of training devices that promise muscle gains with no effort.

Python

This brings us to the first question: which programming language should I learn? Let's just say that Python has won the battle to be the general-purpose programming language of data science, with R falling behind but still relevant thanks to its legacy status in statistics and academia. So, learn Python. If you are truly hardcore and want to learn a highly performant low-level language (i.e. a language that is "closer to the machine"), learn Rust.

SQL

We are not going to cover resources about databases, but you will also need to learn SQL. SQL is essential for retrieving datasets and feeding them into your code, so it is highly recommended even if some of its functionality can be handled by Python. SQL is a declarative language: you describe the result you want rather than the steps to compute it. This makes it awkward for programmers coming from procedural or object-oriented languages, but it can be mastered much faster than other languages thanks to its limited scope. So, learn SQL: do the drills, practice a hundred queries, and you will be good to go.
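To make the declarative-versus-procedural contrast concrete, here is a minimal sketch that computes the same per-customer totals twice: once with an explicit Python loop and once with a SQL query run through Python's built-in sqlite3 module. The table name, column names and data are made up for illustration.

```python
import sqlite3

# Hypothetical toy data: (customer, amount) pairs invented for illustration.
ORDERS = [("ana", 30.0), ("bob", 12.5), ("ana", 20.0), ("cid", 7.5), ("bob", 40.0)]


def totals_procedural(orders):
    """Procedural style: spell out *how* to accumulate the totals."""
    totals = {}
    for customer, amount in orders:
        totals[customer] = totals.get(customer, 0.0) + amount
    return sorted(totals.items())


def totals_declarative(orders):
    """Declarative style: state *what* result is wanted and let SQL plan it."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
        rows = conn.execute(
            "SELECT customer, SUM(amount) FROM orders "
            "GROUP BY customer ORDER BY customer"
        ).fetchall()
    finally:
        conn.close()
    return rows


if __name__ == "__main__":
    print(totals_procedural(ORDERS))
    print(totals_declarative(ORDERS))
```

Both functions return the same list of (customer, total) pairs. The SQL version never says how to loop or where to keep running sums, which is exactly what feels unfamiliar at first.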

Books to learn Python and SQL

I have a personal preference for books (versus online tutorials, etc.) and, in particular, I've found O'Reilly books to be of high quality (by the way, I earn no commissions from this or any other vendor listed in this article).

Online resources to learn Python and SQL

  • LeetCode: the gold standard of interview prep for software developers, this site also includes excellent mini-courses and exercises for SQL, and computer-science-style exercises for Python and other common languages
  • HackerRank: very similar to LeetCode
  • CodeSignal: similar to the previous two, with a heavier focus on courses (which tend to be too easy) and interview simulation
  • DataCamp: more focused on courses, which also tend to be on the easy side

Development environments:

The world of IDEs (Integrated Development Environments) is subject to preferences and fads even more than that of programming languages. Without diving too deep into the topic, these are two good options to get started:

  • VS Code: this editor, developed by Microsoft, has become one of the most popular
  • Replit: an online IDE rather than a learning platform per se, but still worth mentioning for readers who want to start writing code quickly

Data Science / Machine learning / AI:

Machine learning can be seen as a precursor of AI or as an enhanced version of statistics; in any case, it is what you should start learning if you are interested in AI or in data science. I will broadly split the field into Discriminative AI, which is the field of prediction and classification, and Generative AI, which is the field of generating new content based on a "prompt" or query.

Data Science:

Discriminative AI:

The best book I have read is An Introduction to Statistical Learning, which is available for free online and has an R version and a Python version. Its more advanced companion, The Elements of Statistical Learning, is also excellent.

In those books you will learn about regression, KNN, logistic regression, trees, random forests, support vector machines, hierarchical clustering, K-Means, and dimensionality reduction techniques. These are the methods used most frequently when dealing with tabular data (i.e. rows and columns), which is also the most frequent type of data besides text.

A more advanced but still accessible book is Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. It covers deep learning, which is just the subset of ML that deals with neural networks. Neural networks can be applied to tabular data, but their real power shows in the generation of text and images, as we'll see in the next section.
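As a taste of what these methods look like in code, here is a minimal from-scratch sketch of KNN (k-nearest neighbors) on a tiny tabular dataset. It uses only the standard library so it runs anywhere; the function name `knn_predict` and the data are hypothetical, and in practice you would more likely reach for a library such as scikit-learn.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair every training row with its Euclidean distance to the query,
    # then sort so the closest rows come first.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy table: (height_cm, weight_kg) rows with a made-up label.
X = [(150, 50), (155, 54), (160, 58), (180, 85), (185, 90), (190, 95)]
y = ["small", "small", "small", "large", "large", "large"]

if __name__ == "__main__":
    print(knn_predict(X, y, (158, 55)))
    print(knn_predict(X, y, (183, 88)))
```

The whole "model" is just the stored table plus a distance function, which is why KNN is often one of the first methods taught.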

Lastly, Explainable AI for Practitioners covers the topics of explainability and interpretability, which are somewhat ancillary to the models themselves, but crucial in business environments.

Generative AI:

Everybody talks about Generative AI but, surprisingly, most resources are either too superficial or too specific (i.e. research papers). I have found that Generative Deep Learning fills this gap wonderfully.

For mathematically inclined readers who like to embrace the grind, this collection of papers is an excellent advanced resource.

Bayesian Inference

Bayesian inference and the closely related field of causal inference have gained popularity over the past couple of decades. Although neither is a class of ML per se, both are important topics adjacent to the ones we are discussing.

  • The Book of Why is a classic introductory book on the topic that is relatively easy to read, by one of the leading authorities on the topic, Judea Pearl
  • Causal Inference for the Brave and True is denser, but is an excellent free resource that covers the topic with a fantastic balance of theory and practice
  • Think Bayes is also a great conceptual book with tons of examples in Python; as mentioned before, most of the books by this author are great
  • My colleague Helmut Wasserbacker has published this resource, which I have not had a chance to review yet but I bet is clear, applicable, and super useful.
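For readers who have never seen Bayesian updating in action, here is a minimal sketch of Bayes' theorem for a single yes/no hypothesis. The function name `posterior` and all the numbers (a 1% base rate, 95% sensitivity, 5% false-positive rate) are assumptions chosen purely for illustration.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Bayes' theorem for a binary hypothesis: P(H | positive evidence)."""
    # Total probability of seeing positive evidence, whether H is true or not.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 1% base rate, 95% sensitivity, 5% false positives.
if __name__ == "__main__":
    once = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
    # Yesterday's posterior becomes today's prior for a second positive test.
    twice = posterior(prior=once, likelihood=0.95, false_positive_rate=0.05)
    print(f"after one positive test:  {once:.3f}")
    print(f"after two positive tests: {twice:.3f}")
```

With these assumed numbers, a single positive result only lifts the probability to roughly 16%, and a second one to roughly 78%. Feeding each posterior back in as the next prior is the core idea that the books above develop in depth.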

Closing thoughts

I hope the above collection was useful, although I know it is impossible to cover all the books, courses, and tools out there, which are fortunately plentiful. Based on my experience as a machine learning practitioner and as a professor at IE, I have tried to highlight the ones that were most useful and that sit in the Goldilocks zone between academia and business practice.

Please comment below with your experiences and suggestions. Also, feel free to comment on topics you may want me to elaborate on in the future, and to reach out if you need guidance.

