Best resources to get started with machine learning and AI
Introduction
Purpose and motivation
At work and in class at IE, people who come from adjacent fields, or who have had only occasional and limited exposure to the field, often ask me what the best resources are to get started with data science (DS), machine learning (ML) and AI. This article is a humble attempt to answer that question by listing the tools, books and other resources that I have found most useful.
Scope
Mapping ML/AI or, more broadly, data science is of course a Napoleonic goal, and the field is also subject to fashions and changes in nomenclature (e.g. "big data" seems to have fallen into oblivion, even if data is "bigger" than ever). This article presents a high-level collection of resources and tools I find essential for beginners, rather than an academic taxonomy or a philosophical effort to organize the field.
For this article, data science loosely refers to the discipline that answers questions requiring (1) statistics (the answers are non-deterministic) and (2) computers (the solutions go beyond pen and paper). I treat ML and AI as subsets of data science, and I focus on ML both to avoid the broadness of data science and because AI is an extension of ML.
Programming languages
Choosing a language
In order to learn data science, one needs to code. The reader should be as suspicious of courses that promise "AI expertise" without coding as they should be of training devices that promise muscle gains with no effort.
Python
This brings us to the first question: which programming language should I learn? Let's just say that Python has won the battle for general-purpose programming language in data science, with R falling behind but still relevant as a beneficiary of its legacy status in statistics and academia. So, learn Python. If you are truly hardcore and want to learn a low-level language (i.e. a language that is "closer to the machine") that is highly performant, learn Rust.
SQL
We are not going to cover resources about databases, but you will also need to learn SQL. SQL is essential for retrieving datasets and feeding them into your code, so it is highly recommended even if some of its functionality can be handled by Python. SQL is a declarative language, which makes it awkward for programmers coming from procedural or object-oriented languages, but it can be mastered much faster than other languages thanks to its limited scope. So, learn SQL: do the drills, practice a hundred queries, and you will be good to go.
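To make the declarative versus procedural contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sales table, its columns and its values are made up for illustration; a real project would connect to an existing database instead.

```python
import sqlite3

# Hypothetical in-memory table; names and values are invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0), ("south", 45.0)],
)

# Declarative: describe the result you want; the database decides how to compute it.
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY region
"""
sql_totals = conn.execute(query).fetchall()

# Procedural equivalent in Python: spell out the loop and the accumulation yourself.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount
py_totals = sorted(totals.items())

conn.close()

print(sql_totals)
print(py_totals)
```

Both approaches produce the same totals per region. The SQL version states what is wanted in four lines and leaves the "how" to the database, which is the mental shift the drills are meant to build.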
Books to learn Python and SQL
I have a personal preference for books (versus online tutorials, etc.) and, in particular, I've found O'Reilly books to be of high quality (by the way, I earn no commissions from this or any other vendor listed in this article).
Online resources to learn Python and SQL
Development environments:
The world of IDEs (Integrated Development Environments) is subject to preferences and fads even more than that of programming languages; without diving into the topic too much, these are two good options to get started:
Data Science / Machine learning / AI:
Machine learning can be seen as a precursor of AI or an enhanced version of statistics; in any case, it is what you should start learning if you are interested in AI or in data science. I will broadly split the field into Discriminative AI, which is the field of prediction and classification, and Generative AI, which is the field of generating new content based on a "prompt" or query.
Data Science:
Discriminative AI:
The best book I have read is An Introduction to Statistical Learning, which is available for free online and has an R version and a Python version. Its more advanced counterpart, The Elements of Statistical Learning, is also excellent.
In those books you will learn about regression, KNN, logistic regression, trees, random forests, support vector machines, hierarchical clustering, K-Means, and dimensionality reduction techniques. These are the methods used most frequently when dealing with tabular data (rows of observations and columns of variables), which is also the most frequent type of data besides text.
A more advanced but still accessible book is Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. It covers deep learning, which is just the subset of ML that deals with neural networks. Neural networks can be applied to tabular data, but their real power comes with the generation of text and images, as we'll see in the next section.
Lastly, Explainable AI for Practitioners covers the topics of explainability and interpretability, which are somewhat ancillary to the models themselves, but crucial in business environments.
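To give a flavor of one of these methods, here is a minimal sketch of KNN (k-nearest neighbors) classification in plain Python. The function name knn_predict and the toy data are my own illustrative assumptions, not taken from the books; in practice you would likely use a library implementation such as scikit-learn's KNeighborsClassifier.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training label with its Euclidean distance to the query point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and return the most common one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up toy dataset: two features per observation, two classes.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
train_labels = ["low", "low", "low", "high", "high", "high"]

pred_a = knn_predict(train_points, train_labels, (1.8, 1.6))
pred_b = knn_predict(train_points, train_labels, (6.4, 6.4))
print(pred_a, pred_b)
```

The whole idea fits in a few lines: a new observation gets the label that is most common among the observations closest to it. Most of the other methods in the list follow the same pattern of learning from labeled rows and predicting for new ones.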
Generative AI:
Everybody talks about Generative AI but, surprisingly, most resources are either too wishy-washy or too specific (i.e. papers). I have found Generative Deep Learning to fill this gap wonderfully.
For mathematically inclined readers who like to embrace the grind, this collection of papers is an excellent advanced resource.
Bayesian Inference
Bayesian inference, along with the related field of causal inference, has gained popularity over the past couple of decades. Although not a class of ML per se, it is an important topic adjacent to the ones we are discussing.
Closing thoughts
I hope the above collection was useful, although I know it is impossible to cover all the books, courses, and tools out there, which are fortunately plentiful. Based on my experience as a machine learning practitioner and as a professor at IE, I have tried to highlight the ones that were most useful and that dwell in the Goldilocks zone between academia and business practice.
Please comment below based on your experiences and suggestions. Also, feel free to comment on topics you may want me to elaborate on in the future, and to reach out if you need guidance.