Python vs R: An Introduction to Statistical Learning
A Python version of the well-known book An Introduction to Statistical Learning was released in mid-2023, without the fanfare one might expect at the arrival of something wanted by so many for so long.
Even when the R version was the only one available, its fame transcended language borders, and Python users went so far as to create Python versions of the book's R code and exercises. One such labor of love can be seen in this GitHub repository, and there are others too. But the publication of an official version changes everything, and many of the book's admirers will want to know whether they should get a copy now.
This article compares the two versions, to help Python users decide whether they should get a copy of the new book. It is not a review of the book, for the simple reason that I am not competent to review it. Note also that, like the R versions, the Python version can be downloaded free as a PDF.
In the rest of this article I use abbreviations for the two versions: "ISLPy" refers to the Python version, while "ISLR" refers to the R version. I compare the Python version with the original (not the 2nd) edition of the R version, as that is the edition most people seem to mention.
Under the Hood: scikit-learn, statsmodels, SciPy, ...
As expected, all code and exercises in ISLPy depend on scikit-learn, statsmodels, SciPy, and the other usual Machine Learning (ML henceforth) libraries. But to keep the book's text and code as close as possible to ISLR, a wrapper that hides the names of these underlying ML libraries is used (see the ISLP package). This is helpful for those already familiar with ISLR, but may be an impediment for others.
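For readers curious about what the wrapper hides: under ISLP's interface, a linear-regression fit ultimately comes down to the ordinary least squares computation that statsmodels performs. A rough NumPy-only sketch of that computation (made-up data and my own illustrative code, not the book's) might look like this:

```python
import numpy as np

# Illustrative data: y is roughly 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# Design matrix with an intercept column, as statsmodels' add_constant would build
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares, solved via lstsq for numerical stability
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta
print(intercept, slope)  # should land close to 1 and 2
```

The wrapper spares the reader this plumbing, which is exactly why it helps ISLR veterans but can obscure the underlying libraries for newcomers.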
If you just want to learn about Python ML libraries and techniques, ISLPy may not be a good purchase for you. On the other hand, if you are seeking to deepen your understanding of statistical concepts while learning practical Python ML techniques at the same time, then this book is certainly worth considering. ISLPy will also be an invaluable acquisition if you have already read ISLR and have some familiarity with Python and its ML libraries, or are looking to migrate to Python.
Comparing Chapter Titles
ISLPy (the new Python version) has 13 chapters, while ISLR (the original R edition) had 10. The newly added chapters are Deep Learning (chapter 10), Survival Analysis and Censored Data (chapter 11), and Multiple Testing (chapter 13).
Unsupervised Learning, which was chapter 10 in ISLR, has been moved to chapter 12 in ISLPy.
Chapter-wise Changes
The text within preexisting chapters has also changed in small ways; the following sections highlight the most significant differences.
1 Introduction
The chapter's content (examples, figures, text) is almost identical to that in ISLR.
There are minor differences in the text after the heading A Brief History of Statistical Learning. ISLPy does not name the people that ISLR credited with developing the various methods. It also mentions the advent of neural networks and support vector machines in the 1980s and '90s. The use of Python is mentioned here for the first time.
The text under the heading This Book is also virtually unchanged. However, the last part mentions the R history of the book through two editions (2013 and 2021), and how the increasing popularity of Python led to the publication of ISLPy. It also mentions the Python ISLP package used with the book's examples and exercises.
In ISLR, the text under the heading Who Should Read This Book contained, among other things, this line: "Previous exposure to a programming language, such as MATLAB or Python, is useful but not required." ISLPy returns the compliment: "Previous exposure to a programming language, such as MATLAB or R, is useful but not required!"
2 Statistical Learning
There are very few changes in the body of the chapter. The role of Deep Learning (chapter 10 of ISLPy) is mentioned in a few places (e.g. Figure 2.7).
But the Lab (titled Introduction to Python) is, of course, completely different. It occupies 22 pages (ISLR needed just 9) and touches on a diverse range of topics including Python 3 basics, Google Colaboratory, Jupyter, NumPy, pandas, and Matplotlib.
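To give a flavor of the material this Lab covers, here is the kind of basic NumPy manipulation it walks through (an illustrative snippet of my own, not the book's code):

```python
import numpy as np

# A small taste of the array operations introduced in the Lab
a = np.array([3, 1, 4, 1, 5, 9, 2, 6])

print(a.mean())          # arithmetic mean of all elements
print(a.reshape(2, 4))   # reshape the flat array into a 2x4 matrix
print(a[a > 3])          # boolean indexing: keep elements greater than 3
```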
The Conceptual part of the Exercises appears to be identical to that in the ISLR. The problems in the Applied part are also the same, but changes have been made to accommodate the Python ecosystem's way of achieving the same (or similar) effects.
3 Linear Regression
The content of the chapter (titles, subtitles, paragraphs, and figures) is substantially the same. But the structure of many sentences has been changed to make the meaning clearer.
The Lab has the same structure, and covers the same topics.
The Conceptual part of the Exercises appears to be identical to that in the ISLR. The problems in the Applied part are also the same, but changes have been made to accommodate the Python ecosystem's way of achieving the same (or similar) effects.
4 Classification
The structure and content of the chapter remain similar, but there are some differences.
5 Resampling Methods
The content of the chapter (titles, subtitles, paragraphs, and figures) is substantially the same as in ISLR.
The Lab is unchanged, and has exactly the same sections as in ISLR.
The Exercises (Conceptual and Applied parts) are unchanged.
6 Linear Model Selection and Regularization
The content of the chapter is almost unchanged, although some sentences have been reworked to express their meaning better.
The Lab has a different structure from ISLR (although similar in topic content) because of differences in the R and Python libraries.
The Conceptual part of Exercises is unchanged. The Applied part is also essentially unchanged, but small adjustments have been made because of differences between the R and Python libraries.
7 Moving Beyond Linearity
The content of the chapter is virtually unchanged, although some paragraphs and sentences have been reworked to express their meaning better.
The Lab has a different structure from ISLR (although similar in topic content) because of differences in the R and Python libraries.
The Conceptual part of Exercises is unchanged. The Applied part is also essentially unchanged, but small adjustments have been made because of differences between the R and Python libraries.
8 Tree-Based Methods
There are two extra subsections: 8.2.4, Bayesian Additive Regression Trees (over 3 pages), and a very short 8.2.5, Summary of Tree Ensemble Methods. But the rest of the chapter is virtually unchanged.
The Lab has an extra subsection for Bayesian Additive Regression Trees. But it is otherwise similar in topic content, with some changes caused by differences between the R and Python libraries.
The Conceptual part of Exercises is similar except for the BART (Bayesian Additive Regression Trees) additions. The Applied part is also similar, but has BART (Bayesian Additive Regression Trees) additions as well as other small adjustments to accommodate differences between the R and Python libraries.
9 Support Vector Machines
The content of the chapter carries over from ISLR virtually unchanged, apart from very minor edits to the text.
The Lab has exactly the same sections as in ISLR, although the content differs because of the differences between the R and Python libraries.
The Exercises (Conceptual and Applied parts) are unchanged.
10 Deep Learning (new in ISLPy)
This is a completely new chapter. ISLR did not cover the material in this chapter, so there is nothing to compare against. Some Python ML books use scikit-learn's MLPRegressor and MLPClassifier classes to illustrate deep learning in code. But this book uses PyTorch instead, making it more valuable as a resource for learning practical techniques.
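The book's PyTorch code is its own; but the arithmetic a small feed-forward network performs can be sketched in plain NumPy (my illustration, with arbitrary random weights, not the book's code):

```python
import numpy as np

# Forward pass of a one-hidden-layer network with ReLU activation
rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))   # input dimension 4 -> hidden dimension 3
b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1))   # hidden dimension 3 -> scalar output
b2 = np.zeros(1)

def forward(x):
    h = np.maximum(0.0, x @ W1 + b1)   # hidden layer with ReLU
    return h @ W2 + b2                 # linear output layer

x = rng.normal(size=(5, 4))            # a batch of 5 inputs
print(forward(x).shape)                # one prediction per batch row
```

What PyTorch adds on top of this, and what the chapter's labs exercise, is automatic differentiation and training machinery rather than the forward computation itself.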
11 Survival Analysis and Censored Data (new in ISLPy)
This is a completely new chapter, so there is nothing to compare against. A paragraph near the beginning of the chapter is quoted below to give readers an idea of what to expect:
For example, suppose that we have conducted a five-year medical study, in which patients have been treated for cancer. We would like to fit a model to predict patient survival time, using features such as baseline health measurements or type of treatment. At first pass, this may sound like a regression problem of the kind discussed in Chapter 3. But there is an important complication ...
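The complication is censoring: for patients still alive when the study ends, we know only a lower bound on their survival time. The chapter builds up the Kaplan-Meier estimator for exactly this situation; a hand-rolled sketch (illustrative data and my own code, not the book's) looks like this:

```python
import numpy as np

# times: observed follow-up times; events: 1 if the event occurred, 0 if censored
times = np.array([2.0, 3.0, 3.0, 5.0, 6.0, 7.0])
events = np.array([1, 1, 0, 1, 0, 1])

def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate: S(t) = prod over t_i <= t of (1 - d_i / n_i)."""
    order = np.argsort(times)
    times, events = times[order], events[order]
    at_risk = len(times)
    surv = 1.0
    curve = {}
    for t in np.unique(times):
        mask = times == t
        d = events[mask].sum()            # events (deaths) at time t
        if d > 0:
            surv *= 1.0 - d / at_risk     # multiply in the conditional survival
        curve[t] = surv
        at_risk -= mask.sum()             # everyone observed at t leaves the risk set
    return curve

print(kaplan_meier(times, events))
```

Censored observations contribute to the risk sets without ever triggering a drop in the curve, which is precisely how the estimator uses the partial information they carry.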
12 Unsupervised Learning (chapter 10 in ISLR)
The new chapter adds "It can also be used as a tool for data imputation — that is, for filling in missing values in a data matrix" near the beginning.
At the end of section 12.2.2 (10.2.2 in ISLR), there is additional explanatory text after the formula 12.5 (10.5 in ISLR). This part has two additional formulas and associated text.
The subsection titled "The Proportion of Variance Explained" within section 10.2.3 "More on PCA" of ISLR gets promoted to section status, becoming section 12.2.3 in ISLPy. The content of that part however remains the same.
Section 10.2.3 "More on PCA" in ISLR becomes 12.2.4 in ISLPy, though the content remains the same (without what used to be the subsection titled "The Proportion of Variance Explained").
Section 10.3 "Clustering Methods" of ISLR gets pushed down to 12.4 in ISLPy, making room for a new topic: 12.3 "Missing Values and Matrix Completion". This is quite a detailed section, and illustrates its techniques using the USArrests dataset.
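The section's approach can be sketched as iterated low-rank approximation: fill missing entries with column means, fit a low-rank SVD approximation, overwrite the missing entries with the fitted values, and repeat. Here is my own illustration on synthetic data (not the book's USArrests code):

```python
import numpy as np

# Build an exactly rank-2 matrix, then knock out ~20% of its entries
rng = np.random.default_rng(0)
U = rng.normal(size=(20, 2))
V = rng.normal(size=(2, 5))
X = U @ V
missing = rng.random(X.shape) < 0.2
Xobs = np.where(missing, np.nan, X)

# Start from column means, then iterate a rank-2 SVD approximation
Xhat = np.where(missing, np.nanmean(Xobs, axis=0), Xobs)
for _ in range(200):
    u, s, vt = np.linalg.svd(Xhat, full_matrices=False)
    approx = (u[:, :2] * s[:2]) @ vt[:2]   # best rank-2 approximation of Xhat
    Xhat[missing] = approx[missing]        # overwrite only the missing cells

print(np.abs(Xhat[missing] - X[missing]).max())  # reconstruction error on missing entries
```

On an exactly low-rank matrix like this one, the imputed values end up far closer to the truth than the column-mean starting point.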
The Lab in ISLPy covers broadly the same topics as ISLR, but is structured differently. The content also differs because of the differences between the R and Python libraries. There is also one new topic: 12.5.2 "Matrix Completion".
13 Multiple Testing (new in ISLPy)
This is a completely new chapter, so there is nothing to compare against. But here is the beginning of the first paragraph of the chapter to give you a feel for what to expect:
Thus far, this textbook has mostly focused on estimation and its close cousin, prediction. In this chapter, we instead focus on hypothesis testing, which is key to conducting inference. We remind the reader that inference was briefly discussed in Chapter 2.
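As a taste of the territory, here is a sketch (illustrative p-values and code of my own, not the book's) of two classical multiple-testing corrections: Bonferroni, which controls the family-wise error rate, and Benjamini-Hochberg, which controls the false discovery rate.

```python
import numpy as np

# Ten illustrative p-values, a few of them small
p = np.array([0.001, 0.009, 0.012, 0.04, 0.06, 0.2, 0.35, 0.5, 0.7, 0.9])
m = len(p)
alpha = 0.05

# Bonferroni: test each hypothesis at level alpha / m
bonferroni = p < alpha / m

# Benjamini-Hochberg: find the largest k with p_(k) <= (k / m) * alpha,
# then reject the k hypotheses with the smallest p-values
order = np.argsort(p)
thresh = (np.arange(1, m + 1) / m) * alpha
below = p[order] <= thresh
k = np.nonzero(below)[0].max() + 1 if below.any() else 0
bh = np.zeros(m, dtype=bool)
bh[order[:k]] = True

print(bonferroni.sum(), bh.sum())  # BH typically rejects more than Bonferroni
```

The gap between the two rejection counts is the chapter's central trade-off: the price of strict family-wise control versus the extra power bought by tolerating a controlled fraction of false discoveries.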