Python vs R: An Introduction to Statistical Learning

A Python version of the well-known book An Introduction to Statistical Learning was released in mid-2023, with little of the fanfare you would expect at the arrival of something so many had wanted for so long.

Even when the R version was the only one available, its fame transcended language borders, and Python users went so far as to create Python versions of the book's R code and exercises. One such labor of love can be seen in this GitHub repository, and there are others too. But the publication of an official version changes everything, and many of the book's admirers will want to know whether they should get a copy now.

This article compares the two versions and is intended to help Python users decide whether they should get a copy of the new book. It is not a review of the book, for the simple reason that I am not competent to review it. Note also that, like the R versions, the Python version can be downloaded free as a PDF.

In the rest of this article I use abbreviations for the two versions: "ISLPy" refers to the Python version, while "ISLR" refers to the R version. I compare the Python version with the original (not the 2nd) edition of the R version, as that is the edition most people seem to mention.

Under the Hood: scikit-learn, statsmodels, SciPy, ...

As expected, all code and exercises in ISLPy depend on scikit-learn, statsmodels, SciPy, and the usual other Machine Learning (ML henceforth) libraries. But, to keep the book's text and code as similar as possible to ISLR's, a wrapper that hides the names of these underlying ML libraries is used (see the ISLP package). This is helpful for those who are already familiar with ISLR, but may be an impediment for others.
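To give a sense of that style, here is a minimal sketch of how the book's labs combine the ISLP package with statsmodels. The function names (load_data, ModelSpec, summarize) are the ones the package exposes as far as I can tell; treat the snippet as illustrative rather than as code from the book.

```python
# A minimal illustrative sketch (not the book's code) of the ISLP lab style:
# load a dataset, build a design matrix, and fit a linear model with statsmodels.
import statsmodels.api as sm
from ISLP import load_data                      # dataset loader from the ISLP package
from ISLP.models import ModelSpec as MS, summarize

Boston = load_data('Boston')                    # returns a pandas DataFrame
X = MS(['lstat']).fit_transform(Boston)         # design matrix with an intercept column
y = Boston['medv']
results = sm.OLS(y, X).fit()
print(summarize(results))                       # compact coefficient table
```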

If you just want to learn about Python ML libraries and techniques, then ISLPy may not be a good purchase for you. On the other hand, if you are seeking to deepen your understanding of statistical concepts while learning practical Python ML techniques at the same time, then this book is certainly worth considering. ISLPy will also be a valuable acquisition if you have already read ISLR and have some familiarity with Python and its ML libraries, or are planning to migrate to Python.

Comparing Chapter Titles

ISLPy (the new Python version) has 13 chapters, while ISLR (the original R edition) had 10. The newly added chapters are:

  • 10 Deep Learning
  • 11 Survival Analysis and Censored Data
  • 13 Multiple Testing

Unsupervised Learning, which was chapter 10 in ISLR, is now chapter 12 in ISLPy.

Chapter-by-Chapter Changes

The text within preexisting chapters also has small changes, and the following sections highlight the most significant differences.

1 Introduction

The chapter's content (examples, figures, text) is almost identical to that in ISLR.

There are minor differences in the text after the heading A Brief History of Statistical Learning. The new ISLPy does not name the people that the ISLR credited for the development of the various methods. It also mentions the advent of neural networks and support vector machines in the 1980s and 90s. The use of Python is mentioned here for the first time.

The text under the heading This Book is also virtually unchanged. However, the last part mentions the R history of the book through two editions (2013 and 2021), and how the increasing popularity of Python led to the publication of ISLPy. It also mentions the Python ISLP package used with the book's examples and exercises.

In ISLR, the text under the heading Who Should Read This Book contained, among other things, this line: "Previous exposure to a programming language, such as MATLAB or Python, is useful but not required." ISLPy returns the compliment: "Previous exposure to a programming language, such as MATLAB or R, is useful but not required"!

2 Statistical Learning

There are very few changes in the body of the chapter. The role of Deep Learning (chapter 10 of ISLPy) is mentioned in a few places (e.g. Figure 2.7).

But the Lab (titled Introduction to Python) is, obviously, completely different. It occupies 22 pages (ISLR needed just 9) and covers a diverse range of topics, including Python 3 basics, Google Colaboratory, Jupyter, NumPy, pandas, and Matplotlib.
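To give a flavor of the ground that introductory lab covers, here is a tiny snippet of my own (not the book's code) touching NumPy, pandas, and Matplotlib:

```python
# Illustrative only: the kind of NumPy/pandas/Matplotlib basics the lab introduces.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=100)                              # random draws in a NumPy array
df = pd.DataFrame({'x': x, 'y': 2 * x + rng.normal(size=100)})
print(df.describe())                                  # quick numeric summary

fig, ax = plt.subplots()
ax.scatter(df['x'], df['y'])                          # simple scatter plot
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()
```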

The Conceptual part of the Exercises appears to be identical to that in the ISLR. The problems in the Applied part are also the same, but changes have been made to accommodate the Python ecosystem's way of achieving the same (or similar) effects.

3 Linear Regression

The content of the chapter (titles, subtitles, paragraphs, and figures) is substantially the same. But the structure of many sentences has been changed to make the meaning clearer.

The Lab has the same structure, and covers the same topics.

The Conceptual part of the Exercises appears to be identical to that in the ISLR. The problems in the Applied part are also the same, but changes have been made to accommodate the Python ecosystem's way of achieving the same (or similar) effects.

4 Classification

The structure and content of the chapter remain similar, but there are some differences:

  • The summary of section 4.2 has been expanded to make it clearer.
  • The title of section 4.3.5 (which in ISLR was Logistic Regression for >2 Response Classes) has been changed to Multinomial Logistic Regression. The body of this section in ISLPy is much longer, and explains how Logistic Regression can be extended to handle more than 2 classes (multinomial logistic regression); see the first sketch after this list. In ISLR, this part merely stated that R included Logistic Regression extensions that could handle more than 2 classes, but that other algorithms (covered later in the chapter) were more popular.
  • The title of section 4.4 (which in ISLR was Linear Discriminant Analysis) has been changed to Generative Models for Classification, though the immediately following text is substantially the same.
  • After the 3 bullet points in section 4.4, ISLR had a subsection (4.4.1) titled Using Bayes’ Theorem for Classification. But this subsection title is missing in ISLPy at this point. This could be an unintended omission, because the following text in ISLPy is substantially the same as in ISLR, and includes mentions of Bayes’ Theorem and the Bayes Classifier. The last part of this subsection in ISLPy has this additional paragraph: In the following sections, we discuss three classifiers that ... approximate the Bayes classifier: linear discriminant analysis, quadratic discriminant analysis, and naive Bayes.
  • New section 4.4.4 Naive Bayes. Occupying nearly 4 complete pages of text, including one figure, this additional material should go a long way in augmenting the reader's understanding of the Naive Bayes classifier (the classifier also appears in the first sketch after this list). It is possible that this subsection is intended to balance the removal of the subsection title mentioned above.
  • ISLPy divides section 4.5, A Comparison of Classification Methods, into two subsections. The first subsection (4.5.1), An Analytical Comparison, appears to contain a much expanded and rewritten version of the early part of ISLR's section 4.5. The second subsection (4.5.2), An Empirical Comparison, contains the text and figures in the later part of ISLR's section 4.5.
  • New section 4.6 Generalized Linear Models. This part of ISLPy is completely new, and includes the following subsections: 4.6.1 Linear Regression on the Bikeshare Data, 4.6.2 Poisson Regression on the Bikeshare Data, and 4.6.3 Generalized Linear Models in Greater Generality. The Bikeshare dataset is new in ISLPy (a Poisson-regression sketch follows this list).
  • The Lab has almost the same sections as in ISLR, with these differences: there is a new section for Naive Bayes, and the last section, An Application to Caravan Insurance Data, has been replaced by Linear and Poisson Regression on the Bikeshare Data.
  • The Exercises' Conceptual part has 3 extra exercises: 10, 11, and 12. In the Applied part, exercise 14 has an extra assignment based on Naive Bayes.
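For readers who have not met these classifiers in Python, here is a minimal scikit-learn sketch of multinomial logistic regression and Gaussian naive Bayes on a 3-class toy dataset. It is my own illustration and not taken from the book's lab.

```python
# Illustrative sketch (not the book's lab code): multinomial logistic regression
# and Gaussian naive Bayes on a 3-class toy dataset with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Multinomial logistic regression: one coefficient vector per class,
# with class probabilities produced by the softmax function.
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('logistic regression accuracy:', logit.score(X_test, y_test))

# Naive Bayes: assumes the features are independent within each class.
nb = GaussianNB().fit(X_train, y_train)
print('naive Bayes accuracy:', nb.score(X_test, y_test))
```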
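Likewise, a Poisson regression of the kind fitted to the Bikeshare data can be expressed with statsmodels. The snippet below uses synthetic count data, since the Bikeshare dataset itself ships with the ISLP package; again it is an illustration, not the book's code.

```python
# Illustrative sketch (not the book's code): a Poisson GLM with statsmodels,
# using synthetic count data in place of the book's Bikeshare dataset.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
temp = rng.uniform(0, 30, size=500)                   # a made-up 'temperature' feature
lam = np.exp(0.5 + 0.05 * temp)                       # true mean on the log scale
rentals = rng.poisson(lam)                            # counts, standing in for bike rentals

X = sm.add_constant(temp)
poisson_fit = sm.GLM(rentals, X, family=sm.families.Poisson()).fit()
print(poisson_fit.summary())                          # coefficients on the log scale
```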

5 Resampling Methods

The content of the chapter (titles, subtitles, paragraphs, and figures) is substantially the same as in ISLR.

The Lab is unchanged, and has exactly the same sections as in ISLR.

The Exercises (Conceptual and Applied parts) are unchanged.

6 Linear Model Selection and Regularization

The content of the chapter is almost unchanged, although some sentences have been reworked to express their meaning better.

The Lab has a different structure from ISLR (although similar in topic content) because of differences in the R and Python libraries.

The Conceptual part of Exercises is unchanged. The Applied part is also essentially unchanged, but small adjustments have been made because of differences between the R and Python libraries.

7 Moving Beyond Linearity

The content of the chapter is virtually unchanged, although some paragraphs and sentences have been reworked to express their meaning better.

The Lab has a different structure from ISLR (although similar in topic content) because of differences in the R and Python libraries.

The Conceptual part of Exercises is unchanged. The Applied part is also essentially unchanged, but small adjustments have been made because of differences between the R and Python libraries.

8 Tree-Based Methods

There are two extra subsections: 8.2.4, Bayesian Additive Regression Trees (over 3 pages), and a very short 8.2.5, Summary of Tree Ensemble Methods. But the rest of the chapter is virtually unchanged.

The Lab has an extra subsection for Bayesian Additive Regression Trees. But it is otherwise similar in topic content, with some differences caused by differences in the R and Python libraries.

The Conceptual part of Exercises is similar except for the BART (Bayesian Additive Regression Trees) additions. The Applied part is also similar, but has BART (Bayesian Additive Regression Trees) additions as well as other small adjustments to accommodate differences between the R and Python libraries.

9 Support Vector Machines

The content of the chapter is virtually unchanged, with very minor changes to the text of ISLR.

The Lab is unchanged, and has exactly the same sections as ISLR. The content does however differ because of the differences between the R and Python libraries.

The Exercises (Conceptual and Applied parts) are unchanged.

10 Deep Learning (new in ISLPy)

This is a completely new chapter. ISLR did not cover the material in this chapter, so there is nothing to compare against. Some Python ML books use scikit-learn's MLPRegressor and MLPClassifier classes to illustrate deep learning in code. But this book uses PyTorch instead, making it more valuable as a resource for learning practical techniques.
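For orientation only, a bare-bones PyTorch model looks something like the sketch below. This is my own minimal example of the style, not code from the book's lab.

```python
# Minimal PyTorch sketch (not the book's lab code): a small fully connected
# network trained by gradient descent on random data.
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(200, 10)                              # 200 samples, 10 features
y = torch.randn(200, 1)                               # continuous response

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)                       # forward pass and loss
    loss.backward()                                   # backpropagation
    optimizer.step()                                  # parameter update
print('final training loss:', loss.item())
```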

11 Survival Analysis and Censored Data (new in ISLPy)

This is a completely new chapter, so there is nothing to compare against. A paragraph near the beginning of the chapter is quoted below to give readers an idea of what to expect:

For example, suppose that we have conducted a five-year medical study, in which patients have been treated for cancer. We would like to fit a model to predict patient survival time, using features such as baseline health measurements or type of treatment. At first pass, this may sound like a regression problem of the kind discussed in Chapter 3. But there is an important complication ...
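The central objects here are survival times that may be censored. As a quick illustration (my own, using the lifelines package; the book's lab may be organized around different tooling), a Kaplan-Meier estimate of a survival curve looks like this:

```python
# Illustrative only: a Kaplan-Meier survival curve with the lifelines package.
# 'durations' are follow-up times; 'observed' is 1 if the event occurred and
# 0 if the observation was censored (for example, the study ended first).
from lifelines import KaplanMeierFitter

durations = [5, 6, 6, 2.5, 4, 4, 3, 1, 5, 5]
observed  = [1, 0, 0, 1, 1, 1, 0, 1, 1, 0]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.survival_function_)                         # estimated survival probabilities
```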

12 Unsupervised Learning (chapter 10 in ISLR)

The new chapter adds "It can also be used as a tool for data imputation — that is, for filling in missing values in a data matrix" near the beginning.

At the end of section 12.2.2 (10.2.2 in ISLR), there is additional explanatory text after the formula 12.5 (10.5 in ISLR). This part has two additional formulas and associated text.

The subsection titled "The Proportion of Variance Explained" within section 10.2.3 "More on PCA" of ISLR gets promoted to section status, becoming section 12.2.3 in ISLPy. The content of that part however remains the same.

Section 10.2.3 "More on PCA" in ISLR becomes 12.2.4 in ISLPy, though the content remains the same (without what used to be the subsection titled "The Proportion of Variance Explained").

Section 10.3 "Clustering Methods" of ISLR gets pushed down to 12.4 in ISLPy, making place for a new topic: 12.3 "Missing Values and Matrix Completion". This is quite a detailed section, and illustrates its techniques using the USArrests dataset.

The Lab in ISLPy covers broadly the same topics as ISLR, but is structured differently. The content also differs because of the differences between the R and Python libraries. There is also one new topic: 12.5.2 "Matrix Completion".

13 Multiple Testing (new in ISLPy)

This is a completely new chapter, so there is nothing to compare against. But here is the beginning of the chapter's first paragraph to give you a feel for what to expect:

Thus far, this textbook has mostly focused on estimation and its close cousin, prediction. In this chapter, we instead focus on hypothesis testing, which is key to conducting inference. We remind the reader that inference was briefly discussed in Chapter 2.
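To hint at the kind of computation the chapter builds toward, here is a Benjamini-Hochberg false discovery rate adjustment using statsmodels. This is my own illustration on simulated data, not code from the book's lab.

```python
# Illustrative only: adjusting many p-values with the Benjamini-Hochberg
# procedure to control the false discovery rate.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
data = rng.normal(size=(100, 30))                     # 100 variables, 30 observations each
data[:10] += 0.8                                      # only the first 10 have a nonzero mean
pvals = np.array([stats.ttest_1samp(row, 0).pvalue for row in data])

reject, pvals_adj, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
print('hypotheses rejected after FDR control:', reject.sum())
```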

