How to Perform Bayesian Linear Regression in Python + R
In the previous edition of Data Science Code in Python + R we walked through how to build your own data set with spotifyr. This week we will be performing Bayesian linear regression with this dataset.
One of the biggest challenges to getting started with Bayesian regression is setting up your environment. DataCamp-style courses on Bayes tend to gloss over this issue. But before we launch into that, let's start with a little motivation. Why on earth would you go to all the trouble of installing Stan and a C++ toolchain when you can simply use ordinary classical linear regression?
Why go to the trouble of Bayesian regression
There are several potential reasons for using Bayesian approaches, but one is that you can estimate the probability distributions for your model parameters. This is something that the classical linear regression approach does not give you, since it assumes that these parameters are fixed and your data are random. The Bayesian approach takes the opposite view: the data you have observed are fixed and the parameters are random, with your priors describing what you believe about the parameters before seeing the data. This leads to different abilities. Of particular interest to me is the ability to estimate the uncertainty in your parameters. This can be very useful for decision making, where you can estimate your chances of being correct in your estimates of model parameters. The other is the ability to extend your data simulation skills.
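To make "a probability distribution for a parameter" concrete, here is a minimal sketch using only the Python standard library. It is not the Stan workflow used later in this article. It assumes a deliberately simplified model (a line through the origin, a known noise standard deviation, and a Normal prior on the slope), because in that special case the posterior has a closed form. The function name slope_posterior, the simulated data, and the prior settings are all illustrative assumptions.

```python
import random
from statistics import NormalDist


def slope_posterior(x, y, sigma=1.0, prior_mean=0.0, prior_sd=10.0):
    """Posterior of beta in y = beta * x + Normal(0, sigma) noise.

    Assumes sigma is known and beta has a Normal(prior_mean, prior_sd) prior,
    so the posterior is also Normal (the conjugate case).
    """
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = sum(xi * xi for xi in x) / sigma ** 2
    post_precision = prior_precision + data_precision
    weighted_sum = sum(xi * yi for xi, yi in zip(x, y)) / sigma ** 2
    post_mean = (prior_precision * prior_mean + weighted_sum) / post_precision
    return NormalDist(mu=post_mean, sigma=post_precision ** -0.5)


# Simulated data with a true slope of 1.5 (an assumption for illustration only).
random.seed(1)
x = [random.uniform(-2, 2) for _ in range(50)]
y = [1.5 * xi + random.gauss(0, 1.0) for xi in x]

posterior = slope_posterior(x, y)

# A 95% credible interval and the posterior probability that the slope is positive.
lower, upper = posterior.inv_cdf(0.025), posterior.inv_cdf(0.975)
prob_positive = 1.0 - posterior.cdf(0.0)

print(f"posterior mean: {posterior.mean:.3f}")
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
print(f"P(slope > 0): {prob_positive:.4f}")
```

The last two numbers are the kind of statement a Bayesian fit lets you make directly: an interval that contains the parameter with a stated probability, and the chance that an effect has a given sign. Real models rarely have a closed form like this, which is why samplers such as Stan exist.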
Bayesian regression with Stan
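So today we are just going to get you started, now that you might have a little more motivation to learn more. Our goals are simply to set up an environment for Bayesian regression and then fit a first Bayesian linear regression, once in R and once in Python.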
On top of getting your environment ready for Bayesian regression you may also need to learn a new programming language. The overwhelming majority of recommendations I have found are to use Stan, primarily because of its popularity and community support. Stan is a language developed by Andrew Gelman and collaborators. The documentation and installation instructions can be found here.
If you don't fancy launching into learning a new language the R code utilises a package that has abstracted the Stan language away from the user. This will allow you to get started with building Bayesian models quickly. The Python code example will show you how simple linear regression can be performed using Stan.
Installation
You need to configure your R installation to be able to compile C++ code. Follow the links below for your respective operating system for more instructions:
To work with PyStan we also need to install a C++ compiler: gcc ≥9.0 or clang ≥10.0. The first caveat is that, as mentioned in the documentation, installing PyStan on a Windows machine is challenging, so please don't waste your time trying. If you are using Windows, my recommendation is to set up Windows Subsystem for Linux (WSL). This video from Learn Stan with Ric will run you through the entire process, including gcc installation.
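Before installing PyStan it can save time to confirm that a suitable compiler is actually on your PATH. The sketch below is one hedged way to do that from Python. The helper names parse_major and check_compilers are made up for this example, and the version parsing is a simple heuristic that takes the first dotted number in the output of the compiler's --version flag, so treat the result as a hint rather than a guarantee.

```python
import re
import shutil
import subprocess

# Minimum major versions quoted above for PyStan.
MIN_MAJOR = {"gcc": 9, "clang": 10}


def parse_major(version_output):
    """Return the major version from `--version` output, or None if not found.

    Heuristic: uses the first dotted number that appears in the text.
    """
    match = re.search(r"(\d+)\.\d+", version_output)
    return int(match.group(1)) if match else None


def check_compilers():
    """Report, for each known compiler, whether it is present and new enough."""
    report = {}
    for name, minimum in MIN_MAJOR.items():
        path = shutil.which(name)
        if path is None:
            report[name] = "not found"
            continue
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, check=False
        )
        major = parse_major(result.stdout)
        if major is None:
            report[name] = "found, version unknown"
        elif major >= minimum:
            report[name] = f"ok (major version {major})"
        else:
            report[name] = f"too old (major version {major}, need {minimum}+)"
    return report


for compiler, status in check_compilers().items():
    print(f"{compiler}: {status}")
```

Only one of the two compilers needs to report ok. On macOS the gcc command is often an alias for Apple clang, so the reported number there may be a clang version.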
Following the PyStan 3.6.0 documentation, you can install it with pip:
python3 -m pip install pystan
Installing RStanArm should be far simpler and I recommend you install the current CRAN version following the instructions in the Stan documentation.
The last tip, if you are working in Windows, is to launch VSCode from WSL by running the terminal command "code ." (the word code followed by a space and a dot, which opens the current directory). This should install and open a VSCode IDE within your WSL virtual Linux machine. An alternative is to install and launch Jupyter lab, but I found running Bayesian simulations with PyStan to be simpler directly in VSCode.
Performing Simple Bayesian Linear Regression in R
Once you have completed the installations you should be ready to attempt your first Bayesian regression. To begin gently we will use the RStanArm package which is a high level API for Stan using standard R functions. In the code below we will import the spotifyr data we generated in the previous edition of Data Science Code in Python + R. As you can see the syntax is very similar to a standard linear regression.
To perform the same task using PyStan in a Jupyter notebook we will need to do quite a bit more work.
Bayesian Linear Regression Using PyStan on a Jupyter Lab Notebook
The first step is simple enough.
But in the next step we need to import stan and set up the code using Stan syntax.
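As a rough illustration of what "setting up the code using Stan syntax" involves, here is a minimal sketch of a simple linear regression written as a Stan program inside a Python string, together with the data dictionary that Stan expects. The variable names (N, x, y, alpha, beta, sigma), the helper make_stan_data, and the simulated data are assumptions for this example and are not the columns of the spotifyr dataset. The program sets no explicit priors, so Stan uses flat priors on alpha, beta and sigma. RStanArm applies weakly informative default priors instead, so the two approaches should give similar but not identical results.

```python
import random

# Stan program for y = alpha + beta * x + Normal(0, sigma) noise.
# The data block declares the inputs, the parameters block declares what is
# estimated, and the model block states the likelihood.
stan_code = """
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
}
"""


def make_stan_data(x, y):
    """Build the dictionary whose keys match the names in the Stan data block."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return {"N": len(x), "x": list(x), "y": list(y)}


# Simulated stand-in data (hypothetical; replace with your own two columns).
random.seed(42)
x = [random.uniform(0, 1) for _ in range(100)]
y = [0.5 + 2.0 * xi + random.gauss(0, 0.3) for xi in x]

stan_data = make_stan_data(x, y)
print(stan_data["N"], "observations ready for Stan")
```

The key point is that every name in the data block must appear as a key in the dictionary, with matching sizes, otherwise the model will fail when it is built.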
Please refer to the Stan documentation for more details about the Stan language. ChatGPT also does a great job of translating RStanArm code into Stan so you might want to try that approach!
The next thing you need to do is compile the model. As you can see below, we need to import nest_asyncio to run Bayesian regression from within a Jupyter lab notebook. I've used the interactive jupyter window within VSCode for the example below.
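The compile-and-sample step can be sketched as follows. This assumes PyStan 3 together with the nest_asyncio and pandas packages, and it is a sketch of the general pattern rather than the exact code behind this article. The function name fit_linear_model and its default settings are illustrative choices. The imports sit inside the function so that the file can still be loaded on a machine where PyStan is not installed yet.

```python
def fit_linear_model(program_code, data, num_chains=4, num_samples=1000, seed=1):
    """Compile a Stan program, draw posterior samples, and summarise them.

    program_code: a Stan program as a string.
    data: a dict whose keys match the program's data block.
    Returns a pandas DataFrame with one row per sampled quantity.
    """
    # Imported lazily (an assumption of this sketch) so that loading this
    # file does not require PyStan.
    import nest_asyncio
    import stan

    # PyStan 3 uses asyncio internally. Jupyter already runs an event loop,
    # so nest_asyncio patches it to allow PyStan to run inside a notebook.
    nest_asyncio.apply()

    # Compile the model (this can take a minute the first time).
    posterior = stan.build(program_code, data=data, random_seed=seed)

    # Run the sampler.
    fit = posterior.sample(num_chains=num_chains, num_samples=num_samples)

    # One column per parameter plus sampler diagnostics, one row per draw.
    draws = fit.to_frame()

    # Transpose describe() so each parameter is a row with a "mean" column.
    return draws.describe().T
```

Calling the function with a Stan program string and a matching data dictionary returns a summary table. The rows for the model parameters hold the posterior means and spreads, while rows whose names end in a double underscore are sampler diagnostics rather than parameters.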
If you look at the mean column in the resulting summary_df above you will notice that these parameters are very close to those estimated using RStanArm, but with a lot of additional work!
Summary
If you just want to try Bayesian regression out, I highly recommend the RStanArm package in R. It will get you up and running far quicker than the Python route. If you want to try out the Python example, be prepared for some false starts and a bit of additional pain, especially if you are working in Windows. Next week we will discuss post-simulation checks and making predictions with our Bayesian model. If you made it this far, however, take a well-earned break and we'll see you next week!
If you enjoyed this edition of Data Science Code in Python + R, please like, subscribe and share with your friends who are interested in Bayesian regression.
~ Matt
Awesome work, but I believe it would be more valuable, especially for beginners, to provide all the available steps in R with pure MCMC and not with the Stan software. Also, Bayesian model selection within linear regression is a very hot topic, but very challenging.