Article 5. Learning Bioinformatics: A Practical Approach
I. Demystifying Bioinformatics Languages: A Guide to Python, Perl, and R for Microbiologists
Bioinformatics, the intersection of biology and computer science, relies heavily on powerful programming tools such as Perl, Python and R to analyse vast amounts of biological data. Perl, Python and R are interpreted scripting languages: they are relatively easy to read and run without a separate compilation step, whereas languages such as C, C++, C# and Java must be compiled before they can run. A comparison of these languages, with the exception of R, was published in 2008 (Fourment and Gillings (2008). A comparison of common programming languages used in bioinformatics. BMC Bioinformatics. doi: 10.1186/1471-2105-9-82).
This guide introduces you to Python, Perl, and R, the programming languages that power many bioinformatics tools. While in-depth details about these languages can be found elsewhere (often aimed at IT professionals), this guide focuses on a practical approach for microbiologists rather than lengthy descriptions.
By installing and using a few bioinformatics tools, you'll gain valuable experience. Searching for specific information online as you go will deepen your understanding. This "learn by doing" approach will equip you with a foundational skill set and build confidence to tackle more complex bioinformatics tasks.
Here's what you can expect in this article. The focus will be on the installation and usage of mamba, a package manager that creates isolated environments in which different bioinformatics tools can be installed without conflict. R and rstudio will be installed in a mamba-created environment. A procedure is also provided for installing several perl versions side by side and switching to a specific one so that a script can be run with it. Future articles will focus on bioinformatics software for genome comparisons (ANI variants including tANI, AAI variants, pocp etc), workflows / pipelines / nextflow (phantasm, bascort, ribap, miga, metawrap, gtotree, GTDB etc) and R-based analysis.
General-Purpose Languages:
Languages for Statistics and Data Visualization:
Choosing the Right Tool:
The ideal language depends on your project's needs:
Combining Languages:
Python and R can work together effectively. Python can handle initial data processing, while R excels at visualisation and statistical analysis. Tools like rpy2 facilitate calling R functions from Python.
Conclusion:
Bioinformatics offers a range of programming language options. Understanding the strengths of Python and R, and of the rstudio interface to R, helps you to select the most suitable tool for your specific needs and projects. While coding skills aren't always mandatory for using bioinformatics tools, understanding the basics of languages like Perl, Python and R empowers users in several ways:
II. Managing Complexity by Customisation: creating virtual environments for installing bioinformatics software with mamba
Installing bioinformatics tools can be finicky! Different tools often require specific versions of programming languages and libraries, which can lead to conflicts and compatibility headaches. To avoid these headaches, virtual environments can become your friend. Tools like mamba can create an isolated environment for each bioinformatics tool. Within each environment, you can install the exact language versions and software packages that the specific tool needs. This keeps everything neatly separated and prevents conflicts with other projects on your system. It's like having a dedicated toolbox for each project, ensuring the right tools are always at hand without any clashes.
Installing mamba as a virtual environment package manager:
Mamba, a rising star among package managers, simplifies the creation and management of virtual environments. Mamba is a re-implementation of conda written in C++ that offers parallel downloading of repository data and package files using multi-threading, and much faster dependency solving through libsolv. Mamba is distributed only through conda-forge, so the base environment must be set up to give the conda-forge channel priority.
- Download the installer from the Miniforge releases page on GitHub (conda-forge/miniforge). For Linux, select the x86_64 (amd64) installation release.
- Once the download has finished, open a terminal.
- Change to your Downloads directory: $cd ~/Downloads
- Run the installer script: $bash Mambaforge-23.11.0-0-Linux-x86_64.sh
- Press Enter to page through the license, then accept it.
- Wait for the installer to finish (at least 30 seconds).
- When asked whether the installer should initialize conda by running conda init, type "yes".
- The mambaforge directory will be created in your home directory ($HOME), i.e. /home/usr/mambaforge, where usr is your login name.
- To activate the base environment, close the terminal and open a new one. If mamba has been installed successfully, you will see (base) in your terminal prompt.
- To deactivate, type "mamba deactivate" or "conda deactivate" and you will return to the normal prompt.
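The installation steps above can also be scripted. The following is only a minimal sketch: the download URL pattern (GitHub releases of conda-forge/miniforge) and the function names installer_name, installer_url and install_mambaforge are assumptions that do not come from this article, so check the releases page before relying on them. It uses the installer's batch options (-b, -p) instead of answering the prompts by hand.

```shell
# Version and platform used in this article; change these for a newer release.
MAMBAFORGE_VERSION="23.11.0-0"
PLATFORM="Linux-x86_64"

# Build the installer file name, e.g. Mambaforge-23.11.0-0-Linux-x86_64.sh
installer_name() {
    printf 'Mambaforge-%s-%s.sh\n' "$1" "$2"
}

# Build the download URL (assumed GitHub release URL pattern).
installer_url() {
    printf 'https://github.com/conda-forge/miniforge/releases/download/%s/%s\n' \
        "$1" "$(installer_name "$1" "$2")"
}

# Download the installer and install into $HOME/mambaforge without prompts.
install_mambaforge() {
    cd "$HOME/Downloads" || return 1
    curl -L -O "$(installer_url "$MAMBAFORGE_VERSION" "$PLATFORM")" || return 1
    # -b: batch mode (the license is taken as accepted), -p: installation directory
    bash "$(installer_name "$MAMBAFORGE_VERSION" "$PLATFORM")" -b -p "$HOME/mambaforge" || return 1
    # Batch mode does not run "conda init", so run it explicitly for bash.
    "$HOME/mambaforge/bin/conda" init bash
}

# Nothing is downloaded until you call: install_mambaforge
```

After calling install_mambaforge, open a new terminal as described above and check that (base) appears in the prompt.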
Installing both R and rstudio in a single isolated environment using mamba
Information on R is available at https://www.r-project.org/ and at https://anaconda.org/conda-forge/r-base/files. In the case of rstudio, you can build a version from scratch (https://github.com/rstudio/rstudio), download a pre-compiled, ready-to-use version (https://posit.co/download/rstudio-desktop/#download) or install a suitable version with the mamba package manager.
At the time of writing (15 April 2024), R 4.3.3 was the latest R version available for installation, but it was downgraded to R 4.3.1 when co-installed with rstudio in the same mamba-created environment due to incompatibility issues between the two. After experimenting with a lot of combinations of R and rstudio, I found that R version 4.3.1 and rstudio version 2022.12.0+355 were the most recent compatible versions that could be co-installed in a mamba-created virtual environment. A recipe for installing these two versions is given below:
1. Installation:
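$mamba create -n R431 --override-channels -y -c conda-forge r-base=4.3.1 conda-forge::rstudio-desktop --file requirements.txt
where:
-n R431 names the new environment
--override-channels ignores the channels set in your conda configuration and uses only those given with -c
-y answers yes to the confirmation prompt
-c conda-forge searches the conda-forge channel
r-base=4.3.1 pins R to version 4.3.1
conda-forge::rstudio-desktop installs rstudio from the conda-forge channel
--file requirements.txt reads further package names from the file requirements.txt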
An example of a requirements.txt file listing R packages under their conda names (r-* format) is given below, but there are also other ways of installing R packages - https://www.dataquest.io/blog/install-package-r/
# With a text editor, create a new requirements.txt file, which is passed to mamba with the --file option
r-hunspell # a spell checker for R
r-spatstat
r-devtools # an R package development tool
r-markdown
r-ape
r-phangorn
r-mass
r-matrix
r-reshape2
r-openmx
r-stringr
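The requirements file can also be produced and tidied from the command line. The sketch below is illustrative only: write_requirements and clean_requirements are hypothetical helper names, the package list is shortened, and stripping the inline comments is a precaution in case the --file parser on your system does not tolerate them.

```shell
# Write a (shortened) package list to the file named in $1,
# one conda package name per line.
write_requirements() {
    cat > "$1" <<'EOF'
# R packages from conda-forge (r-* naming)
r-hunspell   # spell checker
r-devtools   # package development tools
r-ape
r-phangorn
EOF
}

# Print the package names only: drop comments, surrounding spaces and blank lines.
clean_requirements() {
    sed -e 's/#.*$//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# Example (not run here):
#   write_requirements requirements.txt
#   clean_requirements requirements.txt > requirements.clean.txt
#   mamba create -n R431 --override-channels -y -c conda-forge \
#       r-base=4.3.1 conda-forge::rstudio-desktop --file requirements.clean.txt
```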
2. Checking where your R libraries are installed:
#Activate R431 environment:
$mamba activate R431 (if not already activated)
#Start R
$R
# To check the installed R libraries, use the following at the R prompt (indicated by >).
>.libPaths()
You will see the following on your screen. Substitute usr with your user id (which can be obtained using: $whoami).
[1] "/home/usr/mambaforge/envs/R431/lib/R/library"
Note that when rstudio is launched from the active environment, it automatically points to the location where your R library packages are installed, as shown above. However, if you find more than one path, then comment out R_LIBS=….. in the .Renviron file in your home directory, as it could be directing R to the wrong R library.
#Now exit R and return to your active R431 prompt:
>quit()
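The same check can be made without opening an interactive R session. This is a rough sketch under a few assumptions: the R431 environment is active (so that CONDA_PREFIX is set and Rscript, which ships with r-base, is on the PATH), and lib_in_env and check_r_libs are hypothetical helper names.

```shell
# Return success if library path $1 lies inside environment prefix $2.
lib_in_env() {
    case "$1" in
        "$2"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print every R library path and flag those outside the active environment.
check_r_libs() {
    # CONDA_PREFIX is set when a mamba/conda environment is activated.
    if [ -z "$CONDA_PREFIX" ]; then
        echo "no active environment" >&2
        return 1
    fi
    Rscript -e 'writeLines(.libPaths())' | while IFS= read -r path; do
        if lib_in_env "$path" "$CONDA_PREFIX"; then
            echo "ok       $path"
        else
            echo "outside  $path"
        fi
    done
    # Show any R_LIBS setting in ~/.Renviron that could add extra paths.
    if [ -f "$HOME/.Renviron" ]; then
        grep -n '^[[:space:]]*R_LIBS' "$HOME/.Renviron" || echo "no R_LIBS entry in ~/.Renviron"
    fi
}

# Example (not run here): mamba activate R431 && check_r_libs
```

Any path reported as "outside" is a candidate for the R_LIBS problem described above.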
3. Starting rstudio:
$mamba activate R431 (Activate the environment, if not already active)
$rstudio
rstudio is a GUI for R and is much more convenient to use than the R prompt, especially for those who are new to R. You can find a lot of useful information on rstudio on the web, some of it free and some in paid courses, for example https://www.datacamp.com/tutorial/r-studio-tutorial.
In summary:
III. Installing different perl versions using perlbrew
Though perl is not used as widely as python or R, there are many times when perl is required. As is the case with python, many different perl versions have been released over time, and a bioinformatics tool may require a specific perl version for it to work.
1. Check the default perl version that comes with your operating system:
$perl --version
2. Installing perl versions:
Installing a different perl over the one that ships with your operating system can break both perl and the system tools that depend on it. Perlbrew, an admin-free perl installation management tool, is widely used to install additional perl versions in your home directory instead. At the time of writing, the latest version of perlbrew was 0.98. The perlbrew website is https://perlbrew.pl
Perlbrew can be installed to a default location (refer to 2A) or, optionally, to a location selected by the user (refer to 2B). All the subsequent steps from 3 to 9 are common to both installations (2A and 2B).
2A. Download the perlbrew package and install it to the default location in your home directory
#Paste the following to download and install with bash
$curl -L https://install.perlbrew.pl | bash
# Wait for the installation to complete, then activate perlbrew (must be done in each new terminal)
$source /home/usr/perl5/perlbrew/etc/bashrc
NOTE: usr = your login name which can be found with $whoami
2B. Install perlbrew package to a location of your choice (OPTIONAL)
#Deactivate the mamba environment and get back to your normal prompt
$conda deactivate
#Set PERLBREW_ROOT to the location of your choice, then download and install perlbrew. Perlbrew will be installed in the directory given by PERLBREW_ROOT
$export PERLBREW_ROOT=/pvol/opt/perl5/perlbrew
$curl -L https://install.perlbrew.pl | bash
# Wait for the installation to complete, then activate perlbrew (must be done in each new terminal)
$source /pvol/opt/perl5/perlbrew/etc/bashrc
NOTE: /pvol/opt/perl5/perlbrew is an example location; substitute a directory of your choice.
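Because perlbrew has to be activated in every new terminal, some users prefer to add the source line to a shell start-up file. The following is a minimal sketch of doing that only once; enable_perlbrew_at_login is a hypothetical helper name, and the choice of ~/.bashrc as the start-up file is an assumption that depends on your shell set-up.

```shell
# Append "source <perlbrew bashrc>" to a shell start-up file, once only.
# $1: path to perlbrew's bashrc, $2: start-up file (for example ~/.bashrc)
enable_perlbrew_at_login() {
    line="source $1"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "$line" "$2" 2>/dev/null; then
        printf '%s\n' "$line" >> "$2"
    fi
}

# Example (not run here):
#   enable_perlbrew_at_login "$HOME/perl5/perlbrew/etc/bashrc" "$HOME/.bashrc"
```

Running the function a second time leaves the file unchanged, so it is safe to repeat.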
The following are common steps to either 2A or 2B:
3. Install cpanm:
You must install cpanm with perlbrew – if you don’t, weird things can happen when you switch perls and try to install tools.
$perlbrew install-cpanm
4. List all available installable perl versions:
# List all available perl versions
$perlbrew available
# perl
perl-5.38.2
perl-5.36.3
perl-5.34.3
perl-5.32.1
perl-5.30.3
perl-5.28.3
perl-5.26.3
perl-5.24.4
perl-5.22.4
perl-5.20.3
perl-5.18.4
perl-5.16.3
perl-5.14.4
perl-5.12.5
perl-5.10.1
perl-5.8.9
perl-5.6.2
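If you simply want the newest version in this list, it can be picked out automatically. This is a sketch only: latest_perl is a hypothetical helper name, it assumes the output format shown above, and it relies on the version sort (-V) of GNU sort found on Linux.

```shell
# Read "perlbrew available" output on stdin and print the newest perl-x.y.z entry.
latest_perl() {
    # -o: print only the matching text, -w: whole words only
    grep -ow 'perl-[0-9][0-9.]*' | sort -V | tail -n 1
}

# Example (not run here):
#   perlbrew available | latest_perl
```

A plain alphabetical sort would place perl-5.8.9 after perl-5.38.2, which is why a version-aware sort is used.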
5. Select and install a perl version:
There are three different ways of installing a perl version, depending on whether you want a multithreaded or a non-multithreaded perl.
a) Install a non-multithreaded perl version selected from the above list
$perlbrew install perl-5.32.1
b) If a non-multithreaded perl version has already been installed, then a multithreaded build of the same version can be installed alongside it under a different name:
$perlbrew install perl-5.32.1 --as=5.32.1t -Dusethreads
c) If a new multithreaded perl version is to be installed, then use the following command:
$perlbrew install perl-5.38.2 -Dusethreads -Duselargefiles -Dcccdlflags=-fPIC -Doptimize=-O2 -Duseshrplib -Duse64bitall -Darchname=x86_64-linux-gnu -Dccflags=-DDEBIAN --as threaded-perl-5.38.2t
NOTE: Installing a new perl version will take time. You can track the status of the installation by copying the command that perlbrew prints on the screen and pasting it into a new terminal. In the case of 5c:
$tail -f ~/perl5/perlbrew/build-threaded-perl-5.38.2t.log
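Once a build has finished and that perl is in use, you can confirm whether it was built with thread support. The sketch below relies on the standard perl -V:usethreads query, which prints usethreads='define'; for a threaded perl; is_threaded_output and report_threads are hypothetical helper names.

```shell
# Interpret the output of "perl -V:usethreads",
# which looks like: usethreads='define';
is_threaded_output() {
    case "$1" in
        *"usethreads='define'"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report whether the perl currently on the PATH was built with thread support.
report_threads() {
    if is_threaded_output "$(perl -V:usethreads)"; then
        echo "threaded perl"
    else
        echo "non-threaded perl"
    fi
}

# Example (not run here): report_threads
```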
6. List all the perl versions installed on your computer
$perlbrew list
The following perl versions have been installed on my computer. An asterisk (*) marks the version currently in use.
threaded-perl-5.39.4
perl-5.38.0
perl-5.36.3
threaded-perl-5.36.1
perl-5.36.0
* perl-5.22.0
7. Select the perl version you wish to use
$perlbrew switch threaded-perl-5.39.4
8. Check the perl version now in use
$perl --version
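Steps 6 to 8 can also be checked from a script by reading the asterisk in the perlbrew list output. This is a small sketch that assumes the output format shown in step 6; current_perl is a hypothetical helper name.

```shell
# Read "perlbrew list" output on stdin and print the version marked with "*".
current_perl() {
    awk '$1 == "*" { print $2 }'
}

# Example (not run here):
#   perlbrew list | current_perl
```

If nothing is printed, no perlbrew-managed perl is active and the system perl is in use.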
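9. Deleting all perl versions installed by perlbrew
All perl versions are installed in ~/perl5/perlbrew/perls/ (or in the perls directory under PERLBREW_ROOT if you followed 2B). Simply delete the perl5 directory to remove all perl versions installed with perlbrew; note that this also removes perlbrew itself.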
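Before deleting anything, it can help to see exactly which installations are present. The following is a cautious sketch that only lists them and deletes nothing; installed_perl_dirs is a hypothetical helper name and the default perlbrew location is assumed in the example.

```shell
# Print the names of the perl installations found under a perlbrew root ($1).
installed_perl_dirs() {
    for dir in "$1"/perls/*/; do
        if [ -d "$dir" ]; then
            basename "$dir"
        fi
    done
    return 0
}

# Example (not run here):
#   installed_perl_dirs "$HOME/perl5/perlbrew"
```

To remove a single version rather than everything, perlbrew also provides an uninstall command, for example perlbrew uninstall perl-5.32.1.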