Article 5. Learning Bioinformatics: A Practical Approach

I. Demystifying Bioinformatics Languages: A Guide to Python, Perl, and R for Microbiologists

Bioinformatics, the intersection of biology and computer science, relies heavily on programming tools such as Perl, Python and R to analyse vast amounts of biological data. Perl, Python and R are interpreted scripting languages: they are easy to read and scripts run without a separate compilation step, whereas C, C++, C# and Java are compiled before they run. A comparison of these languages, with the exception of R, was published in 2008 (Fourment and Gillings (2008). A comparison of common programming languages used in bioinformatics. BMC Bioinformatics 9:82. doi: 10.1186/1471-2105-9-82).

This guide introduces you to Python, Perl, and R, the programming languages that power many bioinformatics tools. While in-depth details about these languages can be found elsewhere (often aimed at IT professionals), this guide focuses on a practical approach for microbiologists rather than lengthy descriptions.

By installing and using a few bioinformatics tools, you'll gain valuable experience. Searching for specific information online as you go will deepen your understanding. This "learn by doing" approach will equip you with a foundational skill set and build confidence to tackle more complex bioinformatics tasks.

Here's what you can expect in this article. The focus is on installing and using mamba, a package manager that creates isolated environments in which different bioinformatics tools can be installed without conflict. R and rstudio will be installed in a mamba-created environment. A procedure is also provided for installing different perl versions with perlbrew and switching to a specific version so that a script can be run with it. Future articles will focus on bioinformatics software for genome comparisons (ANI variants including tANI, AAI variants, pocp etc), workflows / pipelines / nextflow (phantasm, bascort, ribap, miga, metawrap, gtotree, GTDB etc) and R-based analysis.

General-Purpose Languages:

  • Python (released in 1991): A versatile, high-level language known for its readability and ease of use. Popular in web development, data science (including data analysis and machine learning), and rapid application development. Libraries like Pandas, NumPy, and scikit-learn empower data manipulation and machine learning tasks.
  • Perl (Practical Extraction and Report Language, released in 1987): While its popularity has waned, Perl's ability to handle text-based biological data (common in bioinformatics) remains valuable. It excels at tasks like manipulating long sequences (DNA, proteins) and parsing various text formats (HTML, JSON).
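
As a small illustration of the sequence handling mentioned above, here is a minimal Python sketch that computes the reverse complement and GC content of a DNA string using only the standard library. The function names and the example sequence are my own illustrative choices and do not come from any particular bioinformatics package.

```python
# Minimal sketch: basic DNA string handling with the Python standard library.
# Function names and the example sequence are illustrative assumptions.

# Translation table mapping each base to its complement (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


if __name__ == "__main__":
    dna = "ATGCGC"
    print(reverse_complement(dna))  # GCGCAT
    print(round(gc_content(dna), 3))
```

Real projects would normally rely on an established library for this, but the sketch shows how little code is needed for simple text-based sequence tasks.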

Languages for Statistics and Data Visualization:

  • R (released in 2000): A powerful language specifically designed for statistical computing and graphics. It boasts a vast collection of specialised packages (like dplyr, tidyr, caret) for data analysis, visualisation, and statistical modelling. However, R is used from a Command Line Interface (CLI) and a lot of syntax has to be learnt, which can intimidate those who are new to coding.
  • rstudio (released in 2011): rstudio, an integrated development environment (IDE), provides a user-friendly and convenient Graphical User Interface (GUI) for writing, running, and debugging R code. R has to be installed before rstudio so that the two can be integrated. Beginners will find rstudio more convenient than plain R. Unfortunately, rstudio is not updated as regularly as R, and the resulting conflict and compatibility issues have to be resolved during installation. More on this below: “Installing both R and rstudio in a single isolated environment using mamba”.

Choosing the Right Tool:

The ideal language depends on your project's needs:

  • For general-purpose bioinformatics scripting and text manipulation: Consider Python or Perl.
  • For in-depth statistical analysis, visualisation, and development of statistical software: R is the go-to choice. rstudio facilitates the use of R for beginners.

Combining Languages:

Python and R can work together effectively. Python can handle initial data processing, while R excels at visualisation and statistical analysis. Tools like rpy2 facilitate calling R functions from Python.
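
Besides rpy2, a simple way to combine the two languages is to let Python do the initial processing and write a table that R then reads. The sketch below is one hedged example of that hand-over: it parses FASTA-formatted text and produces a tab-separated summary. The function names, column names and example records are hypothetical.

```python
# Minimal sketch: Python prepares a tab-separated table for later analysis in R.
# Function names, column names and the example records are illustrative assumptions.
import csv
import io


def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into {sequence_id: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the identifier;
            # fall back to a numbered name if the header is empty.
            words = line[1:].split()
            current_id = words[0] if words else f"seq{len(records) + 1}"
            records[current_id] = ""
        elif current_id is not None:
            records[current_id] += line
    return records


def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def to_tsv(records: dict) -> str:
    """Return a tab-separated table with one row per sequence."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "length", "gc"])
    for seq_id, seq in records.items():
        writer.writerow([seq_id, len(seq), f"{gc_fraction(seq):.3f}"])
    return buffer.getvalue()


if __name__ == "__main__":
    example = ">s1 first example\nATGC\nGG\n>s2\nAAAA\n"
    print(to_tsv(parse_fasta(example)), end="")
```

The resulting text can be saved to a file and loaded in R with read.delim() for plotting or statistical analysis.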

Conclusion:

Bioinformatics offers a range of programming language options. Understanding the strengths of Perl, Python, R, and rstudio helps you select the most suitable tool for your specific needs and projects. While coding skills aren't always mandatory for using bioinformatics tools, knowing the basics of these languages helps users in several ways:

  • Installation: These languages are often used to develop bioinformatics software. Grasping the core functionalities helps users navigate installation processes and troubleshoot potential issues.
  • Customisation: Many tools offer customisation options through scripts written in these languages. Basic understanding allows users to tailor the software to their specific needs.
  • Error identification: Scripting errors can cause unexpected behaviour in bioinformatics tools. Recognising syntax or logic errors becomes easier with some programming knowledge.


II. Managing Complexity by Customisation: creating virtual environments for installing bioinformatics software with mamba

Installing bioinformatics tools can be finicky! Different tools often require specific versions of programming languages. This can lead to conflict and compatibility headaches. To avoid these headaches, virtual environments can become your friend. Tools like mamba can create isolated environments for each bioinformatics tool. Within each environment, you can install the exact language versions and software packages that specific tool needs. This keeps everything neatly separated and prevents conflicts with other projects on your system. It's like having a dedicated toolbox for each project, ensuring the right tools are always at hand without any clashes.


Installing mamba as a virtual environment package manager:

Mamba, a rising star among package managers, simplifies the creation and management of virtual environments. Mamba is a re-implementation of conda written in C++ that offers the following benefits: parallel downloading of repository data and package files using multi-threading, and much faster dependency solving through libsolv. Mamba is distributed only through conda-forge, and the conda-forge channel must be given priority in the base environment.

  • Go to https://github.com/conda-forge/miniforge/releases
  • For Linux, select and click the x86_64 (amd64) installer release to download it.
  • Once the download has finished, open a terminal.
  • Change directory to your Downloads directory: $cd ~/Downloads
  • Run the installer bash file: $bash Mambaforge-23.11.0-0-Linux-x86_64.sh
  • Press Enter to page through the license, then accept the license.
  • Wait for the installer to finish (at least 30 seconds).
  • When asked whether you want the installer to initialize conda by running conda init, type “yes”.
  • The mambaforge directory will be installed in /home/usr/mambaforge (i.e. $HOME/mambaforge).
  • To activate the mamba environment, close the terminal and open a new one. If mamba has been installed successfully, then you will see (base) in your terminal prompt.
  • To deactivate, type “mamba deactivate” or “conda deactivate” and you will return to the normal prompt.
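
If you want to confirm from a script that the installation steps above put the tools on your PATH, the following Python sketch is one way to do it. It only uses shutil.which from the standard library. The function name is hypothetical, and whether the tools are found depends on how your shell was initialised.

```python
# Minimal sketch: check whether mamba and conda executables are visible on PATH.
# The function name is an illustrative assumption.
import shutil


def find_tools(names=("mamba", "conda")) -> dict:
    """Map each tool name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

A result of "not found on PATH" in a freshly opened terminal usually means the shell initialisation step was skipped or the terminal was not restarted.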


Installing both R and rstudio in a single isolated environment using mamba

Information on R is available at https://www.r-project.org/ and https://anaconda.org/conda-forge/r-base/files. For rstudio, you can build a version from scratch (https://github.com/rstudio/rstudio), download a pre-compiled, ready-to-use version (https://posit.co/download/rstudio-desktop/#download), or install a suitable version with the mamba package manager.


At the time of writing (15 April 2024), R 4.3.3 was the latest R version available for installation, but it was downgraded to R 4.3.1 when co-installed with rstudio in the same mamba-created environment because of incompatibility between the two. After experimenting with many combinations of R and rstudio, I found that R version 4.3.1 and rstudio version 2022.12.0+355 were the most recent compatible versions that could be co-installed in a mamba-created virtual environment. A recipe for installing these two versions is given below:


1. Installation:

$mamba create -n R431 --override-channels -y -c conda-forge r-base=4.3.1 conda-forge::rstudio-desktop --file requirements.txt

where:

  • -n = short for the word “name”
  • R431 = name of the environment you wish to create (you can choose your own name)
  • --override-channels = no other channels besides conda-forge will be searched
  • -y = answers “yes” to confirmation prompts automatically
  • -c conda-forge = the channel from which packages are installed
  • r-base=4.3.1 = the R version to install
  • --file requirements.txt = a text file listing additional packages to install. R packages on conda-forge are named in the r-* format (for example r-ape), so they can be listed in this file and installed together with R. Separately, although R and Python are distinct languages, they can communicate with each other: the Python package rpy2 allows Python to call R functions, and the R package reticulate allows R to execute Python code and use Python libraries. This lets a project draw on functionality from both Python and R.


An example of a requirements.txt file listing R packages in the r-* format is given below, but there are also other ways of installing R packages - https://www.dataquest.io/blog/install-package-r/


# With a text editor, create a new requirements.txt file; the packages it lists are installed with the --file option

r-hunspell # a spell checker for R

r-spatstat

r-devtools # an R package development tool

r-markdown

r-ape

r-phangorn

r-mass

r-matrix

r-reshape2

r-openmx

r-stringr
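
I have not verified that every package manager version accepts trailing comments in a file like the one above, so it can be useful to extract the bare package names first. The Python sketch below shows one way to do that. The function name and the sample text are illustrative assumptions.

```python
# Minimal sketch: list the package specs in requirements-style text,
# ignoring blank lines, full-line comments and trailing comments.
# The function name and sample text are illustrative assumptions.


def read_requirements(text: str) -> list:
    """Return the package specs found in requirements-style text."""
    packages = []
    for line in text.splitlines():
        # Drop everything after the first '#' and surrounding whitespace.
        spec = line.split("#", 1)[0].strip()
        if spec:
            packages.append(spec)
    return packages


if __name__ == "__main__":
    sample = "# example file\n\nr-hunspell # a spell checker for R\n\nr-ape\nr-mass\n"
    for package in read_requirements(sample):
        print(package)
```

Writing the returned names back out, one per line, gives a comment-free file that can be passed to --file.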


2. Checking where your R libraries are installed:

#Activate R431 environment:

$mamba activate R431 (if not already activated)


#Start R

$R


# To check the installed R libraries, use the following at the R prompt (indicated by >).

>.libPaths()


You should see something like the following on your screen, with usr replaced by your user id (which can be obtained using: $whoami).

[1] "/home/usr/mambaforge/envs/R431/lib/R/library"


Note that when rstudio is launched from the active environment, it automatically points to the location where your R library packages are installed, as shown above. However, if you find more than one path, then comment out R_LIBS=….. in the .Renviron file in your home directory, as it could be directing R to the wrong library.
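
To spot such entries quickly, the following Python sketch scans the text of a .Renviron file for uncommented lines that set R_LIBS, R_LIBS_USER or R_LIBS_SITE. It only reports the lines and changes nothing. The function name and the sample content are hypothetical.

```python
# Minimal sketch: find uncommented library-path settings in .Renviron-style text.
# The function name and the sample content are illustrative assumptions.

LIBRARY_VARIABLES = {"R_LIBS", "R_LIBS_USER", "R_LIBS_SITE"}


def active_r_libs_lines(renviron_text: str) -> list:
    """Return uncommented lines that set one of the R library path variables."""
    hits = []
    for line in renviron_text.splitlines():
        stripped = line.strip()
        # Skip blank lines, comments and lines without an assignment.
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        if name in LIBRARY_VARIABLES:
            hits.append(stripped)
    return hits


if __name__ == "__main__":
    sample = "R_LIBS=/old/R/library\n# R_LIBS=/disabled/path\nLANG=en_AU.UTF-8\n"
    for hit in active_r_libs_lines(sample):
        print(hit)
```

To check your own file, read its text with pathlib.Path.home() / ".Renviron" (if the file exists) and pass it to the function.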


#Now exit R and return to your active R431 prompt:

>quit()


3. Starting rstudio:

$mamba activate R431 (Activate the environment, if not already active)

$rstudio

rstudio is a GUI tool for working with your R installation and is more convenient and easier to use than plain R, especially for those who are new to R. You can find a lot of useful information on rstudio on the web, some of it free and some in paid courses - https://www.datacamp.com/tutorial/r-studio-tutorial.


In summary:

  • mamba provides a convenient way to manage R and rstudio installations.
  • Consider creating separate environments for different projects.
  • Be aware of potential version conflicts between R and rstudio.
  • Utilize requirements.txt and packs.R for managing R package dependencies.



III. Installing different perl versions using perlbrew

Though perl is not used as widely as python or R, there are many times when perl is required. As is the case with python, many different perl versions have been released over time, and a bioinformatics tool may require a specific perl version for it to work.

1. Check the default perl version that comes with your operating system:

$perl --version

  • For ubuntu 18.04, the default system perl version is v5.26.1
  • For ubuntu 22.10, the default system perl version is v5.34


2. Installing perlbrew:

Installing a different perl over the perl that ships with your operating system can break system tools that depend on it. Perlbrew, an admin-free perl installation management tool, is widely used to install new perl versions alongside the system perl. At the time of writing, the latest version of perlbrew is 0.98. The perlbrew website is https://perlbrew.pl

Perlbrew can be installed to a default location (refer to 2A) or, optionally, to a location selected by the user (refer to 2B). All the subsequent steps from 3 to 9 are common to both installations (2A and 2B).

2A. Download the perlbrew package and install it to the default location

#Paste the following to download and install with bash

$curl -L https://install.perlbrew.pl | bash

# Wait for the installation to complete and activate perlbrew (must be done each time in a new terminal)

$source /home/usr/perl5/perlbrew/etc/bashrc

NOTE: usr = your login name which can be found with $whoami

2B. Install perlbrew package to a location of your choice (OPTIONAL)

(https://stackoverflow.com/questions/32719245/how-can-i-move-perlbrew-root-to-another-directory)

#Deactivate the mamba environment and get back to your normal prompt

$conda deactivate

#Set the path of perlbrew to the location of your choice, then download perlbrew and install it. Perlbrew will be installed in the directory selected with PERLBREW_ROOT

$export PERLBREW_ROOT=/pvol/opt/perl5/perlbrew

$curl -kL https://install.perlbrew.pl | bash

# Wait for the installation to complete and activate perlbrew (must be done each time in a new terminal)

$source /pvol/opt/perl5/perlbrew/etc/bashrc

NOTE: usr = your login name which can be found with $whoami

The following are common steps to either 2A or 2B:


3. Install cpanm:

You must install cpanm with perlbrew – if you don’t, weird things can happen when you switch perls and try to install tools.

$perlbrew install-cpanm


4. List all available installable perl versions:

# List all available perl packages

$perlbrew available

# perl

perl-5.38.2

perl-5.36.3

perl-5.34.3

perl-5.32.1

perl-5.30.3

perl-5.28.3

perl-5.26.3

perl-5.24.4

perl-5.22.4

perl-5.20.3

perl-5.18.4

perl-5.16.3

perl-5.14.4

perl-5.12.5

perl-5.10.1

perl-5.8.9

perl-5.6.2


5. Select and install a perl version:

There are three different ways of installing a perl version, depending on whether you want a multithreaded or a non-multithreaded perl.

a) Install a non-multithreaded perl version selected from the above list

$perlbrew install perl-5.32.1

b) If a non-multithreaded perl has already been installed, then a multithreaded build of the same version can be installed under a different name:

$perlbrew install perl-5.32.1 --as=5.32.1t -Dusethreads

c) If a new multithreaded perl is to be installed, then use the following command:

$perlbrew install perl-5.38.2 -Dusethreads -Duselargefiles -Dcccdlflags=-fPIC -Doptimize=-O2 -Duseshrplib -Duse64bitall -Darchname=x86_64-linux-gnu -Dccflags=-DDEBIAN --as threaded-perl-5.38.2t

NOTE: Installing a new perl version will take time. You can track the status of the installation by copying the command that appears on the screen and pasting it into a new terminal. In the case of installing 5c:

$tail -f ~/perl5/perlbrew/build-threaded-perl-5.38.2t.log


6. List all the installed perl versions on your computer

$perlbrew list

The following perl versions have been installed on my computer. An asterisk (*) shows the version that is currently in use.

threaded-perl-5.39.4

perl-5.38.0

perl-5.36.3

threaded-perl-5.36.1

perl-5.36.0

* perl-5.22.0
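
If you need this information inside a script, the listing above is easy to parse. The Python sketch below assumes output in the format shown, with one name per line and an asterisk marking the version in use. The function name is hypothetical, and the sample text is copied from the listing rather than produced by running perlbrew.

```python
# Minimal sketch: parse text in the format of the `perlbrew list` output shown above.
# The function name is an illustrative assumption; the sample is copied from the listing.


def parse_perlbrew_list(output: str):
    """Return (installed_names, active_name); active_name is None if nothing is marked."""
    installed = []
    active = None
    for line in output.splitlines():
        entry = line.strip()
        if not entry:
            continue
        is_active = entry.startswith("*")
        fields = entry.lstrip("*").split()
        if not fields:
            continue
        # The first remaining word is taken to be the installation name.
        name = fields[0]
        installed.append(name)
        if is_active:
            active = name
    return installed, active


if __name__ == "__main__":
    sample = (
        "threaded-perl-5.39.4\n"
        "perl-5.38.0\n"
        "perl-5.36.3\n"
        "threaded-perl-5.36.1\n"
        "perl-5.36.0\n"
        "* perl-5.22.0\n"
    )
    names, current = parse_perlbrew_list(sample)
    print(len(names), "installed; in use:", current)
```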


7. Select the perl version you wish to use

$perlbrew switch threaded-perl-5.39.4


8. Check the current perl version environment

$perl --version


9. Deleting all perl versions installed by perlbrew

All perl versions are installed in ~/perl5/perlbrew/perls/. Simply delete the directory perl5 to delete all perl versions installed with perlbrew (this also removes perlbrew itself).




