Bioinformatics and Beyond: March 2025
Hello! Welcome to the March Bioinformatics and Beyond newsletter!
This edition focuses on batch effects: specifically, how to identify and correct them. So let's get started.
What are batch effects?
Batch effects are systematic, non-biological variations often introduced during -omics data generation by technical factors such as sample preparation, reagents, instrumentation, or sequencing runs across sample batches.
Unaddressed batch effects can obscure true biological variation, inflate false discovery rates, and compromise statistical power, leading to misleading conclusions and reduced reproducibility. Therefore, identifying and mitigating batch effects is crucial for reliable data interpretation.
How can batch effects be identified?
Exploring and visualising data is key to identifying batch effects. Exploratory data analysis can reveal patterns in data that align with known or hidden batch effects, indicating technical rather than biological variation.
Dimensionality reduction methods such as principal component analysis (PCA) and multidimensional scaling (MDS) are commonly used for this purpose. Plotting samples on these reduced dimensions and annotating them by known experimental variables makes it possible to spot clusters of samples that are driven by batch rather than by biological variables.
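To make the PCA check concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the data are simulated, the variable names (`batch`, `condition`, `scores`) are made up, and PCA is computed directly from an SVD with NumPy rather than with any particular bioinformatics package. In a real analysis you would plot the first few components and colour the points by each annotation; here the same idea is summarised as a correlation between PC1 and each variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): 12 samples x 200 features.
n_samples, n_features = 12, 200
batch = np.array([0] * 6 + [1] * 6)   # two processing batches
condition = np.array([0, 1] * 6)      # biological groups, balanced across batches

X = rng.normal(size=(n_samples, n_features))
# Strong technical shift affecting every feature in batch 1.
X += batch[:, None] * rng.normal(loc=2.0, scale=0.5, size=n_features)
# Weaker biological signal in the first 20 features only.
X[:, :20] += condition[:, None] * 1.0

# PCA via SVD of the column-centred matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S                        # sample coordinates on each component
explained = S**2 / np.sum(S**2)       # fraction of variance per component


def abs_corr(a, b):
    """Absolute Pearson correlation between two 1-D arrays."""
    return abs(np.corrcoef(a, b)[0, 1])


pc1_batch_corr = abs_corr(scores[:, 0], batch)
pc1_condition_corr = abs_corr(scores[:, 0], condition)

print(f"PC1 explains {explained[0]:.0%} of the variance")
print(f"|corr(PC1, batch)|     = {pc1_batch_corr:.2f}")
print(f"|corr(PC1, condition)| = {pc1_condition_corr:.2f}")
```

If the leading component tracks batch far more closely than it tracks the biological variable, as it is built to do in this simulation, that is a strong hint that technical variation dominates the data.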
Similarly, hierarchical clustering methods group samples based on similarity measures such as between-sample correlation or Euclidean distance. These methods can be used to construct heatmaps or dendrograms which visually represent sample relationships and can reveal batch-driven clusters or branches.
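The clustering check can be sketched in a few lines of Python with SciPy. Again, this is only an illustration under stated assumptions: the data are simulated with a deliberately strong batch shift, and the choices of average linkage and Euclidean distance are mine (passing `metric="correlation"` would use a correlation-based distance instead). Rather than drawing the dendrogram, the sketch cuts the tree into two clusters and checks whether they line up with the batch labels.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Simulated data (hypothetical): 12 samples x 200 features, two batches.
batch = np.array([0] * 6 + [1] * 6)
X = rng.normal(size=(12, 200))
X += batch[:, None] * rng.normal(loc=2.0, scale=0.5, size=200)  # batch shift

# Agglomerative clustering of samples (rows) on Euclidean distance.
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree into two clusters and compare them with the batch labels.
clusters = fcluster(Z, t=2, criterion="maxclust")
pairs = set(zip(batch.tolist(), clusters.tolist()))

# Exactly two (batch, cluster) pairs means each batch sits in its own cluster.
clusters_match_batches = len(set(clusters.tolist())) == 2 and len(pairs) == 2
print("Top-level split follows batch:", clusters_match_batches)
```

For a visual check, the same `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` with sample labels that encode batch.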
How can batch effects be corrected using computational methods?
There are two main strategies to address batch effects.
The first approach involves removing batch effects directly from the data and using the batch-corrected data in downstream analysis. ComBat is a widely used algorithm for adjusting data for known batch effects while preserving biological signal. ComBat uses an empirical Bayes approach which assumes that systematic batch biases affect many features similarly and leverages shared information to adjust for batch effects by shrinking batch-specific mean and variance toward a global estimate. When the source of batch effects is unknown, or technical information is incomplete or unavailable, surrogate variable analysis (SVA) can be used to identify latent sources of variability that correlate with the technical noise, and estimate surrogate variables for this variation which can then be removed from the data. The number of surrogate variables can either be specified directly by the user or determined automatically through a permutation-based procedure, providing flexibility to adapt to different scenarios.
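To give a feel for what "adjusting batch-specific mean and variance" means, here is a deliberately simplified Python sketch. It is not ComBat: there is no empirical Bayes shrinkage and no protection of biological covariates, so it is only reasonable when the biological groups are balanced across batches. The function name `center_scale_by_batch` and the simulated data are hypothetical.

```python
import numpy as np


def center_scale_by_batch(X, batch):
    """Naive per-feature location/scale adjustment (NOT ComBat).

    X: samples x features matrix; batch: 1-D array of batch labels.
    Each batch is standardised per feature, then rescaled to the pooled
    within-batch standard deviation and shifted to the overall mean.
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch)

    grand_mean = X.mean(axis=0)

    # Pooled within-batch standard deviation per feature.
    resid = X.copy()
    for b in labels:
        rows = batch == b
        resid[rows] -= X[rows].mean(axis=0)
    pooled_sd = np.sqrt((resid**2).sum(axis=0) / (len(X) - len(labels)))

    corrected = np.empty_like(X)
    for b in labels:
        rows = batch == b
        mu = X[rows].mean(axis=0)
        sd = X[rows].std(axis=0, ddof=1)
        sd[sd == 0] = 1.0  # guard against constant features
        corrected[rows] = (X[rows] - mu) / sd * pooled_sd + grand_mean
    return corrected


# Simulated example (hypothetical): batch 1 is shifted and more variable.
rng = np.random.default_rng(2)
batch = np.array([0] * 5 + [1] * 5)
X = rng.normal(size=(10, 50))
X[batch == 1] = X[batch == 1] * 1.5 + 2.0

corrected = center_scale_by_batch(X, batch)


def mean_gap(M):
    """Average absolute difference between the two batch means per feature."""
    return np.abs(M[batch == 0].mean(axis=0) - M[batch == 1].mean(axis=0)).mean()


gap_before, gap_after = mean_gap(X), mean_gap(corrected)
print(f"mean batch gap before: {gap_before:.2f}, after: {gap_after:.2e}")
```

ComBat improves on this naive version by borrowing information across features to stabilise the per-batch estimates, which matters most when batches are small.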
To successfully apply batch correction, it is essential to consider the nature of the data. For example, ComBat assumes a Gaussian distribution, making it unsuitable for RNA-sequencing count data. In such cases, ComBat-seq, an extension of the original method, should be used instead, as it relies on negative binomial regression to properly model count-based distributions. For other data types, filtering and normalization are typically required before applying ComBat to meet its Gaussian assumption.
Alternatively, batch effects or surrogate variables can be explicitly modelled in the statistical analysis without altering the data. Linear or generalised linear models can incorporate known batch sources as covariates to account for their contribution to data variance, and interaction terms can further capture condition-specific batch effects (e.g., expression ~ batch + condition + batch:condition). The limma R package is a widely used tool for implementing these models in -omics analyses, providing a framework to conduct hypothesis testing while controlling for batch effects.
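Here is a small Python sketch of the modelling approach for a single feature, using ordinary least squares from NumPy. It is an illustration rather than a replacement for limma: the design, effect sizes, and the helper `fit` are all hypothetical, and the design is deliberately unbalanced (most controls in one batch, most treated samples in the other) so that ignoring batch biases the condition estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical unbalanced design: 16 samples, two batches.
batch = np.array([0] * 8 + [1] * 8)
condition = np.array([0] * 6 + [1] * 2 + [0] * 2 + [1] * 6)

# Simulated expression of one feature with known (made-up) effects.
true_batch_effect, true_condition_effect = 3.0, 1.0
expression = (
    5.0
    + true_batch_effect * batch
    + true_condition_effect * condition
    + rng.normal(scale=0.3, size=batch.size)
)


def fit(columns, y):
    """Least-squares fit of y on an intercept plus the given covariate columns."""
    design = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


naive = fit([condition], expression)            # expression ~ condition
adjusted = fit([batch, condition], expression)  # expression ~ batch + condition

naive_condition = naive[1]
adjusted_condition = adjusted[2]
print(f"condition effect ignoring batch:  {naive_condition:.2f}")
print(f"condition effect adjusting batch: {adjusted_condition:.2f}")
```

Adding a `batch * condition` column to the covariates would correspond to the batch:condition interaction term mentioned above.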
Want to learn more about batch effects and how to mitigate them? All of the above information was provided by Dr Andrés G. de la Filia, a Bioinformatics Team Leader and colleague of mine at Fios Genomics. He published a blog post about batch effects on the Fios Genomics website, which covers the information above along with further details and examples. You can view it here if interested.
Before you go though, how about an interesting fact?
Closed pine cones can indicate rain. This is because pine cones close when the air is humid, which can mean that rain is on its way.
(And if you are wondering, yes, it is raining while I am writing this newsletter.)
Thanks for reading!
-Breige McBride, Marketing Manager, Fios Genomics