The Science of Imputation: Bridging the Data Gap

Imputing missing values makes more statistical methods applicable, simplifies programming, and can often extract more insight from the existing data.

Have you ever struggled with a dataset with missing values? Missing values can make statistical analysis more difficult and might hide important information.

Let’s look at a small example: an experiment in which five lipids were measured. Only lipid CE 18:1;0 has a measured value in every sample. All other lipids have at least one missing value.

Many researchers start their analysis with a principal component analysis (PCA). However, a standard PCA does not accept missing data points. You therefore have to either use another method or work around the problem. The workaround is either to use only the completely measured lipids, at the cost of high information loss, or to impute the missing data points.

Typical lipidomic dataset with missing values highlighted in yellow.
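The first workaround mentioned above, keeping only completely measured lipids, can be sketched in a few lines of Python. This is a minimal illustration, not an analysis of the real dataset: the numeric values, the names "Lipid C" to "Lipid E", and the helper functions are invented for the example. Only the overall shape (five lipids, nine samples, CE 18:1;0 complete, six missing values for CE 20:1;0) follows the text.

```python
# Toy table: all numeric values are invented for illustration.
# None marks a missing value; each list holds one value per sample.
data = {
    "CE 18:1;0": [12.1, 10.4, 11.8, 9.7, 13.0, 10.9, 12.5, 11.2, 10.1],
    "CE 20:1;0": [0.9, None, None, 1.1, None, None, None, 0.8, None],
    "Lipid C":   [4.2, 3.9, None, 4.5, 4.1, 3.8, 4.4, None, 4.0],
    "Lipid D":   [None, 2.1, 2.4, 2.0, 2.6, None, 2.2, 2.3, 2.5],
    "Lipid E":   [7.3, 7.9, 6.8, None, 7.1, 7.5, 7.7, 6.9, 7.2],
}


def missing_counts(table):
    """Count the missing values per lipid."""
    return {name: sum(v is None for v in vals) for name, vals in table.items()}


def complete_lipids(table):
    """Keep only lipids that were measured in every sample."""
    return {name: vals for name, vals in table.items() if None not in vals}


kept = complete_lipids(data)
print(missing_counts(data))
print(f"{len(kept)} of {len(data)} lipids remain: {list(kept)}")
```

With this toy table only one of the five lipids survives the filter, which is the information loss the text refers to.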

Many methods for imputing missing data points are available today, so you are often faced with the question of which one to choose. The answer is not easy, because there is no single best method. But there is probably a method that works best for your data.

An important question to ask is why the data points are missing. Is it completely random? Or is there some factor (tracked or untracked) that influences the probability of a data point being missing? Although these questions are important for the choice of the imputation method, it is not possible to answer them with confidence for every missing data point.

In our lipidomics context, we know that many data points are missing because they are below the limit of detection. Imputing these values with the mean or median of the corresponding lipid would therefore result in poor imputation quality, because the observed values all lie above the limit of detection. Imputing them consistently with zero cannot be correct either, because values below a detection limit are not necessarily zero. Imputing every missing data point with half the detection limit seems more reasonable, but it is also problematic: the true values behind the missing data points of one lipid are clearly not all the same. It would be unrealistic to impute all missing values of CE 20:1;0 in our example with half the limit of detection, because 6 out of 9 samples would then have the same imputed value for this lipid.
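The weaknesses of these simple strategies can be made concrete with a short Python sketch. The measured values and the limit of detection used here are invented for illustration, and `impute_constant` is a hypothetical helper, not part of any library.

```python
from statistics import mean

lod = 0.7  # hypothetical limit of detection, chosen for this example

# Invented values for CE 20:1;0 across nine samples; None marks a missing value.
ce_20_1 = [0.9, None, None, 1.1, None, None, None, 0.8, None]


def impute_constant(values, fill):
    """Replace every missing value with the same constant."""
    return [fill if v is None else v for v in values]


observed = [v for v in ce_20_1 if v is not None]

by_mean = impute_constant(ce_20_1, mean(observed))
by_zero = impute_constant(ce_20_1, 0.0)
by_half_lod = impute_constant(ce_20_1, lod / 2)

# The mean of the observed values lies above the limit of detection,
# although the missing values are assumed to lie below it.
print("mean above LOD:", mean(observed) > lod)

# Half-LOD imputation assigns one identical value to all six missing samples.
print("samples sharing the half-LOD value:", by_half_lod.count(lod / 2))
```

All three variants fill the gaps with a single constant. They differ only in where that constant sits relative to the limit of detection.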

A more advanced method is therefore needed. At Lipotype we have investigated many different imputation methods, applied them to lipidomic datasets, and performed simulation studies. We concluded that the k-nearest neighbor truncation approach described by Shah et al. (2017, “Distribution based nearest neighbor imputation for truncated high dimensional data with applications to pre-clinical and clinical metabolomics studies”) works best. In brief, k-nearest neighbor truncation, unlike other imputation methods, takes the limit of detection into account. Missing values are imputed in four steps: all values are first transformed lipid by lipid to a common scale; the nearest neighbor (that is, the most similar lipid) is then found for each lipid containing missing values; the missing values are imputed with the values of that nearest neighbor; and finally the data are back-transformed. Below is the data table of our example after imputing the missing values.

Full lipidomic dataset with yellow values imputed.
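The four steps described above (scale, find the nearest lipid, copy its value, back-transform) can be sketched in Python as follows. This is a deliberately simplified illustration and not the method of Shah et al.: it scales each lipid with the plain mean and standard deviation of its observed values, whereas the published approach takes the limit of detection into account. It also uses a single neighbor and a mean squared distance, both of which are assumptions of this sketch. All numeric values, the names "Lipid C" to "Lipid E", and the function names are invented.

```python
from statistics import mean, stdev

# Toy table: all numeric values are invented; None marks a missing value.
data = {
    "CE 18:1;0": [12.1, 10.4, 11.8, 9.7, 13.0, 10.9, 12.5, 11.2, 10.1],
    "CE 20:1;0": [0.9, None, None, 1.1, None, None, None, 0.8, None],
    "Lipid C":   [4.2, 3.9, None, 4.5, 4.1, 3.8, 4.4, None, 4.0],
    "Lipid D":   [None, 2.1, 2.4, 2.0, 2.6, None, 2.2, 2.3, 2.5],
    "Lipid E":   [7.3, 7.9, 6.8, None, 7.1, 7.5, 7.7, 6.9, 7.2],
}


def standardize(values):
    """Scale one lipid to z-scores using its observed values only.

    Simplification: assumes at least two distinct observed values and
    ignores the limit of detection when estimating mean and spread.
    """
    obs = [v for v in values if v is not None]
    mu, sd = mean(obs), stdev(obs)
    return [None if v is None else (v - mu) / sd for v in values], mu, sd


def distance(za, zb):
    """Mean squared difference over samples observed in both lipids."""
    shared = [(a - b) ** 2 for a, b in zip(za, zb)
              if a is not None and b is not None]
    return mean(shared) if len(shared) >= 2 else float("inf")


def impute_nearest_neighbor(table):
    """Fill each gap with the back-transformed z-score of the closest lipid."""
    scaled = {name: standardize(vals) for name, vals in table.items()}
    result = {}
    for name, vals in table.items():
        z, mu, sd = scaled[name]
        filled = list(vals)
        for i, v in enumerate(vals):
            if v is not None:
                continue
            # Only lipids that were measured in sample i can donate a value.
            candidates = [(distance(z, scaled[other][0]), other)
                          for other in table
                          if other != name and scaled[other][0][i] is not None]
            if not candidates:
                continue  # no donor available: leave the gap
            dist, best = min(candidates)
            if dist == float("inf"):
                continue  # too little overlap to judge similarity
            # Back-transform the neighbor's z-score to this lipid's scale.
            filled[i] = scaled[best][0][i] * sd + mu
        result[name] = filled
    return result


imputed = impute_nearest_neighbor(data)
for name, vals in imputed.items():
    print(name, [round(v, 2) for v in vals])
```

Unlike constant imputation, each gap receives its own value, because the donor lipid usually differs from sample to sample in its standardized value. Because this sketch ignores the limit of detection, its imputed values are not guaranteed to fall below it.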

The advantage of a complete dataset is not just that it opens up a broader range of methods and simplifies programming. The dataset can also become more informative: for example, statistical tests may become applicable in the first place, or can be applied with greater confidence. Although the benefits of imputation should be treated with caution, imputation simplifies all further statistical analyses and expands the possibilities.


At Lipotype, we combine cutting-edge lipidomics technology with decades of leading lipid research expertise to deliver convincing results. To celebrate Statistics Month, we are offering a 50% discount on our statistical report packages for the very first time. The discount is only available in October and November 2023 (*Terms & Conditions apply).

Click the link to take advantage of the offer and gain a deeper understanding of your lipidomics data: https://www.lipotype.com/2023/09/october-is-lipotype-statistics-month/

*Terms & Conditions: The discount for statistical report packages applies only to statistical report package orders received before 30.11.2023 and is not valid for previous purchases. The percentage discount applies to the pre-tax total only and has no cash value.

