How To Deal With Missing Lipidomics Values
Imputation can help to plug gaps in lipidomics data, but which technique works best with your data?
Lipidomics analysis enables the measurement of thousands of lipids in samples. However, certain lipids sometimes go unmeasured in a particular sample, resulting in missing data.
In a multi-institutional study led by Nicolas Frölich from Lipotype, scientists looked at common methods used to impute missing values in lipidomics datasets and tested how suitable they were for different types of missing data, with a special focus on values below the limit of detection. The work was published in the journal Proteomics.
Substituting missing values
Missing values can complicate various analyses; for example, techniques like principal component analysis (PCA) require complete datasets. Failing to fill in missing values would limit the extent to which data can be analyzed, but a process called imputation, which involves substituting missing data with alternative values, can help.
There are three main types of missing data: missing completely at random (MCAR), where the absence of a data point is unrelated to any observed or unobserved factor; missing at random (MAR), where it depends only on information that was observed; and missing not at random (MNAR), where it depends on unobserved information, including the missing value itself, as when a lipid's abundance falls below the limit of detection.
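To make the distinction concrete, the minimal sketch below contrasts two ways of introducing gaps into a list of intensities: removing entries at random regardless of their value (MCAR), and removing entries that fall below a detection limit, which is missing because of the value's own magnitude. The function names, the example intensities and the threshold are illustrative assumptions, not taken from the study.

```python
import random

def mask_mcar(values, fraction, seed=0):
    """Set a random fraction of entries to None, independent of their values (MCAR)."""
    rng = random.Random(seed)
    n_missing = round(fraction * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]

def mask_below_lod(values, lod):
    """Set entries below a (hypothetical) limit of detection to None."""
    return [None if v < lod else v for v in values]

# Made-up intensities for one lipid across eight samples
intensities = [0.8, 2.5, 0.3, 5.1, 1.2, 0.1, 3.3, 0.9]
mcar = mask_mcar(intensities, fraction=0.25)
below_lod = mask_below_lod(intensities, lod=0.5)
```

In the first case the positions of the gaps carry no information; in the second, every gap marks a low value, which is why the two cases tend to call for different imputation methods.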
There are, however, numerous imputation methods to choose from, and not all of them work well with every type of missing data. The researchers aimed to evaluate which imputation methods were best suited to which types of missing data, using both simulated and real-world shotgun lipidomics datasets, and to assess how imputing missing values affected statistical testing.
No “one-size-fits-all” solution
Plasma lipidomics data from the GeneRISK cohort and mouse tissue lipidomics data were analyzed. Datasets underwent a filtering step to ensure a minimum percentage of lipid measurements per sample. For the simulated datasets, lipid variables were simulated following normal and lognormal distributions to mimic real-world scenarios.
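As a rough illustration of these two preparation steps, the sketch below draws lognormally distributed values and then drops samples with too few measured lipids. The distribution parameters, the function names and the 70% threshold are assumptions for the example; the article does not state the values used in the study.

```python
import random

def simulate_lognormal(n_samples, n_lipids, mu=0.0, sigma=1.0, seed=0):
    """Return n_samples rows of n_lipids lognormally distributed values."""
    rng = random.Random(seed)
    return [[rng.lognormvariate(mu, sigma) for _ in range(n_lipids)]
            for _ in range(n_samples)]

def filter_samples(rows, min_fraction):
    """Keep samples (rows) in which at least min_fraction of lipids are measured."""
    kept = []
    for row in rows:
        measured = sum(v is not None for v in row)
        if measured / len(row) >= min_fraction:
            kept.append(row)
    return kept

simulated = simulate_lognormal(n_samples=5, n_lipids=4)

# Two hand-made samples with gaps (None); only the first has >= 70% coverage
with_gaps = [[1.0, None, 2.0, 3.0],
             [None, None, None, 1.0]]
filtered = filter_samples(with_gaps, min_fraction=0.7)
```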
Various imputation methods, representing the techniques most commonly used in omics research, were then applied to these datasets. These included mean and median imputation for MCAR data and zero and half-minimum (HM) imputation for MNAR data. Random forest imputation and three k-nearest neighbor methods (knn-TN, knn-EU, knn-CR) were also employed.
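The single-value methods are simple enough to sketch in a few lines. The example below fills the gaps in one lipid's values across samples; half-minimum is taken here to mean half of the smallest observed value for that lipid, which is the usual reading but an assumption about the study's exact implementation. The function name and data are hypothetical.

```python
import statistics

def impute_column(values, method):
    """Fill None entries in one lipid's values using a single-value rule."""
    observed = [v for v in values if v is not None]
    if method == "zero":
        fill = 0.0
    elif method == "half_min":
        fill = min(observed) / 2  # half of the smallest observed value
    elif method == "mean":
        fill = statistics.fmean(observed)
    elif method == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]

# Made-up values for one lipid across five samples
lipid = [4.0, None, 2.0, 9.0, None]
half_min_filled = impute_column(lipid, "half_min")
mean_filled = impute_column(lipid, "mean")
```

Random forest and k-nearest neighbor imputation differ from these rules in that they borrow information from other lipids or samples rather than from the incomplete variable alone.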
The performance of imputation methods was evaluated using metrics such as relative bias (rBias) and normalized root mean square error (NRMSE). This evaluation helped to determine the effectiveness of each method under different conditions, such as varying percentages of missing values.
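The article does not give the formulas behind these metrics, and several variants exist. The sketch below uses one common formulation as an assumption: relative bias as the mean signed error divided by the mean true value, and NRMSE as the root mean square error divided by the standard deviation of the true values. Both are computed only over entries that were deliberately masked, so that the true values are known.

```python
import math
import statistics

def relative_bias(true_vals, imputed_vals):
    """Mean signed error divided by the mean of the true values (assumed definition)."""
    errors = [i - t for t, i in zip(true_vals, imputed_vals)]
    return statistics.fmean(errors) / statistics.fmean(true_vals)

def nrmse(true_vals, imputed_vals):
    """RMSE divided by the standard deviation of the true values (assumed definition)."""
    squared = [(i - t) ** 2 for t, i in zip(true_vals, imputed_vals)]
    return math.sqrt(statistics.fmean(squared)) / statistics.pstdev(true_vals)

# Hypothetical masked entries: the values that were hidden and what was imputed
true_vals = [2.0, 4.0]
imputed_vals = [3.0, 5.0]
rbias_value = relative_bias(true_vals, imputed_vals)
nrmse_value = nrmse(true_vals, imputed_vals)
```

Under these definitions, a positive relative bias indicates systematic overestimation, while NRMSE grows with the overall size of the errors regardless of their sign.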
The key findings of the paper were:
- HM imputation performs well for values below the limit of detection, while zero imputation consistently gives poor results.
- Mean imputation outperformed median imputation on MCAR data.
- Random forest imputation is promising for MCAR data but less so for MNAR data.
- knn-TN or knn-CR methods can handle both MCAR and MNAR data.
- knn-TN or knn-CR with log transformation are recommended for shotgun lipidomics data analysis.
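The last recommendation combines a log transformation with nearest-neighbor imputation. The sketch below shows that general pattern with a plain Euclidean k-nearest neighbor rule across samples; it is not an implementation of knn-TN or knn-CR, whose details the article does not describe. The function name, the choice of k and the data are illustrative assumptions, and all observed values are assumed to be strictly positive so the logarithm is defined.

```python
import math

def knn_impute_log(rows, k=2):
    """Plain Euclidean kNN imputation on log-transformed data.

    rows: samples as lists of positive values, with None marking a gap.
    """
    logged = [[None if v is None else math.log(v) for v in row] for row in rows]
    result = [row[:] for row in logged]
    for a, row in enumerate(logged):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for b, other in enumerate(logged):
                if b == a or other[j] is None:
                    continue
                # Distance over lipids observed in both samples
                shared = [(x - y) ** 2 for x, y in zip(row, other)
                          if x is not None and y is not None]
                if shared:
                    dist = math.sqrt(sum(shared) / len(shared))
                    candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [val for _, val in candidates[:k]]
            if nearest:
                result[a][j] = sum(nearest) / len(nearest)
    # Back-transform to the original scale
    return [[None if v is None else math.exp(v) for v in row] for row in result]

# Made-up data: the first sample is missing its third lipid
samples = [[1.0, 10.0, None],
           [1.0, 10.0, 100.0],
           [1.0, 10.0, 100.0],
           [50.0, 2.0, 5.0]]
imputed = knn_impute_log(samples, k=2)
```

Averaging on the log scale and transforming back yields a geometric rather than arithmetic mean of the neighbors, which limits the influence of a single very large value in right-skewed data.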
Simplifying the lives of lipidomics researchers
The research presented here covers a broad spectrum of imputation techniques, ranging from traditional methods like zero, half-minimum, mean and median imputation, to more sophisticated approaches such as k-nearest neighbor and random forest imputation. This comprehensive approach ensures that researchers have a wide variety of tools they can use when dealing with missing data.
By using both simulated and real datasets in this study design, the researchers enhanced the practical relevance and applicability of their findings. This dual approach enabled a thorough assessment of the imputation methods under various conditions, providing scientists with confidence in the efficacy of the techniques in real-world scenarios.
Of particular significance is the finding that, collectively, the imputation methods analyzed in the study are effective across all types of missingness. This is crucial, as identifying the type of missingness is often a challenging task in practice. The study therefore offers a practical solution that can be implemented by lipidomics researchers.
Using the results of this study as a guideline, researchers can enhance the quality and reliability of lipidomic data analysis output, in turn leading to more accurate scientific conclusions and interpretations.
Additionally, scientists working in the field of lipidomics can benefit from streamlined data analysis processes. The identification of robust imputation techniques can save time and resources, allowing researchers to focus more on data interpretation and hypothesis testing.
Finally, insights gained from this research can inform the development of new imputation techniques tailored specifically for lipidomics data. This research contributes to the decision-making process and streamlining of statistical analyses in the field of lipidomics, thereby simplifying the lives of scientists. The challenge of handling missing data points in lipidomic datasets can now be effectively addressed by applying the results of this study.
It is, however, important to note the study's potential limitations. One is data specificity: the findings might be specific to the real and simulated datasets evaluated in the study and may not generalize well to other omics datasets or mass spectrometry techniques.
Another limitation of the research is the selection of imputation methods. While the study explores various imputation methods, there could be other techniques not included in the analysis that might perform better under certain conditions.
The study focused on analyzing particular correlation structures in data. Data with different correlation structures may show different performances when the same methods are applied.
Additionally, the accuracy of the simulation-based approaches used in the study could impact the reliability of the findings, as simulations may not fully capture the complexity of real-world data.
Finally, the evaluation of imputation methods' performance may be based on specific metrics or criteria that might not fully capture the practical utility or real-world impact of the techniques.
Improving lipidomics data literacy
The results of this study can be integrated into the educational programs for omics professionals, especially those in the lipidomics field. This will help scientists learn about the importance of handling missing data and understand the imputation methods using real or simulated datasets.
The authors hope that the results of this study encourage the development of more open-source software packages that implement the described imputation methods. This would enable scientists to apply these techniques to their own omics datasets easily and contribute to the advancement of data analysis tools in the field.
Reference: Frölich N, Klose C, Widén E, Ripatti S, Gerl MJ. Imputation of missing values in lipidomic datasets. Proteomics. Published online 11 April, 2024:2300606. doi:10.1002/pmic.202300606