Performance of methods that separate common and distinct variation in multiple data blocks
Publication details
Journal: Journal of Chemometrics, vol. 33, 2019
International Standard Numbers:
Printed: 0886-9383
Electronic: 1099-128X
Publication type: Academic article
Issue: 1
Links:
ARKIV: http://hdl.handle.net/11250/25...
DOI: doi.org/10.1002/cem.3085
Summary
In many areas of science, multiple sets of data are collected from the same samples. Such data sets can be analysed by data fusion (or multi-block) methods, usually with the aim of obtaining a holistic understanding of the system or a better prediction of some response. Recently, several scientific groups have developed methods for separating common and distinct variation between multiple data blocks. Although the objective is the same, the strategies and algorithms behind these methods are completely different. In this paper, we investigate the practical aspects of the four most popular methods for separating common and distinct variation: JIVE, DISCO, PCA-GCA, and OnPLS. The main barrier complicating the use of any of these methods is model selection and validation, especially when the number of blocks is more than two. Using extensive simulations, we have elucidated three properties that are important for assessing the validity of the results: the ability to identify the correct model, the ability to estimate the true underlying subspaces, and the robustness towards misspecification of the model. The simulated datasets mimic a range of “real life” data, with different dimensionalities and variance structures. We are thus able to identify which methods work best for different types of data structures and to pinpoint weak spots for each method. The results show that PCA-GCA works best for model selection, while JIVE and DISCO give the best estimates of the subspaces and are most robust towards model misspecification.
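To make the idea of common and distinct variation concrete, the following is a minimal sketch in Python, not the paper's actual simulation design: it builds two blocks that share one common latent component and each have one distinct component, then uses a rough PCA-GCA-style check (canonical correlation between block-wise PCA scores) to flag the shared component. The block sizes, noise level, and number of components are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100  # samples (rows shared across the blocks)

# Latent scores: one common component shared by both blocks,
# plus one distinct component per block (hypothetical setup).
t_common = rng.standard_normal((n, 1))
t_dist1 = rng.standard_normal((n, 1))
t_dist2 = rng.standard_normal((n, 1))

# Loadings mapping latent scores to observed variables.
p_c1, p_c2 = rng.standard_normal((1, 20)), rng.standard_normal((1, 30))
p_d1, p_d2 = rng.standard_normal((1, 20)), rng.standard_normal((1, 30))

# Observed blocks = common part + distinct part + noise.
X1 = t_common @ p_c1 + t_dist1 @ p_d1 + 0.1 * rng.standard_normal((n, 20))
X2 = t_common @ p_c2 + t_dist2 @ p_d2 + 0.1 * rng.standard_normal((n, 30))

def pca_scores(X, k):
    """Scores of the first k principal components of a centred block."""
    U, s, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    return U[:, :k] * s[:k]

# Canonical correlations between the two blocks' PCA scores:
# a correlation close to 1 indicates a common component.
T1, T2 = pca_scores(X1, 2), pca_scores(X2, 2)
Q1, _ = np.linalg.qr(T1)
Q2, _ = np.linalg.qr(T2)
canon_corrs = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
print("canonical correlations:", canon_corrs)
```

With these settings, the first canonical correlation is close to 1 (the shared component), while the second is much lower (the distinct components), which is the kind of pattern the model-selection step in the compared methods has to detect.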