Independent Principal Component Analysis (IPCA)
In some case studies, we have identified two limitations of PCA:
- PCA assumes that gene expression follows a multivariate normal distribution, whereas recent studies have shown that microarray gene expression measurements instead follow a super-Gaussian distribution.
- PCA decomposes the data by maximizing variance. In some cases, the biological question is not related to the directions of highest variance in the data.
Instead, we propose Independent Principal Component Analysis (IPCA), which combines the advantages of PCA and Independent Component Analysis (ICA). It uses ICA to denoise the loading vectors produced by PCA, which better highlights the important biological entities and reveals insightful patterns in the data. A sparse version (sIPCA) is also available. This approach was developed in collaboration with Eric F. Yao (QFAB and University of Shanghai).
The algorithm of IPCA is as follows:
1. The original data matrix is centered (by default).
2. PCA is used to reduce the dimension of the data and generate the loading vectors.
3. ICA (FastICA) is applied to the loading vectors to generate independent loading vectors.
4. The centered data matrix is projected on the independent loading vectors to obtain the independent principal components.
IPCA offers a better visualization of the data than ICA and requires fewer components than PCA to do so.
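The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the mixOmics implementation: the function names (`ipca_sketch`, `_fastica`, `_sym_decorrelate`), the log-cosh contrast, the symmetric FastICA update and the centering and whitening of the loading vectors before ICA are all assumptions made for the example.

```python
import numpy as np

def _sym_decorrelate(W):
    # W <- (W W^T)^(-1/2) W, which makes the rows of W orthonormal
    s, u = np.linalg.eigh(W @ W.T)
    return (u / np.sqrt(np.clip(s, 1e-12, None))) @ u.T @ W

def _fastica(Z, max_iter=200, tol=1e-6, seed=0):
    """Symmetric FastICA with a log-cosh contrast on whitened Z (k x p)."""
    k, p = Z.shape
    rng = np.random.default_rng(seed)
    W = _sym_decorrelate(rng.standard_normal((k, k)))
    for _ in range(max_iter):
        G = np.tanh(W @ Z)                              # g(w^T z)
        # Fixed-point update: E[z g(w^T z)] - E[g'(w^T z)] w, row by row
        W_new = (G @ Z.T) / p - (1.0 - G ** 2).mean(axis=1)[:, None] * W
        W_new = _sym_decorrelate(W_new)
        delta = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if delta < tol:
            break
    return W

def ipca_sketch(X, ncomp=2, seed=0):
    """Return (independent principal components, independent loading vectors)."""
    # Step 1: center each variable (column) of the data matrix
    Xc = X - X.mean(axis=0)
    # Step 2: PCA via SVD; rows of Vt are the loading vectors
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    L = Vt[:ncomp]                                      # (ncomp, p)
    # Step 3: center and whiten the loading vectors, then run FastICA on them
    L = L - L.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(L @ L.T / L.shape[1])
    Z = (E / np.sqrt(np.clip(d, 1e-12, None))).T @ L    # D^(-1/2) E^T L
    S = _fastica(Z, seed=seed) @ Z                      # independent loadings
    S /= np.linalg.norm(S, axis=1, keepdims=True)       # unit-norm loadings
    # Step 4: project the centered data on the independent loading vectors
    return Xc @ S.T, S.T

# Toy data: 30 samples, 200 variables, 3 super-Gaussian (Laplace) sources
rng = np.random.default_rng(1)
X = rng.laplace(size=(30, 3)) @ rng.laplace(size=(3, 200))
X += 0.1 * rng.standard_normal((30, 200))
scores, loadings = ipca_sketch(X, ncomp=3)
```

Here `scores` has one row per sample and one column per independent principal component, so its first two columns can be plotted against each other in the same way as PCA components.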
Sparse Independent Principal Component Analysis (sIPCA)
Similar to the sparse PCA implemented in mixOmics, sIPCA applies soft-thresholding to the independent loading vectors of IPCA to perform internal variable selection.
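As an illustration of the soft-thresholding operation, the sketch below shrinks the entries of a loading vector towards zero and sets the small ones exactly to zero. The helper `keep_top`, which picks the threshold so that a requested number of variables stays non-zero, is a hypothetical convenience for this example and is not taken from the mixOmics code.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink every entry of v towards zero by lam; small entries become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def keep_top(v, keep):
    """Soft-threshold v so that `keep` entries stay non-zero (assuming no ties)."""
    magnitudes = np.sort(np.abs(v))[::-1]
    # Threshold at the (keep + 1)-th largest magnitude; 0 keeps everything
    lam = magnitudes[keep] if keep < magnitudes.size else 0.0
    return soft_threshold(v, lam)

loading = np.array([0.5, -0.1, 0.3, -0.8, 0.05])
sparse_loading = keep_top(loading, keep=2)
# Only the two largest-magnitude variables (indices 0 and 3) remain non-zero
```

Variables whose entry is zero in a sparse loading vector do not contribute to the corresponding component, which is what makes the selection internal to the method.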
How to choose the number of variables to select:
Choosing the number of variables to select remains an open issue. In our paper, we proposed using the Davies-Bouldin index, a measure of crisp cluster validity that compares the within-cluster scatter with the between-cluster separation; lower values indicate more compact, better separated clusters.
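A minimal sketch of the Davies-Bouldin index follows, using its standard definition: for each cluster, take the worst ratio of summed within-cluster scatter to centroid distance, then average over clusters. How the index is used in practice is an assumption here: the sketch supposes that `points` holds the sample coordinates on the components and `labels` holds known sample groups, so that candidate numbers of selected variables could be compared by their index values.

```python
import numpy as np

def davies_bouldin(points, labels):
    """Davies-Bouldin index with Euclidean distances; lower is better."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.array([points[labels == c].mean(axis=0) for c in classes])
    # Within-cluster scatter: mean distance of each point to its centroid
    scatter = np.array([
        np.linalg.norm(points[labels == c] - m, axis=1).mean()
        for c, m in zip(classes, centroids)
    ])
    worst = []
    for i in range(len(classes)):
        # Compare cluster i with every other cluster and keep the worst ratio
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(classes)) if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))

# Two clusters of two points each: scatter 1 in each, centroids 10 apart
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels = [0, 0, 1, 1]
db = davies_bouldin(points, labels)   # (1 + 1) / 10 = 0.2
```

Moving the two clusters closer together while keeping their scatter fixed increases the index, which matches the reading that smaller values correspond to a clearer separation of the groups.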