Principal Component Analysis (PCA)
…is a mathematical procedure that uses an orthogonal linear transformation to convert data from possibly correlated variables into uncorrelated principal components (PCs). The first PC explains as much of the variability in the data as possible, and each subsequent PC explains as much of the remaining variability as possible. Only the PCs that explain the most variance are retained, which is why choosing the number of dimensions or components (ncomp) is crucial (see the function pcatune, below).
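The two properties described above (uncorrelated components, non-increasing explained variance) can be checked on a small example. The sketch below uses base R's prcomp on simulated data; the data set and the variable names (X_sim, prop_var, pc_cor) are made up for illustration and are not part of mixOmics.

```r
# Simulated data: 100 samples, 3 variables, two of them strongly correlated
set.seed(1)
z <- rnorm(100)
X_sim <- cbind(a = z + rnorm(100, sd = 0.3),
               b = 2 * z + rnorm(100, sd = 0.3),
               c = rnorm(100))

fit <- prcomp(X_sim, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each PC (non-increasing by construction)
prop_var <- fit$sdev^2 / sum(fit$sdev^2)
print(round(prop_var, 3))

# Correlation matrix of the component scores: off-diagonal entries are ~0
pc_cor <- cor(fit$x)
print(round(pc_cor, 3))
```

Here most of the variance should land on the first PC, since variables a and b carry the same underlying signal.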
We propose two ways to perform PCA:
- with singular value decomposition (SVD) of the data matrix, a computationally efficient approach also used by the R function prcomp in the stats package, or
- in the case of missing values, with the Non-linear Iterative Partial Least Squares (NIPALS) algorithm, which is an iterative power method.
Both methods are embedded in the pca function, which chooses between them according to whether the data contain missing values.
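To give an idea of how an iterative power method can cope with missing values, here is a minimal NIPALS sketch in base R. It is an illustration only, not the code used inside pca: the function name nipals_sketch and its arguments are hypothetical, and it assumes that every row and every column has at least one observed value.

```r
# Minimal NIPALS sketch for PCA with missing values (NA).
# Hypothetical helper for illustration, not the mixOmics implementation.
nipals_sketch <- function(X, ncomp = 2, max_iter = 500, tol = 1e-9) {
  X <- scale(as.matrix(X), center = TRUE, scale = FALSE)  # means ignore NAs
  miss <- is.na(X)
  w <- (!miss) * 1                    # 1 where observed, 0 where missing
  scores <- matrix(0, nrow(X), ncomp)
  loadings <- matrix(0, ncol(X), ncomp)
  for (h in seq_len(ncomp)) {
    X0 <- X
    X0[miss] <- 0                     # zeros drop out of the sums below
    t_h <- X0[, which.max(colSums(X0^2))]  # start from the most variable column
    for (iter in seq_len(max_iter)) {
      # Regress each column on t_h, using observed entries only
      p_h <- crossprod(X0, t_h) / crossprod(w, t_h^2)
      p_h <- p_h / sqrt(sum(p_h^2))
      # Regress each row on p_h, using observed entries only
      t_new <- (X0 %*% p_h) / (w %*% p_h^2)
      converged <- sum((t_new - t_h)^2) < tol * sum(t_new^2)
      t_h <- t_new
      if (converged) break
    }
    scores[, h] <- t_h
    loadings[, h] <- p_h
    X <- X - tcrossprod(t_h, p_h)     # deflation; missing entries stay NA
  }
  list(scores = scores, loadings = loadings)
}

# Simulated data with two dominant directions, then ~5% of entries removed
set.seed(2)
X_full <- matrix(rnorm(60 * 5), 60, 5) %*% diag(c(5, 3, 1, 1, 1))
X_na <- X_full
X_na[sample(length(X_na), 15)] <- NA

fit_na   <- nipals_sketch(X_na, ncomp = 2)    # works despite the NAs
fit_full <- nipals_sketch(X_full, ncomp = 2)  # complete data
ref      <- prcomp(X_full, center = TRUE, scale. = FALSE)

# On complete data the NIPALS scores should match the SVD scores up to sign
print(abs(cor(fit_full$scores[, 1], ref$x[, 1])))
```

On complete data the two approaches should agree up to the sign of each component; the SVD route is simply faster, which is why NIPALS is reserved for the missing-value case.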
Input data should be centered (center = TRUE) and possibly (sometimes preferably) scaled so that all variables have unit variance (scale. = TRUE). Scaling is especially advised when the variance is not homogeneous across variables.
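The effect of scaling is easiest to see when one variable is measured on a much larger scale than the others. The following sketch uses base R's prcomp on simulated data (the object names are illustrative) and compares the first loading vector with and without scaling.

```r
# Three independent variables; 'large' has a standard deviation 100 times bigger
set.seed(3)
X_units <- cbind(small  = rnorm(50, sd = 1),
                 medium = rnorm(50, sd = 1),
                 large  = rnorm(50, sd = 100))

unscaled <- prcomp(X_units, center = TRUE, scale. = FALSE)
scaled   <- prcomp(X_units, center = TRUE, scale. = TRUE)

# Without scaling, PC1 is almost entirely the high-variance variable
print(round(unscaled$rotation[, 1], 3))
# With scaling, no variable dominates merely because of its units
print(round(scaled$rotation[, 1], 3))

# Share of the total variance carried by PC1 in each case
pc1_unscaled <- unscaled$sdev[1]^2 / sum(unscaled$sdev^2)
pc1_scaled   <- scaled$sdev[1]^2 / sum(scaled$sdev^2)
print(c(unscaled = pc1_unscaled, scaled = pc1_scaled))
```

After scaling, every variable has unit variance, so the total variance equals the number of variables.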
Sparse Principal Component Analysis (sPCA)
…is a variant of PCA that allows variable selection. sPCA in mixOmics is based on singular value decomposition (SVD), and sparsity is achieved via a LASSO penalty (following the method proposed by Shen and Huang (2008) and the penalization used in sPLS).
When applying sparse PCA, orthogonality is lost both between the principal components and between the loading vectors. We used the method of Witten et al. (2009) to force orthogonality among PCs. In our experience, setting scale. = TRUE helps considerably in obtaining orthogonal sparse loading vectors.
The number of variables to select on each PCA dimension has to be chosen by the user (keepX). Note that the proportion of explained variance drops markedly compared to PCA. This is expected, since each sparse component is built from only a subset of the variables, and it has been reported repeatedly in the literature.
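The idea of obtaining sparse loadings by soft-thresholding within an SVD-type iteration can be sketched for the first component as follows. This is a simplified illustration in the spirit of Shen and Huang (2008), not the spca code itself: the helper sparse_pc1_sketch is hypothetical, it handles one component only, and it assumes the penalty is set at each iteration so that exactly keepX loadings stay non-zero.

```r
# Hypothetical helper: first sparse loading vector with keepX non-zero entries
sparse_pc1_sketch <- function(X, keepX, max_iter = 200, tol = 1e-10) {
  X <- scale(as.matrix(X), center = TRUE, scale = TRUE)
  s <- svd(X, nu = 1, nv = 1)         # start from the ordinary first SVD pair
  u <- s$u[, 1]
  v <- s$d[1] * s$v[, 1]
  for (iter in seq_len(max_iter)) {
    z <- drop(crossprod(X, u))        # unpenalized loading update
    # Threshold at the (keepX + 1)-th largest |z| so that keepX entries survive
    lambda <- if (keepX < length(z)) {
      sort(abs(z), decreasing = TRUE)[keepX + 1]
    } else {
      0
    }
    v_new <- sign(z) * pmax(abs(z) - lambda, 0)  # soft-thresholding (LASSO step)
    u_new <- drop(X %*% v_new)
    u_new <- u_new / sqrt(sum(u_new^2))
    converged <- sum((v_new - v)^2) < tol * sum(v_new^2)
    u <- u_new
    v <- v_new
    if (converged) break
  }
  v_unit <- v / sqrt(sum(v^2))        # unit-norm sparse loading vector
  list(loadings = v_unit, scores = drop(X %*% v_unit))
}

# Simulated data: the first 5 of 20 variables share a common signal
set.seed(4)
X_sp <- matrix(rnorm(40 * 20), 40, 20)
X_sp[, 1:5] <- X_sp[, 1:5] + 2 * rnorm(40)

sp    <- sparse_pc1_sketch(X_sp, keepX = 5)
dense <- prcomp(X_sp, center = TRUE, scale. = TRUE)

print(which(sp$loadings != 0))        # indices of the selected variables
# Variance of the sparse component vs. the ordinary first PC
print(c(sparse = var(sp$scores), dense = dense$sdev[1]^2))
```

The sparse component cannot explain more variance than the ordinary first PC, because the latter maximizes variance over all unit-norm loading vectors; this is the drop in explained variance mentioned above.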
Usage in mixOmics
(s)PCA is implemented in mixOmics via the functions pca and spca:
library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene  # Using one data set only

## PCA example: data were centered but not scaled
result <- pca(X, ncomp = 3, center = TRUE, scale. = FALSE)

## sPCA example: we are selecting 50 variables on each of the PCs
result <- spca(X, ncomp = 3, center = TRUE, scale. = TRUE, keepX = rep(50, 3))
The optimal number of components can be determined by using the pcatune function as displayed below:
pcatune(X, ncomp = 10, center = TRUE, scale. = FALSE)
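As a rough illustration of how explained variance can guide the choice of ncomp, the sketch below computes the cumulative proportion of explained variance with base R's prcomp and picks the smallest number of components reaching a threshold. It does not reproduce the internals or the output of pcatune; the simulated data and the 80% cut-off are arbitrary assumptions.

```r
# Simulated data with three dominant directions out of ten variables
set.seed(5)
X_demo <- matrix(rnorm(80 * 10), 80, 10) %*% diag(c(6, 4, 2, rep(1, 7)))

fit <- prcomp(X_demo, center = TRUE, scale. = FALSE)

# Proportion and cumulative proportion of explained variance per component
prop_var <- fit$sdev^2 / sum(fit$sdev^2)
cum_var  <- cumsum(prop_var)
print(round(data.frame(PC = seq_along(prop_var), prop_var, cum_var), 3))

# Smallest number of components reaching the (arbitrary) 80% threshold
ncomp_80 <- which(cum_var >= 0.80)[1]
print(ncomp_80)
```

In practice a fixed threshold is only one heuristic; looking for an elbow in the per-component proportions is a common alternative.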
See also the PCA: Multidrug case study.
References
PCA:
- Jolliffe I.T. (2002) Principal Component Analysis. Springer Series in Statistics, Springer, New York.
sPCA:
- Shen H. and Huang J.Z. (2008) Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis 99(6), pp 1015-1034.
- Witten D.M., Tibshirani R. and Hastie T. (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), pp 515-534.