Canonical Correlation Analysis
…is a multivariate exploratory approach that highlights correlations between two data sets acquired on the same experimental units. In the same vein as PCA, CCA seeks linear combinations of the variables (called canonical variates) to reduce the dimension of the data sets, but this time while maximizing the correlation between the two variates (the canonical correlation).
As with PCA, the user has to choose the number of pairs of canonical variates (ncomp) needed to summarize as much information as possible.
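As a rough illustration of what canonical variates and canonical correlations are, the sketch below applies base R's cancor (from the stats package, not mixOmics) to simulated data. The simulated matrices, the shared signal z and the sample size are assumptions made for this example only.

```r
set.seed(123)
n <- 200
z <- rnorm(n)  # shared latent signal (simulated, hypothetical)

# Two small data sets measured on the same n units (p = 3, q = 2)
X <- cbind(z + rnorm(n, sd = 0.5), rnorm(n), rnorm(n))
Y <- cbind(z + rnorm(n, sd = 0.5), rnorm(n))

# Classical CCA; valid here because p < n and q < n
cc <- cancor(X, Y)
cc$cor  # canonical correlations, in decreasing order

# Canonical variates: centered data multiplied by the coefficients
U <- scale(X, center = cc$xcenter, scale = FALSE) %*% cc$xcoef
V <- scale(Y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef

# Correlation of the first pair of variates matches cc$cor[1]
# up to numerical error
cor(U[, 1], V[, 1])
```

Only the first pair of variates carries the shared signal here, so a single dimension would be a reasonable choice in this toy setting.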
Regularized Canonical Correlation Analysis (rCCA)
Classical CCA assumes that p < n and q < n, where p and q are the numbers of variables in each set. In the high-dimensional setting usually encountered with biological data, where p + q >> n + 1, CCA cannot be performed:
- The largest canonical correlations are close to 1, so the recovered canonical subspaces do not provide any meaningful information.
- The sample covariance matrices are ill-conditioned or nearly so because of collinearities or near-collinearities in one or both data sets, so computing their inverses is unreliable.
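Both problems can be reproduced on pure noise. The sketch below is not part of mixOmics; it simulates two independent data sets with more variables than samples (the dimensions are arbitrary assumptions) and computes the canonical correlations as the cosines of the angles between the two centered column spaces.

```r
set.seed(1)
n <- 10; p <- 20; q <- 15

# Independent noise: there is no real association between X and Y
X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)
Y <- scale(matrix(rnorm(n * q), n, q), scale = FALSE)

# Orthonormal basis of the column space of a centered matrix
basis <- function(M) {
  s <- svd(M)
  s$u[, s$d > 1e-8 * s$d[1], drop = FALSE]
}

# Canonical correlations = singular values of the product of the bases
rho <- svd(crossprod(basis(X), basis(Y)))$d
round(rho, 6)  # all equal to 1 up to rounding, despite independent data

# The sample covariance matrix of X is singular: its rank is at most n - 1 < p
qr(cov(X))$rank
```

Because each centered data set already spans every direction available to n samples, the two column spaces coincide and every canonical correlation is trivially 1.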
Therefore, a regularization step must be included. Such a regularization was first proposed by Vinod (1976), then developed by Leurgans et al. (1993). It consists of regularizing the empirical covariance matrices of X and Y by adding a multiple of the identity matrix (Id): Cov(X) + λ1Id and Cov(Y) + λ2Id.
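The effect of adding λ1Id and λ2Id can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code used inside rcc: the helper names inv_sqrt and ridge_cancor are hypothetical, the data are simulated noise, and the penalty values are arbitrary.

```r
set.seed(42)
n <- 10; p <- 20; q <- 15
X <- matrix(rnorm(n * p), n, p)  # simulated, p > n
Y <- matrix(rnorm(n * q), n, q)  # simulated, q > n

# Inverse square root of a symmetric positive definite matrix
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

# Canonical correlations computed from the regularized covariance matrices
ridge_cancor <- function(X, Y, lambda1, lambda2) {
  Cxx <- cov(X) + lambda1 * diag(ncol(X))  # Cov(X) + lambda1 * Id
  Cyy <- cov(Y) + lambda2 * diag(ncol(Y))  # Cov(Y) + lambda2 * Id
  Cxy <- cov(X, Y)
  svd(inv_sqrt(Cxx) %*% Cxy %*% inv_sqrt(Cyy))$d
}

rho_reg <- ridge_cancor(X, Y, lambda1 = 0.1, lambda2 = 0.1)
round(rho_reg, 3)  # strictly below 1 once both penalties are positive
```

With positive penalties both matrices become invertible, and the canonical correlations are no longer forced to 1 by the dimensions alone.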
In addition to the number of dimensions ncomp, rCCA therefore has two parameters to tune: the regularization parameters (or l2 penalties) λ1 and λ2. They are tuned by cross-validation with the function estim.regul (see below). Note that the same two values are used for all dimensions of rCCA. This tuning step may take some computation time.
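To make the idea of tuning by cross-validation concrete, here is a hedged base R sketch of a leave-one-out grid search. It is not the estim.regul implementation: the helpers first_weights and loo_score, the simulated data, the sign and scaling conventions, and the grid values are all assumptions chosen for illustration.

```r
set.seed(7)
n <- 12; p <- 6; q <- 5
z <- rnorm(n)                        # shared latent signal (simulated)
X <- matrix(rnorm(n * p), n, p) + z  # z[i] is added to every entry of row i
Y <- matrix(rnorm(n * q), n, q) + z

# First pair of canonical weight vectors under ridge regularization
first_weights <- function(X, Y, l1, l2) {
  Cxx <- cov(X) + l1 * diag(ncol(X))
  Cyy <- cov(Y) + l2 * diag(ncol(Y))
  Cxy <- cov(X, Y)
  M <- solve(Cxx, Cxy) %*% solve(Cyy, t(Cxy))
  a <- Re(eigen(M)$vectors[, 1])   # leading eigenvector gives the X weights
  if (sum(a) < 0) a <- -a          # fix the arbitrary sign across folds
  b <- as.vector(solve(Cyy, t(Cxy) %*% a))
  # Scale so that both training variates have unit variance
  list(a = a / sd(as.vector(X %*% a)), b = b / sd(as.vector(Y %*% b)))
}

# Leave-one-out score: correlation between held-out canonical variates
loo_score <- function(X, Y, l1, l2) {
  u <- v <- numeric(nrow(X))
  for (i in seq_len(nrow(X))) {
    Xtr <- X[-i, , drop = FALSE]
    Ytr <- Y[-i, , drop = FALSE]
    w <- first_weights(Xtr, Ytr, l1, l2)
    u[i] <- sum((X[i, ] - colMeans(Xtr)) * w$a)
    v[i] <- sum((Y[i, ] - colMeans(Ytr)) * w$b)
  }
  cor(u, v)
}

# Evaluate every (lambda1, lambda2) pair and keep the best one
grid <- expand.grid(lambda1 = c(0.01, 0.1, 1), lambda2 = c(0.01, 0.1, 1))
grid$cv <- mapply(function(l1, l2) loo_score(X, Y, l1, l2),
                  grid$lambda1, grid$lambda2)
best <- grid[which.max(grid$cv), ]
print(best)
```

Each grid point requires n model fits, which is why the search becomes slow as the grids or the data sets grow.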
Usage in mixOmics
CCA and rCCA are implemented in mixOmics via the function rcc as displayed below.
library(mixOmics)

data(nutrimouse)
X <- nutrimouse$lipid
Y <- nutrimouse$gene

## Regularized CCA
result <- rcc(X, Y, ncomp = 3, lambda1 = 0.064, lambda2 = 0.008)
The tuning of λ1 and λ2 requires the user to set a grid of values to test:
grid1 <- seq(0, 0.2, length.out = 5)       # User to choose
grid2 <- seq(0.0001, 0.2, length.out = 5)

## Validation can be "loo" (leave-one-out) or "Mfold" (M-fold cross validation)
result <- estim.regul(X, Y, grid1 = grid1, grid2 = grid2, validation = "loo")
The function then displays the optimal values for λ1 and λ2 and the corresponding graph, as displayed below:
lambda1 = 0.1
lambda2 = 0.050075
CV-score = 0.8505446
See also: rCCA: Nutrimouse case study
References
- Leurgans S.E., Moyeed R.A. and Silverman B.W. (1993) Canonical correlation analysis when the data are curves. Journal of the Royal Statistical Society. Series B 55, pp 725-740.
- Vinod H.D. (1976) Canonical ridge and econometrics of joint production. Journal of Econometrics 6, pp 129-137.
- González I., Déjean S., Martin P.G.P. and Baccini A. (2008) CCA: An R Package to Extend Canonical Correlation Analysis. Journal of Statistical Software 23(12).