(s)PLS

Partial Least Squares regression (PLS)

…is a multivariate, projection-based method that can address many types of problems. It is extremely efficient when p + q >> n. As it performs local regressions, PLS does not run into the numerical issues that CCA can encounter. Unlike PCA, which maximizes the variance within a single data set, PLS maximizes the covariance between two data sets by seeking linear combinations of the variables from both sets. These linear combinations are called the latent variables; the weight vectors used to compute them are called the loading vectors. Both latent variables and loading vectors come in pairs (one for each data set). Several frameworks are proposed in PLS:

  • regression mode: models a causal relationship between the two data sets, i.e. PLS predicts Y from X
  • canonical mode: similar to CCA, this mode models a bi-directional relationship between the two data sets
  • invariant mode: performs a redundancy analysis (the Y matrix is not deflated)
  • classic mode: the classical PLS as proposed in Tenenhaus (1998)

Similar to CCA and PCA, the parameter to choose is the number of dimensions or components ncomp. This choice can be guided by graphical outputs such as plotIndiv and plot3dIndiv. The Mean Squared Error of Prediction (MSEP), R2 and Q2 can also be obtained from the valid function, which performs cross-validation, for the regression, invariant and classic modes only.
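As an illustrative sketch (assuming the mixOmics package and its liver.toxicity data, which are also used in the Usage section below), the samples can be displayed in the space spanned by the first latent variables of a fitted PLS model:

```r
library(mixOmics)
data(liver.toxicity)

## fit a PLS model with 3 components
result <- pls(liver.toxicity$gene, liver.toxicity$clinic, ncomp = 3)

## sample representation on the first two dimensions, one plot per data set;
## inspecting successive pairs of components can help choose ncomp
plotIndiv(result, comp = 1:2)
```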

Sparse Partial Least Squares regression (sPLS)

Even though PLS is highly efficient in the high-dimensional context, interpretability is needed to gain more insight into the biological study. sPLS was recently developed by our team to perform simultaneous variable selection in the two data sets (Lê Cao et al., 2008). Variable selection is achieved by introducing LASSO penalization on the pair of loading vectors. Both regression and canonical modes are available. In addition to the number of dimensions or components ncomp, the user has to specify the number of variables to select on each dimension and for each data set, keepX and keepY. One criterion proposed to tune these parameters is to use the valid function with cross-validation or leave-one-out validation to compute the MSEP, R2 and Q2; this applies to the regression mode only. However, in the complex case of high-dimensional omics data sets, these statistical criteria may not be satisfactory enough to address the biological question. Sometimes it is best that the user chooses the number of variables to select based on his/her intuition and the posterior biological interpretation of the results.
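As an illustrative sketch of such tuning (the candidate keepX values are arbitrary, and passing keepX through valid is assumed here; check ?valid in your mixOmics version), one could compare a few numbers of selected variables on the first dimension through the resulting MSEP:

```r
library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic

## compare candidate keepX values on one dimension (illustrative only)
for (k in c(10, 50, 100)) {
  err <- valid(X, Y, ncomp = 1, mode = "regression", method = "spls",
               keepX = k, validation = "loo", criterion = "MSEP")
  cat("keepX =", k, " mean MSEP =", mean(err$MSEP), "\n")
}
```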

Usage in mixOmics

PLS is implemented in mixOmics via the functions pls and spls as displayed below:

## PLS
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
result <- pls(X, Y, ncomp = 3)  # where ncomp is the number of dimensions
                                   # or components to choose

## sPLS mode can be "regression" or "canonical"
## keepX and keepY are the number of variables to select on each component
result <- spls(X, Y, ncomp = 3, mode = "regression",
                  keepX = c(50, 50, 50), keepY = c(10, 10, 10))

(s)PLS allows matrices with missing values by using the NIPALS algorithm to estimate them. See the example with PCA. The computation of the Mean Squared Error of Prediction (MSEP) can be done using the valid function as displayed below:
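As a minimal sketch of the missing-value handling (the masked entries are introduced at random here purely for illustration), pls can be called directly on a matrix containing NAs:

```r
library(mixOmics)
data(liver.toxicity)
X <- as.matrix(liver.toxicity$gene)
Y <- liver.toxicity$clinic

## illustrative only: randomly mask 50 entries of X
set.seed(42)
X[sample(length(X), 50)] <- NA

## NIPALS estimates the missing values internally during the fit
result.na <- pls(X, Y, ncomp = 3)
```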

## Using spls with 10-fold CV
error.spls <- valid(X, Y, mode = "regression", method = 'spls', ncomp = 3,
                       M = 10, validation = 'Mfold', criterion = "MSEP")

## Where method can be 'pls' or 'spls', validation "loo" or "Mfold", and M is the
## number of folds when using Mfold
error.spls$MSEP
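As a sketch of how this output might guide the choice of ncomp (the exact structure of the MSEP component can vary between mixOmics versions; check str(error.spls) first), one could average the prediction error over the Y variables for each number of components:

```r
## assumed layout: one row per Y variable, one column per component
mean.msep <- colMeans(error.spls$MSEP)

## component count with the smallest average prediction error
which.min(mean.msep)
```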

See also: sPLS:Liver Toxicity

References

  • PLS:
    • Tenenhaus M. (1998) La régression PLS: théorie et pratique. Paris: Editions Technip.
    • Geladi P. and Kowalski B.R. (1986) Partial Least Squares Regression: A Tutorial. Analytica Chimica Acta 185, pp 1-17.
    • Wold H. (1966) Estimation of principal components and related models by iterative least squares. In: Krishnaiah, P.R. (editors). Multivariate Analysis. Academic Press, N.Y., pp 391-420.
  • sPLS:
    • Lê Cao K.-A., Martin P.G.P., Robert-Granié C. and Besse P. (2009) Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 10(34).
    • Lê Cao K.-A., Rossouw D., Robert-Granié C. and Besse P. (2008) A sparse PLS for variable selection when integrating Omics data. Statistical Applications in Genetics and Molecular Biology 7, article 35.
    • Shen H. and Huang J.Z. (2008) Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis 99, pp 1015-1034.