PLS Discriminant Analysis (PLS-DA)
…is a classical PLS regression (in regression mode) in which the response variable is categorical, i.e. it indicates the classes/categories of the samples. PLS-DA has often been used for classification and discrimination problems (i.e. supervised classification), even though PLS was not originally designed for this purpose. The response vector Y is qualitative and is recoded as a dummy block matrix in which each response category is coded by an indicator variable. PLS-DA is then run as if Y were a continuous matrix. This is questionable from a theoretical point of view; however, it has been shown to work well in practice.
The parameter for the user to choose here is the number of components or dimensions, ncomp; it is usually set to k - 1, where k is the number of classes.
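The dummy coding and the k - 1 rule of thumb described above can be sketched in base R. This is only an illustration: the class labels are made up, Y.dummy is a hypothetical name, and mixOmics performs its own recoding internally, so you never need to build this matrix yourself.

```r
# Hypothetical class labels for six samples (not from any mixOmics data set)
Y <- factor(c("ctrl", "low", "high", "ctrl", "high", "low"))

# One indicator column per class:
# entry (i, j) is 1 if sample i belongs to class j, and 0 otherwise
Y.dummy <- sapply(levels(Y), function(cl) as.numeric(Y == cl))
rownames(Y.dummy) <- seq_along(Y)

# Rule of thumb from the text: k - 1 components for k classes
k <- nlevels(Y)
ncomp <- k - 1
```

Each row of Y.dummy contains exactly one 1, so the matrix carries the same information as the factor Y while having a numeric form that a regression method can handle.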
Sparse PLS Discriminant Analysis (sPLS-DA)
…is based on the same concept as sPLS to allow variable selection, except that this time, the variables are only selected in the X data set and in a supervised framework, i.e. we are selecting the X-variables with respect to different categories of the samples.
The parameters for the user to choose here are the number of components or dimensions, ncomp, and the number of variables to select in the X data set, keepX. See the SRBCT case study, which illustrates some criteria for choosing these parameters.
Usage in mixOmics
(s)PLS-DA is implemented in mixOmics via the functions plsda and splsda, as displayed below. For both plsda and splsda, we strongly advise working with a training and a testing set (see the function predict). For now, we only illustrate the two approaches on the full ‘training’ set.
Remember that you are only selecting the variables in the X data set. Y data should be entered as a factor.
    library(mixOmics)

    data(liver.toxicity)
    X <- as.matrix(liver.toxicity$gene)
    # Y is a factor; here we use the time points of necropsy
    Y <- as.factor(liver.toxicity$treatment[, 4])

    ## PLS-DA function
    # ncomp is the number of components wanted
    result <- plsda(X, Y, ncomp = 3)

    ## sPLS-DA function
    # keepX is the number of variables selected on each component
    result <- splsda(X, Y, ncomp = 3, keepX = c(10, 10, 10))
With (s)PLS-DA, the classes of new samples or observations can be predicted from the model by using the predict function. The example below shows the first fold of a 3-fold cross-validation. Normally, 10-fold cross-validation should be performed several times and the results averaged to get a better estimate of the generalization performance:
    library(mixOmics)

    data(liver.toxicity)
    X <- as.matrix(liver.toxicity$gene)
    # Y is a factor; here we use the time points of necropsy
    Y <- as.factor(liver.toxicity$treatment[, 4])

    i <- 1
    # Vector with one entry per row of X, each drawn at random from 1, 2 or 3
    samp <- sample(1:3, nrow(X), replace = TRUE)
    # Indices of the samples assigned to fold i (the test set)
    test <- which(samp == i)
    # All remaining samples form the training set
    train <- setdiff(1:nrow(X), test)

    # Note: recent mixOmics releases name the distance argument `dist`;
    # older releases used `method`.

    ## For PLS-DA
    plsda.train <- plsda(X[train, ], Y[train], ncomp = 3)
    test.predict <- predict(plsda.train, X[test, ], dist = "max.dist")

    ## For sPLS-DA
    splsda.train <- splsda(X[train, ], Y[train], ncomp = 3, keepX = c(10, 10, 10))
    test.predict <- predict(splsda.train, X[test, ], dist = "max.dist")
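The advice above, to repeat 10-fold cross-validation several times and average the results, could be sketched as follows. This is a minimal sketch, not the package's own procedure: make_folds and cv_error_plsda are hypothetical helper names, and the code assumes that mixOmics is installed, that predict accepts a `dist` argument, and that it returns the predicted labels in `$class$max.dist` with one column per component.

```r
# Assign each of n samples to one of k folds of near-equal size, in random order
make_folds <- function(n, k) {
  sample(rep(seq_len(k), length.out = n))
}

# Repeated k-fold cross-validation error rate for PLS-DA.
# Assumption: predict() returns labels in $class$max.dist,
# a matrix with one column per component.
cv_error_plsda <- function(X, Y, ncomp = 3, k = 10, nrepeat = 5) {
  errors <- replicate(nrepeat, {
    folds <- make_folds(nrow(X), k)
    wrong <- 0
    for (i in seq_len(k)) {
      test <- which(folds == i)
      train <- setdiff(seq_len(nrow(X)), test)
      fit <- mixOmics::plsda(X[train, ], Y[train], ncomp = ncomp)
      pred <- predict(fit, X[test, , drop = FALSE], dist = "max.dist")
      # Predicted class when all ncomp components are used
      pred.class <- pred$class$max.dist[, ncomp]
      wrong <- wrong + sum(pred.class != as.character(Y[test]))
    }
    # Every sample is tested exactly once per repeat
    wrong / nrow(X)
  })
  # Average the error rate over the repeats
  mean(errors)
}
```

With X and Y defined as in the example above, a call such as cv_error_plsda(X, Y, ncomp = 3, k = 10, nrepeat = 10) would return the averaged misclassification rate. Unlike sample(1:3, nrow(X), replace = TRUE), make_folds produces folds of near-equal size. mixOmics also provides a perf function for performance assessment; its documentation gives the exact arguments.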
See also the sPLS-DA SRBCT case study.
References (in addition to those for (s)PLS)
- Pérez-Enciso M. and Tenenhaus M. (2003) Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach. Human Genetics 112, pp 581-592.
- Nguyen D.V. and Rocke D.M. (2002) Tumor classification by partial least squares using microarray gene expression data. Bioinformatics 18, pp 39-50.
- Lê Cao K.-A., Boitard S. and Besse P. (2011) Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems. Submitted.