Sparse PLS regression
data(liver.toxicity) X <- liver.toxicity$gene Y <- liver.toxicity$clinic ## Regression mode: select 50 genes and keep all clinical variables on each dimension result <- spls(X, Y, ncomp = 3, mode = 'regression', keepX = c(50, 50, 50), keepY = c(10, 10, 10))
In this specific case, we chose not to select any clinical variable as they are very few in number.
Comparisons with PLS and numerical results
When using regression mode, it is possible to perform cross-validation and estimate the Root Mean Squared Error for each variable in the data set Y. We can then compare the results obtained with PLS with all the variables and sPLS with only the 50 selected variables on each dimension:
## Mfold cross validation will be faster: liver.valid <- valid(X, Y, ncomp = 3, mode = 'regression', method = 'spls', keepX = c(50, 50, 50), validation = 'Mfold', M = 10) ## Leave one out cross-validation for sPLS and PLS (this may take some time to run) liver.spls.loo <- valid(X, Y, ncomp = 3, mode = 'regression', method = 'spls', keepX = c(50, 50, 50), validation = 'loo') liver.pls.loo <- valid(X, Y, ncomp = 3, mode = 'regression', method = 'pls', validation = 'loo') ## Compare the RMSEP criterion for the first 9 clinical variables: palette(rainbow(9)) par(mfrow = c(3, 3)) for(i in 1:9){ spls.rmsep <- sqrt(liver.spls.loo$MSEP[i, ]) pls.rmsep <- sqrt(liver.pls.loo$MSEP[i, ]) matplot(cbind(spls.rmsep, pls.rmsep), type = 'l', col = i, ylab = 'RMSEP', lwd = 2, xlab = 'dim', lty = c(1, 2), axes = FALSE) axis(1, 1:3, labels = 1:3) axis(2) title(main = paste(rownames(liver.spls.loo$MSEP)[i])) } palette("default")
The full (dash) line represent sPLS (PLS). We expect the prediction error to be lower in the case of sPLS as many noisy variables have been removed to improve the prediction of the Y variables. This kind of output can also help choosing the number of variables to select, as well as choosing the number of dimensions.
## Compare the R2 criterion for the first 9 clinical variables: palette(rainbow(9)) par(mfrow = c(3, 3)) for(i in 1:9){ spls.r2 = liver.spls.loo$R2[i, ] pls.r2 = liver.pls.loo$R2[i, ] matplot(cbind(spls.r2, pls.r2), type = 'l', col = i, ylab = 'R2', lwd = 2, xlab = 'dim', lty = c(1, 2), axes = FALSE) axis(1, 1:3, labels = 1:3) axis(2) title(main = paste(rownames(liver.spls.loo$R2)[i])) } palette("default")
With the 'R2'
, we would expect it to be better using sPLS than PLS.
Another function that could be used is the plot.valid
function:
plot(liver.spls.loo, criterion = 'RMSEP', type = 'l', layout = c(3, 4)) plot(liver.spls.loo, criterion = 'R2', type = 'l', layout = c(3, 4))