Preliminary analysis with PCA
library(mixOmics)
data(srbct)

## The gene expression data
X <- srbct$gene
pca.srbct <- pca(X, ncomp = 3, center = TRUE, scale = TRUE)

## Represent the samples on the first 2 principal components:
## first color the samples with respect to their classes
col.srbct <- as.numeric(as.factor(srbct$class))
col.srbct[col.srbct == 1] <- 'red'
col.srbct[col.srbct == 2] <- 'blue'
col.srbct[col.srbct == 3] <- 'black'
col.srbct[col.srbct == 4] <- 'green'

plotIndiv(pca.srbct, col = col.srbct, ind.names = FALSE, pch = 16)
Most of the samples are mixed with each other. Let us see what happens when we include the class information of the samples and select the relevant genes that help classify them.
sPLS-DA analysis
## X is the gene expression data set
X <- srbct$gene

## Y is the response variable indicating the class of each sample
Y <- srbct$class

## In sPLS-DA, variable selection is only allowed on the X data set,
## here we select 50 genes on each component
result <- splsda(X, Y, ncomp = 3, keepX = c(50, 50, 50))
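As a follow-up to the call above, the sketch below shows one way to look at the fitted model: plotting the samples on the first two sPLS-DA components and listing the genes selected on the first component. It assumes a recent mixOmics release in which `plotIndiv()` and `selectVar()` accept an `splsda` object; the names `result` and `selected.genes` are illustrative only.

```r
library(mixOmics)

data(srbct)
X <- srbct$gene    # gene expression data
Y <- srbct$class   # class label of each sample

## Same model as above: 3 components, 50 genes kept on each
result <- splsda(X, Y, ncomp = 3, keepX = c(50, 50, 50))

## Samples on the first two sPLS-DA components, colored by class
plotIndiv(result, comp = c(1, 2), ind.names = FALSE, legend = TRUE)

## Names of the genes selected on component 1
## (assumption: selectVar() returns a list with a `name` element)
selected.genes <- selectVar(result, comp = 1)$name
head(selected.genes)
```

Changing `comp` in `selectVar()` to 2 or 3 would list the genes kept on the other components.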
In this specific case, the barplot of explained variance seems to indicate a drop after 5 principal components. However, the choice of the number of principal components ncomp is up to the user; for ease of interpretation, we set ncomp = 3 in the remaining analysis.
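A barplot of explained variance of the kind described here can be sketched as follows. This is a minimal illustration, not necessarily how the original figure was produced: it uses base R `prcomp()` rather than the mixOmics `pca()` function, and the names `pr` and `var.expl` are hypothetical.

```r
library(mixOmics)  # only used here for the srbct example data

data(srbct)
X <- srbct$gene

## Base-R PCA on centered and scaled genes
pr <- prcomp(X, center = TRUE, scale. = TRUE)

## Proportion of the total variance carried by each principal component
var.expl <- pr$sdev^2 / sum(pr$sdev^2)

## Barplot of the first 10 components; look for where the bars level off
barplot(var.expl[1:10], names.arg = 1:10,
        xlab = "Principal component",
        ylab = "Proportion of explained variance")

## Cumulative proportion for the same components
round(cumsum(var.expl)[1:10], 3)
```

The point where the bars flatten is only a visual guide; a smaller ncomp can still be preferred when the goal is a plot that is easy to read.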