Title: | PLS Analyses for Genomics |
Description: | Routines for PLS-based genomic analyses, implementing PLS methods for classification with microarray data and prediction of transcription factor activities from combined ChIP-chip analysis. The >=1.2-1 versions include two new classification methods for microarray data: GSIM and Ridge PLS. The >=1.3 versions include a new classification method combining variable selection and compression in the logistic regression context (logit-SPLS), and an adaptive version of the sparse PLS. |
Authors: | Anne-Laure Boulesteix <[email protected]>, Ghislain Durif <[email protected]>, Sophie Lambert-Lacroix <[email protected]>, Julie Peyre <[email protected]>, and Korbinian Strimmer <[email protected]>. |
Maintainer: | Ghislain Durif <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.5-3 |
Built: | 2025-02-24 04:17:47 UTC |
Source: | https://github.com/gdurif/plsgenomics |
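The package is distributed via CRAN; a minimal sketch to install and load it (assuming a standard CRAN mirror is configured):

# install from CRAN (one-time), then load the package
install.packages("plsgenomics")
library(plsgenomics)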
Gene expression data (2000 genes for 62 samples) from the microarray experiments on colon tissue samples of Alon et al. (1999).
data(Colon)
This data set contains 62 samples with 2000 genes: 40 tumor tissues, coded 2, and 22 normal tissues, coded 1.
A list with the following elements:
X | a (62 x 2000) matrix giving the expression levels of 2000 genes for the 62 Colon tissue samples. Each row corresponds to a patient, each column to a gene. |
Y | a numeric vector of length 62 giving the type of tissue sample (tumor or normal). |
gene.names | a vector containing the names of the 2000 genes for the gene expression matrix. |
The data are described in Alon et al. (1999). They were originally collected from http://microarray.princeton.edu/oncology/affydata/index.html, which is now (as of 2024) http://genomics-pubs.princeton.edu/oncology/affydata/index.html.
Alon, U. and Barkai, N. and Notterman, D.A. and Gish, K. and Ybarra, S. and Mack, D. and Levine, A.J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. USA,96(12), 6745–6750.
# load plsgenomics library
library(plsgenomics)
# load data set
data(Colon)
# how many samples and how many genes?
dim(Colon$X)
# how many samples of class 1 and 2, respectively?
sum(Colon$Y==1)
sum(Colon$Y==2)
Gene expression data obtained during the Escherichia coli carbon source transition, together with connectivity data from the RegulonDB database (Salgado et al., 2001). The experiments and data sets are described in Kao et al. (2004).
data(Ecoli)
A list with the following components:
CONNECdata | a (100 x 16) matrix containing the connectivity data for 100 genes and 16 regulators. The data are coded as 1 (positive interaction), 0 (no interaction) and -1 (negative interaction). |
GEdata | a (100 x 23) matrix containing gene expression data for 100 genes and 23 samples corresponding to different time points during the carbon source transition. |
timepoint | a numeric vector of length 23 containing the time points (in hours) for the 23 samples. |
The data are described in Kao et al. (2004). They were originally collected from http://www.seas.ucla.edu/~liaoj/downloads.html, which later became https://www.seas.ucla.edu/liao_lab//downloads.html, but the data set no longer appears to be available there and we could not find a replacement link in 2024.
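As an illustrative sketch (not part of the original documentation), the expression trajectories stored in GEdata can be plotted against timepoint with base R graphics:

# load the data and plot the 100 gene expression profiles
# over the 23 time points of the carbon source transition
library(plsgenomics)
data(Ecoli)
matplot(Ecoli$timepoint, t(Ecoli$GEdata), type="l", lty=1,
        xlab="time (hours)", ylab="expression level")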
K. Kao, Y.-L. Yang, R. Boscolo, C. Sabatti, V. Roychowdhury and J. C. Liao (2004). Transcriptome-based determination of multiple transcription regulator activities in Escherichia coli by using network component analysis, PNAS 101, 641–646.
H. Salgado, A. Santos-Zavaleta, S. Gama-Castro, D. Millan-Zarate, E. Diaz-Peredo, F. Sanchez-Solano, E. Perez-Rueda, C. Bonavides-Martinez and J. Collado-Vides (2001). RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12, Nucleic Acids Research 29, 72–74.
# load plsgenomics library
library(plsgenomics)
# load data set
data(Ecoli)
# how many genes and how many transcription factors?
dim(Ecoli$CONNECdata)
The function gsim
performs prediction using Lambert-Lacroix and Peyre's GSIM algorithm.
gsim(Xtrain, Ytrain, Xtest=NULL, Lambda, hA, hB=NULL, NbIterMax=50)
Xtrain | a (ntrain x p) data matrix of predictors. |
Ytrain | a ntrain vector of responses. |
Xtest | a (ntest x p) matrix containing the predictors for the test data set. |
Lambda | a positive real value, the ridge regularization parameter. |
hA | a strictly positive real value, the bandwidth for GSIM step A. |
hB | a strictly positive real value, the bandwidth for GSIM step B (estimated by plug-in when NULL). |
NbIterMax | a positive integer, the maximal number of iterations in the algorithm. |
The columns of the data matrices Xtrain and Xtest need not be standardized, since standardization is performed by the function gsim as a preliminary step before the algorithm is run.
The procedure described in Lambert-Lacroix and Peyre (2005) is used to estimate
the projection direction beta. When Xtest
is not equal to NULL, the procedure predicts the labels for these new predictor variables.
A list with the following components:
Ytest | the ntest vector containing the predicted labels for the observations from Xtest. |
beta | the p vector giving the estimated projection direction. |
hB | the value of hB used in step B of GSIM (value given by the user, or estimated by plug-in if the argument value was equal to NULL). |
DeletedCol | the vector containing the column numbers of Xtrain that were removed because the corresponding predictor variables have null variance. |
Cvg | the 0-1 value indicating convergence of the algorithm (1 for convergence, 0 otherwise). |
Sophie Lambert-Lacroix (http://membres-timc.imag.fr/Sophie.Lambert/) and Julie Peyre (https://membres-ljk.imag.fr/Julie.Peyre/).
S. Lambert-Lacroix and J. Peyre (2006). Local likelihood regression in generalized linear single-index models with applications to microarray data. Computational Statistics and Data Analysis, 51(3), 2091-2113.
# load plsgenomics library
library(plsgenomics)
# load Colon data
data(Colon)
IndexLearn <- c(sample(which(Colon$Y==2),12), sample(which(Colon$Y==1),8))
Xtrain <- Colon$X[IndexLearn,]
Ytrain <- Colon$Y[IndexLearn]
Xtest <- Colon$X[-IndexLearn,]
# preprocess data
resP <- preprocess(Xtrain=Xtrain, Xtest=Xtest, Threshold=c(100,16000),
                   Filtering=c(5,500), log10.scale=TRUE, row.stand=TRUE)
# perform prediction by GSIM
res <- gsim(Xtrain=resP$pXtrain, Ytrain=Ytrain, Xtest=resP$pXtest,
            Lambda=10, hA=50, hB=NULL)
res$Cvg
sum(res$Ytest!=Colon$Y[-IndexLearn])
The function gsim.cv
determines the best ridge regularization parameter and bandwidth to be
used for classification with GSIM as described in Lambert-Lacroix and Peyre (2005).
gsim.cv(Xtrain, Ytrain, LambdaRange, hARange, hB=NULL, NbIterMax=50)
Xtrain | a (ntrain x p) data matrix of predictors. |
Ytrain | a ntrain vector of responses. |
LambdaRange | the vector of positive real values from which the best ridge regularization parameter has to be chosen by cross-validation. |
hARange | the vector of strictly positive real values from which the best bandwidth has to be chosen by cross-validation for GSIM step A. |
hB | a strictly positive real value, the bandwidth for GSIM step B (estimated by plug-in when NULL). |
NbIterMax | a positive integer, the maximal number of iterations in the algorithm. |
The cross-validation procedure described in Lambert-Lacroix and Peyre (2005) is used to determine the best ridge regularization parameter and bandwidth to be used for classification with GSIM for binary data (for categorical data see mgsim and mgsim.cv).
At each cross-validation run, Xtrain
is split into a pseudo training
set (ntrain - 1 samples) and a pseudo test set (1 sample) and the classification error rate is
determined for each value of ridge regularization parameter and bandwidth. Finally, the function
gsim.cv
returns the values of the ridge regularization parameter and
bandwidth for which the mean classification error rate is minimal.
A list with the following components:
Lambda | the optimal regularization parameter. |
hA | the optimal bandwidth parameter. |
Sophie Lambert-Lacroix (http://membres-timc.imag.fr/Sophie.Lambert/) and Julie Peyre (https://membres-ljk.imag.fr/Julie.Peyre/).
S. Lambert-Lacroix and J. Peyre (2006). Local likelihood regression in generalized linear single-index models with applications to microarray data. Computational Statistics and Data Analysis, 51(3), 2091-2113.
## Not run: ## between 5~15 seconds
# load plsgenomics library
library(plsgenomics)
# load Colon data
data(Colon)
IndexLearn <- c(sample(which(Colon$Y==2),12), sample(which(Colon$Y==1),8))
Xtrain <- Colon$X[IndexLearn,]
Ytrain <- Colon$Y[IndexLearn]
Xtest <- Colon$X[-IndexLearn,]
# preprocess data
resP <- preprocess(Xtrain=Xtrain, Xtest=Xtest, Threshold=c(100,16000),
                   Filtering=c(5,500), log10.scale=TRUE, row.stand=TRUE)
# determine optimal h and lambda
hl <- gsim.cv(Xtrain=resP$pXtrain, Ytrain=Ytrain, hARange=c(7,20),
              LambdaRange=c(0.1,1), hB=NULL)
# perform prediction by GSIM
res <- gsim(Xtrain=resP$pXtrain, Ytrain=Ytrain, Xtest=resP$pXtest,
            Lambda=hl$Lambda, hA=hl$hA, hB=NULL)
res$Cvg
sum(res$Ytest!=Colon$Y[-IndexLearn])
## End(Not run)
Gene expression data (3051 genes and 38 tumor mRNA samples) from the leukemia microarray study of Golub et al. (1999).
data(leukemia)
A list with the following elements:
X | a (38 x 3051) matrix giving the expression levels of 3051 genes for 38 leukemia patients. Each row corresponds to a patient, each column to a gene. |
Y | a numeric vector of length 38 giving the cancer class of each patient. |
gene.names | a matrix containing the names of the 3051 genes for the gene expression matrix. |
The data set was taken from the R package multtest. The data are described in Golub et al. (1999). They were originally collected from http://www.broadinstitute.org/cgi-bin/cancer/publications/pub_paper.cgi?paper_id=43, but this URL is no longer working and we could not find a replacement link in 2024.
S. Dudoit, J. Fridlyand and T. P. Speed (2002). Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association 97, 77–87.
Golub et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286, 531–537.
# load plsgenomics library
library(plsgenomics)
# load data set
data(leukemia)
# how many samples and how many genes?
dim(leukemia$X)
# how many samples of class 1 and 2, respectively?
sum(leukemia$Y==1)
sum(leukemia$Y==2)
The function logit.spls performs compression and variable selection in the context of binary classification (with possible prediction), using the algorithm of Durif et al. (2018) based on Ridge IRLS and sparse PLS.
logit.spls(Xtrain, Ytrain, lambda.ridge, lambda.l1, ncomp, Xtest = NULL,
  adapt = TRUE, maxIter = 100, svd.decompose = TRUE, center.X = TRUE,
  scale.X = FALSE, weighted.center = TRUE)
Xtrain | a (ntrain x p) data matrix of predictor values. |
Ytrain | a (ntrain) vector of binary responses. |
lambda.ridge | a positive real value, the ridge hyper-parameter. |
lambda.l1 | a positive real value in [0,1], the sparsity hyper-parameter. |
ncomp | a positive integer, the number of components. |
Xtest | a (ntest x p) matrix containing the predictor values for the test data set. |
adapt | a boolean value, indicating whether the sparse PLS selection step should be adaptive or not (see details). |
maxIter | a positive integer, the maximal number of iterations in the RIRLS algorithm (see details). |
svd.decompose | a boolean parameter. |
center.X | a boolean value indicating whether the data matrices Xtrain and Xtest (if provided) should be centered or not. |
scale.X | a boolean value indicating whether the data matrices Xtrain and Xtest (if provided) should be scaled or not. |
weighted.center | a boolean value indicating whether the centering should take into account the weighted l2 metric or not in the SPLS step. |
The columns of the data matrices Xtrain and Xtest need not be standardized, since standardization is performed by the function logit.spls as a preliminary step.
The procedure described in Durif et al. (2018) is used to compute
latent sparse components that are used in a logistic regression model.
In addition, when a matrix Xtest
is supplied, the procedure
predicts the response associated to these new values of the predictors.
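As an illustration, here is a minimal sketch applying logit.spls to the Colon data shipped with the package, assuming the {1,2} tissue labels are recoded to {0,1} and using arbitrary, untuned hyper-parameter values:

# load the Colon data and recode the response for the logistic model
library(plsgenomics)
data(Colon)
Y01 <- Colon$Y - 1
# fit with arbitrary hyper-parameters (tune them with logit.spls.cv)
fit <- logit.spls(Xtrain=Colon$X, Ytrain=Y01, lambda.ridge=2,
                  lambda.l1=0.5, ncomp=2)
fit$Anames  # names of the selected genes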
An object of class logit.spls
with the following attributes
Coefficients | the (p+1) vector containing the linear coefficients associated to the predictors and intercept in the logistic model explaining the response Y. |
hatY | the (ntrain) vector containing the estimated response values on the train set Xtrain. |
hatYtest | the (ntest) vector containing the predicted labels for the observations from Xtest. |
DeletedCol | the vector containing the indexes of columns with null variance in Xtrain that were skipped in the procedure. |
A | the active set of predictors selected by the procedure. |
Anames | vector of selected predictor names, i.e. the names of the selected columns from Xtrain. |
converged | a {0,1} value indicating whether the RIRLS algorithm converged in less than maxIter iterations or not. |
X.score | a (n x ncomp) matrix being the observations coordinates or scores in the new component basis produced by the SPLS step (sparse PLS). Each column t.k of X.score is a SPLS component. |
X.weight | a (p x ncomp) matrix being the coefficients of predictors in each component produced by sparse PLS. Each column w.k of X.weight gives the weights defining the corresponding component t.k. |
Xtrain | the design matrix. |
sXtrain | the scaled predictor matrix. |
Ytrain | the response observations. |
sPseudoVar | the scaled pseudo-response produced by the RIRLS algorithm. |
lambda.ridge | the Ridge hyper-parameter used to fit the model. |
lambda.l1 | the sparse hyper-parameter used to fit the model. |
ncomp | the number of components used to fit the model. |
V | the (ntrain x ntrain) matrix used to weight the metric in the sparse PLS step. |
proba | the (ntrain) vector of estimated probabilities for the observations in Xtrain. |
proba.test | the (ntest) vector of predicted probabilities for the new observations in Xtest. |
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
## Not run:
### load plsgenomics library
library(plsgenomics)
### generating data
n <- 100
p <- 100
sample1 <- sample.bin(n=n, p=p, kstar=20, lstar=2, beta.min=0.25,
                      beta.max=0.75, mean.H=0.2, sigma.H=10, sigma.F=5)
X <- sample1$X
Y <- sample1$Y
### splitting between learning and testing set
index.train <- sort(sample(1:n, size=round(0.7*n)))
index.test <- (1:n)[-index.train]
Xtrain <- X[index.train,]
Ytrain <- Y[index.train,]
Xtest <- X[index.test,]
Ytest <- Y[index.test,]
### fitting the model, and predicting new observations
model1 <- logit.spls(Xtrain=Xtrain, Ytrain=Ytrain, lambda.ridge=2,
                     lambda.l1=0.5, ncomp=2, Xtest=Xtest, adapt=TRUE,
                     maxIter=100, svd.decompose=TRUE)
str(model1)
### prediction error rate
sum(model1$hatYtest!=Ytest) / length(index.test)
## End(Not run)
The function logit.spls.cv chooses the optimal values for the hyper-parameters of the logit.spls procedure, by minimizing the averaged prediction error over the hyper-parameter grid, using the LOGIT-SPLS algorithm of Durif et al. (2018).
logit.spls.cv(X, Y, lambda.ridge.range, lambda.l1.range, ncomp.range,
  adapt = TRUE, maxIter = 100, svd.decompose = TRUE, return.grid = FALSE,
  ncores = 1, nfolds = 10, nrun = 1, center.X = TRUE, scale.X = FALSE,
  weighted.center = TRUE, seed = NULL, verbose = TRUE)
X | a (n x p) data matrix of predictors. |
Y | a (n) vector of binary responses. |
lambda.ridge.range | a vector of positive real values, the candidate values for the ridge hyper-parameter. |
lambda.l1.range | a vector of positive real values in [0,1], the candidate values for the sparsity hyper-parameter. |
ncomp.range | a vector of positive integers, the candidate numbers of components. |
adapt | a boolean value, indicating whether the sparse PLS selection step should be adaptive or not (see details). |
maxIter | a positive integer, the maximal number of iterations in the RIRLS algorithm (see details). |
svd.decompose | a boolean parameter. |
return.grid | a boolean value indicating whether the grid of hyper-parameter values with the corresponding mean prediction error rate over the folds should be returned or not. |
ncores | a positive integer, indicating the number of cores that the cross-validation is allowed to use for parallel computation (see details). |
nfolds | a positive integer indicating the number of folds in the K-fold cross-validation procedure, default is 10. |
nrun | a positive integer indicating how many times the K-fold cross-validation procedure should be repeated, default is 1. |
center.X | a boolean value indicating whether the data matrix X should be centered or not. |
scale.X | a boolean value indicating whether the data matrix X should be scaled or not. |
weighted.center | a boolean value indicating whether the centering should take into account the weighted l2 metric or not in the SPLS step. |
seed | a positive integer value (default is NULL). If non NULL, the seed for pseudo-random number generation is set accordingly. |
verbose | a boolean parameter indicating the verbosity. |
The columns of the data matrix X need not be standardized, since standardization is performed by the function logit.spls.cv as a preliminary step.
The procedure is described in Durif et al. (2018). The K-fold cross-validation can be summarized as follows: the training set is partitioned into K folds; for each candidate value of the hyper-parameters, the model is fitted K times, each time on K-1 folds, and the prediction error rate is computed on the held-out fold. The procedure then returns the optimal hyper-parameter values, i.e. those that minimize the prediction error rate averaged over all the folds.
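As a sketch of this principle only (a hand-rolled K-fold loop around logit.spls, assuming X and Y as above and a single candidate value of the hyper-parameters; the function itself performs this internally over the whole grid):

# manual K-fold skeleton for one hyper-parameter candidate (illustrative)
K <- 10
folds <- sample(rep(1:K, length.out=nrow(X)))
errs <- sapply(1:K, function(k) {
  fit <- logit.spls(Xtrain=X[folds!=k,], Ytrain=Y[folds!=k],
                    lambda.ridge=2, lambda.l1=0.5, ncomp=2,
                    Xtest=X[folds==k,])
  mean(fit$hatYtest != Y[folds==k])  # error rate on the held-out fold
})
mean(errs)  # error rate averaged over the folds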
This procedure uses mclapply from the parallel package, available on GNU/Linux and macOS. Users of Microsoft Windows can refer to the README file in the package source for a way to use an mclapply-like function.
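A common guard (a user-side convention, not package code) is to request several cores only on platforms where fork-based parallelism is available:

# mclapply cannot fork on Windows, so fall back to a single core there
ncores <- if (.Platform$OS.type == "windows") 1 else 4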
An object of class logit.spls
with the following attributes
lambda.ridge.opt | the optimal value in lambda.ridge.range. |
lambda.l1.opt | the optimal value in lambda.l1.range. |
ncomp.opt | the optimal value in ncomp.range. |
conv.per | the overall percentage of models that converge during the cross-validation procedure. |
cv.grid | the grid of hyper-parameters and the corresponding prediction error rate averaged over the folds. |
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
## Not run:
### load plsgenomics library
library(plsgenomics)
### generating data
n <- 100
p <- 100
sample1 <- sample.bin(n=n, p=p, kstar=10, lstar=2, beta.min=0.25,
                      beta.max=0.75, mean.H=0.2, sigma.H=10, sigma.F=5)
X <- sample1$X
Y <- sample1$Y
### hyper-parameters values to test
lambda.l1.range <- seq(0.05,0.95,by=0.1) # between 0 and 1
ncomp.range <- 1:10
# log-linear range between 0.01 and 1000 for lambda.ridge.range
logspace <- function(d1, d2, n) exp(log(10)*seq(d1, d2, length.out=n))
lambda.ridge.range <- signif(logspace(d1 <- -2, d2 <- 3, n=21), digits=3)
### tuning the hyper-parameters
cv1 <- logit.spls.cv(X=X, Y=Y, lambda.ridge.range=lambda.ridge.range,
                     lambda.l1.range=lambda.l1.range,
                     ncomp.range=ncomp.range, adapt=TRUE, maxIter=100,
                     svd.decompose=TRUE, return.grid=TRUE, ncores=1,
                     nfolds=10)
str(cv1)
## End(Not run)
The function logit.spls.stab trains a logit-SPLS model for each candidate value (ncomp, lambda.l1, lambda.ridge) of the hyper-parameters on multiple sub-samplings of the data. The stability selection procedure selects the covariates that are selected by most of the models among the grid of hyper-parameters, following the procedure described in Durif et al. (2018). Candidate values for ncomp, lambda.l1 and lambda.ridge are given by the input arguments ncomp.range, lambda.l1.range and lambda.ridge.range respectively.
logit.spls.stab(X, Y, lambda.ridge.range, lambda.l1.range, ncomp.range,
  adapt = TRUE, maxIter = 100, svd.decompose = TRUE, ncores = 1,
  nresamp = 100, center.X = TRUE, scale.X = FALSE, weighted.center = TRUE,
  seed = NULL, verbose = TRUE)
X | a (n x p) data matrix of predictors. |
Y | a (n) vector of binary responses. |
lambda.ridge.range | a vector of positive real values, the candidate values for the ridge hyper-parameter. |
lambda.l1.range | a vector of positive real values in [0,1], the candidate values for the sparsity hyper-parameter. |
ncomp.range | a vector of positive integers, the candidate numbers of components. |
adapt | a boolean value, indicating whether the sparse PLS selection step should be adaptive or not (see details). |
maxIter | a positive integer, the maximal number of iterations in the RIRLS algorithm (see details). |
svd.decompose | a boolean parameter. |
ncores | a positive integer, indicating the number of cores that the procedure is allowed to use for parallel computation (see details). |
nresamp | the number of resamplings of the data used to estimate the probability of selection for each covariate, default is 100. |
center.X | a boolean value indicating whether the data matrix X should be centered or not. |
scale.X | a boolean value indicating whether the data matrix X should be scaled or not. |
weighted.center | a boolean value indicating whether the centering should take into account the weighted l2 metric or not in the SPLS step. |
seed | a positive integer value (default is NULL). If non NULL, the seed for pseudo-random number generation is set accordingly. |
verbose | a boolean parameter indicating the verbosity. |
The columns of the data matrix X need not be standardized, since standardization is performed by the function logit.spls.stab as a preliminary step.
The procedure is described in Durif et al. (2018). The stability selection procedure can be summarized as follows (c.f. Meinshausen and Buhlmann, 2010).
(i) For each candidate value (ncomp, lambda.l1, lambda.ridge) of the hyper-parameters, a logit-SPLS model is trained on nresamp resamplings of the data. Then, for each triplet (ncomp, lambda.l1, lambda.ridge), the probability that a covariate (i.e. a column in X) is selected is computed among the resamplings.
(ii) Finally, the set of "stable selected" variables corresponds to the set of covariates that were selected by most of the trainings among the grid of hyper-parameter candidate values.
This function achieves the first step (i) of the stability selection procedure. The second step (ii) is achieved by the function stability.selection.
This procedure uses mclapply from the parallel package, available on GNU/Linux and macOS. Users of Microsoft Windows can refer to the README file in the package source for a way to use an mclapply-like function.
An object with the following attributes
q.Lambda | A table with values of q.Lambda (c.f. Durif et al. (2018) for the notation), being the averaged number of covariates selected among the entire grid of hyper-parameter candidate values, for increasing size of the hyper-parameter grid. |
probs.lambda | A table with the estimated probability of selection for each covariate, depending on the candidate values for the hyper-parameters. |
p | An integer indicating the number of covariates in the model. |
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
Meinshausen, N., Buhlmann P. (2010). Stability Selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, no. 4, 417-473.
logit.spls, stability.selection, stability.selection.heatmap
## Not run:
### load plsgenomics library
library(plsgenomics)
### generating data
n <- 100
p <- 100
sample1 <- sample.bin(n=n, p=p, kstar=10, lstar=2, beta.min=0.25,
                      beta.max=0.75, mean.H=0.2, sigma.H=10, sigma.F=5)
X <- sample1$X
Y <- sample1$Y
### pertinent covariates id
sample1$sel
### hyper-parameters values to test
lambda.l1.range <- seq(0.05,0.95,by=0.1) # between 0 and 1
ncomp.range <- 1:10
# log-linear range between 0.01 and 1000 for lambda.ridge.range
logspace <- function(d1, d2, n) exp(log(10)*seq(d1, d2, length.out=n))
lambda.ridge.range <- signif(logspace(d1 <- -2, d2 <- 3, n=21), digits=3)
### tuning the hyper-parameters
stab1 <- logit.spls.stab(X=X, Y=Y, lambda.ridge.range=lambda.ridge.range,
                         lambda.l1.range=lambda.l1.range,
                         ncomp.range=ncomp.range, adapt=TRUE, maxIter=100,
                         svd.decompose=TRUE, ncores=1, nresamp=100)
str(stab1)
### heatmap of estimated probabilities
stability.selection.heatmap(stab1)
### selected covariates
stability.selection(stab1, piThreshold=0.6, rhoError=10)
## End(Not run)
Visualization of matrix entries in heatmap format; the color scale depends on the numerical values.
matrix.heatmap(mat, ...)
mat | the matrix to visualize. |
... | any argument that could be passed to the function image.plot. |
The function matrix.heatmap
is a wrapper for the function
image.plot
from the 'fields' package.
No return value; the heatmap is plotted in the current graphics window.
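Since the extra arguments are passed through to image.plot, standard graphics settings can be supplied; for instance (col and main are regular image.plot arguments):

### customize the color scale and add a title via the ... arguments
A <- matrix(runif(100), ncol=10)
matrix.heatmap(A, col=heat.colors(12), main="Random matrix")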
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
logit.spls, stability.selection, stability.selection.heatmap
### load plsgenomics library
library(plsgenomics)
### generate a matrix
A <- matrix(runif(10*10), ncol=10)
### heatmap of the matrix entries
matrix.heatmap(A)
The function mgsim
performs prediction using Lambert-Lacroix and Peyre's MGSIM algorithm.
mgsim(Ytrain, Xtrain, Lambda, h, Xtest=NULL, NbIterMax=50)
Xtrain | a (ntrain x p) data matrix of predictors. |
Ytrain | a ntrain vector of responses. |
Xtest | a (ntest x p) matrix containing the predictors for the test data set. |
Lambda | a positive real value, the ridge regularization parameter. |
h | a strictly positive real value, the bandwidth. |
NbIterMax | a positive integer, the maximal number of iterations in the algorithm. |
The columns of the data matrices Xtrain and Xtest need not be standardized, since standardization is performed by the function mgsim as a preliminary step before the algorithm is run.
The procedure described in Lambert-Lacroix and Peyre (2005) is used to estimate
the c projection directions and the coefficients of the parametric fit obtained
after projecting predictor variables onto the estimated directions. When Xtest
is not equal to NULL, the procedure predicts the labels for these new predictor variables.
A list with the following components:
Ytest | the ntest vector containing the predicted labels for the observations from Xtest. |
beta | the (p x c) matrix containing the c estimated projection directions. |
Coefficients | the (2 x c) matrix containing the coefficients of the parametric fit obtained after projecting the predictor variables onto these estimated directions. |
DeletedCol | the vector containing the column numbers of Xtrain that were removed because the corresponding predictor variables have null variance. |
Cvg | the 0-1 value indicating convergence of the algorithm (1 for convergence, 0 otherwise). |
Sophie Lambert-Lacroix (http://membres-timc.imag.fr/Sophie.Lambert/) and Julie Peyre (https://membres-ljk.imag.fr/Julie.Peyre/).
S. Lambert-Lacroix and J. Peyre (2006). Local likelihood regression in generalized linear single-index models with applications to microarray data. Computational Statistics and Data Analysis, 51(3), 2091-2113.
# load plsgenomics library
library(plsgenomics)
# load SRBCT data
data(SRBCT)
IndexLearn <- c(sample(which(SRBCT$Y==1),10), sample(which(SRBCT$Y==2),4),
                sample(which(SRBCT$Y==3),7), sample(which(SRBCT$Y==4),9))
# perform prediction by MGSIM
res <- mgsim(Ytrain=SRBCT$Y[IndexLearn], Xtrain=SRBCT$X[IndexLearn,],
             Lambda=0.001, h=19, Xtest=SRBCT$X[-IndexLearn,])
res$Cvg
sum(res$Ytest!=SRBCT$Y[-IndexLearn])
# prediction for another sample
Xnew <- SRBCT$X[83,]
# projection of Xnew onto the c estimated directions
Xproj <- Xnew %*% res$beta
# compute the linear predictor for each class except class 1
eta <- diag(cbind(rep(1,3), t(Xproj)) %*% res$Coefficients)
Ypred <- which.max(c(0,eta))
Ypred
SRBCT$Y[83]
The function mgsim.cv
determines the best ridge regularization parameter and bandwidth to
be used for classification with MGSIM as described in Lambert-Lacroix and Peyre (2005).
mgsim.cv(Ytrain, Xtrain, LambdaRange, hRange, NbIterMax=50)
Xtrain | a (ntrain x p) data matrix of predictors. |
Ytrain | a ntrain vector of responses. |
LambdaRange | the vector of positive real values from which the best ridge regularization parameter has to be chosen by cross-validation. |
hRange | the vector of strictly positive real values from which the best bandwidth has to be chosen by cross-validation. |
NbIterMax | a positive integer, the maximal number of iterations in the algorithm. |
The cross-validation procedure described in Lambert-Lacroix and Peyre (2005) is used to determine the best ridge regularization parameter and bandwidth to be used for classification with GSIM for categorical data (for binary data see gsim and gsim.cv).
At each cross-validation run, Xtrain
is split into a pseudo training
set (ntrain-1 samples) and a pseudo test set (1 sample) and the
classification error rate is determined for each
value of ridge regularization parameter and bandwidth. Finally, the function
mgsim.cv
returns the values of the ridge regularization parameter and
bandwidth for which the mean classification error rate is minimal.
A list with the following components:
Lambda | the optimal regularization parameter. |
h | the optimal bandwidth parameter. |
Sophie Lambert-Lacroix (http://membres-timc.imag.fr/Sophie.Lambert/) and Julie Peyre (https://membres-ljk.imag.fr/Julie.Peyre/).
S. Lambert-Lacroix and J. Peyre (2006). Local likelihood regression in generalized linear single-index models with applications to microarray data. Computational Statistics and Data Analysis, 51(3), 2091-2113.
## Not run: ## between 5~15 seconds
# load plsgenomics library
library(plsgenomics)
# load SRBCT data
data(SRBCT)
IndexLearn <- c(sample(which(SRBCT$Y==1),10), sample(which(SRBCT$Y==2),4),
                sample(which(SRBCT$Y==3),7), sample(which(SRBCT$Y==4),9))
### determine optimal h and lambda
# /!\ takes about 30 seconds to run
#hl <- mgsim.cv(Ytrain=SRBCT$Y[IndexLearn], Xtrain=SRBCT$X[IndexLearn,],
#               LambdaRange=c(0.1), hRange=c(7,20))
### perform prediction by MGSIM
#res <- mgsim(Ytrain=SRBCT$Y[IndexLearn], Xtrain=SRBCT$X[IndexLearn,],
#             Lambda=hl$Lambda, h=hl$h, Xtest=SRBCT$X[-IndexLearn,])
#res$Cvg
#sum(res$Ytest!=SRBCT$Y[-IndexLearn])
## End(Not run)
The function mrpls performs prediction using the MRPLS algorithm of Fort et al. (2005).
mrpls(Ytrain, Xtrain, Lambda, ncomp, Xtest=NULL, NbIterMax=50)
Xtrain | a (ntrain x p) data matrix of predictors. |
Ytrain | a ntrain vector of responses. |
Xtest | a (ntest x p) matrix containing the predictors for the test data set. |
Lambda | a positive real value, the ridge regularization parameter. |
ncomp | a positive integer, the number of PLS components. |
NbIterMax | a positive integer, the maximal number of iterations in the algorithm. |
The columns of the data matrices Xtrain and Xtest need not be standardized, since standardization is performed by the function mrpls as a preliminary step before the algorithm is run.
The procedure described in Fort et al. (2005) is used to determine
latent components to be used for classification and when Xtest
is not equal to NULL, the procedure predicts the labels for these new
predictor variables.
A list with the following components:
Coefficients | the (p+1) x c matrix containing the coefficients weighting the block design matrix. |
hatY | the ntrain vector containing the estimated {0,...,c}-valued labels for the observations from Xtrain. |
hatYtest | the ntest vector containing the predicted {0,...,c}-valued labels for the observations from Xtest. |
proba | the ntrain vector containing the estimated probabilities for the observations from Xtrain. |
proba.test | the ntest vector containing the predicted probabilities for the observations from Xtest. |
DeletedCol | the vector containing the column numbers of Xtrain that were removed because the corresponding predictor variables have null variance. |
hatYtest_k | if ncomp is greater than 1, a (ntest x ncomp) matrix whose k-th column contains the predicted labels for the observations from Xtest obtained with k PLS components. |
Sophie Lambert-Lacroix (http://membres-timc.imag.fr/Sophie.Lambert/).
G. Fort, S. Lambert-Lacroix and J. Peyre (2005). Reduction de dimension dans les modeles lineaires generalises : application a la classification supervisee de donnees issues des biopuces. Journal de la SFDS, 146(1-2), 117-152.
# load plsgenomics library
library(plsgenomics)
# load SRBCT data
data(SRBCT)
IndexLearn <- c(sample(which(SRBCT$Y==1),10), sample(which(SRBCT$Y==2),4),
                sample(which(SRBCT$Y==3),7), sample(which(SRBCT$Y==4),9))
# perform prediction by MRPLS
res <- mrpls(Ytrain=SRBCT$Y[IndexLearn]-1, Xtrain=SRBCT$X[IndexLearn,],
             Lambda=0.001, ncomp=2, Xtest=SRBCT$X[-IndexLearn,])
sum(res$Ytest!=SRBCT$Y[-IndexLearn]-1)
# prediction for another sample
Xnew <- SRBCT$X[83,]
# compute the linear predictor for each class except class 1
eta <- diag(t(cbind(c(1,Xnew), c(1,Xnew), c(1,Xnew))) %*% res$Coefficients)
Ypred <- which.max(c(0,eta))
Ypred+1
SRBCT$Y[83]
The function mrpls.cv determines the best ridge regularization parameter and the best number of PLS components to be used for classification with the MRPLS algorithm of Fort et al. (2005).
mrpls.cv(Ytrain, Xtrain, LambdaRange, ncompMax, NbIterMax=50, ncores=1)
Xtrain | a (ntrain x p) data matrix of predictors. |
Ytrain | a ntrain vector of responses. |
LambdaRange | the vector of positive real values from which the best ridge regularization parameter has to be chosen by cross-validation. |
ncompMax | a positive integer. The best number of components is chosen from 1, ..., ncompMax. |
NbIterMax | a positive integer, the maximal number of iterations in the algorithm. |
ncores | a positive integer. The number of cores to be used for parallel computing (if different from 1). |
A cross-validation procedure is used to determine the best ridge regularization parameter and number of PLS components to be used for classification with MRPLS for categorical data (for binary data see rpls and rpls.cv).
At each cross-validation run, Xtrain is split into a pseudo training set (ntrain-1 samples) and a pseudo test set (1 sample), and the classification error rate is determined for each value of the ridge regularization parameter and number of components. Finally, the function mrpls.cv returns the values of the ridge regularization parameter and number of components for which the mean classification error rate is minimal.
A list with the following components:
Lambda | the optimal regularization parameter. |
ncomp | the optimal number of PLS components. |
Sophie Lambert-Lacroix (http://membres-timc.imag.fr/Sophie.Lambert/).
G. Fort, S. Lambert-Lacroix and J. Peyre (2005). Reduction de dimension dans les modeles lineaires generalises : application a la classification supervisee de donnees issues des biopuces. Journal de la SFDS, 146(1-2), 117-152.
## Not run: ## between 5~15 seconds
# load plsgenomics library
library(plsgenomics)
# load SRBCT data
data(SRBCT)
IndexLearn <- c(sample(which(SRBCT$Y==1),10), sample(which(SRBCT$Y==2),4),
                sample(which(SRBCT$Y==3),7), sample(which(SRBCT$Y==4),9))
# determine optimal ncomp and Lambda
nl <- mrpls.cv(Ytrain=SRBCT$Y[IndexLearn]-1, Xtrain=SRBCT$X[IndexLearn,],
               LambdaRange=c(0.1,1), ncompMax=3)
# perform prediction by MRPLS
res <- mrpls(Ytrain=SRBCT$Y[IndexLearn]-1, Xtrain=SRBCT$X[IndexLearn,],
             Lambda=nl$Lambda, ncomp=nl$ncomp, Xtest=SRBCT$X[-IndexLearn,])
sum(res$Ytest!=SRBCT$Y[-IndexLearn]-1)
## End(Not run)
The function multinom.spls performs compression and variable selection in the context of multi-class ('nclass' > 2) classification (with possible prediction), using the algorithm of Durif et al. (2018) based on Ridge IRLS and sparse PLS.
multinom.spls(Xtrain, Ytrain, lambda.ridge, lambda.l1, ncomp, Xtest = NULL,
  adapt = TRUE, maxIter = 100, svd.decompose = TRUE, center.X = TRUE,
  scale.X = FALSE, weighted.center = TRUE)
Xtrain | a (ntrain x p) data matrix of predictor values. |
Ytrain | a (ntrain) vector of class labels. |
lambda.ridge | a positive real value, the ridge hyper-parameter. |
lambda.l1 | a positive real value in [0,1], the sparsity hyper-parameter. |
ncomp | a positive integer, the number of components. |
Xtest | a (ntest x p) matrix containing the predictor values for the test data set. |
adapt | a boolean value, indicating whether the sparse PLS selection step should be adaptive or not (see details). |
maxIter | a positive integer, the maximal number of iterations in the RIRLS algorithm (see details). |
svd.decompose | a boolean parameter. |
center.X | a boolean value indicating whether the data matrices Xtrain and Xtest (if provided) should be centered or not. |
scale.X | a boolean value indicating whether the data matrices Xtrain and Xtest (if provided) should be scaled or not. |
weighted.center | a boolean value indicating whether the centering should take into account the weighted l2 metric or not in the SPLS step. |
The columns of the data matrices Xtrain and Xtest need not be standardized, since standardization is performed by the function multinom.spls as a preliminary step.
The procedure described in Durif et al. (2018) is used to compute
latent sparse components that are used in a multinomial regression model.
In addition, when a matrix Xtest
is supplied, the procedure
predicts the response associated to these new values of the predictors.
An object of class multinom.spls
with the following attributes
Coefficients | a (p+1) x (nclass-1) matrix containing the linear coefficients associated to the predictors and intercept in the multinomial model explaining the response Y. |
hatY | the (ntrain) vector containing the estimated response values on the train set Xtrain. |
hatYtest | the (ntest) vector containing the predicted labels for the observations from Xtest. |
DeletedCol | the vector containing the indexes of columns with null variance in Xtrain that were skipped in the procedure. |
A | a list of size nclass-1 with the predictors selected by the procedure for each set of coefficients in the multinomial model (i.e. the indexes of the corresponding non null entries in each column of Coefficients). |
A.full | union of the elements in A, corresponding to the predictors selected in the full model. |
Anames | vector of selected predictor names, i.e. the names of the selected columns from Xtrain. |
converged | a {0,1} value indicating whether the RIRLS algorithm converged in less than maxIter iterations or not. |
X.score | a list of nclass-1 different (n x ncomp) matrices being the observations coordinates or scores in the new component basis produced for each class in the multinomial model by the SPLS step (sparse PLS), see Durif et al. (2018) for details. |
X.weight | a list of nclass-1 different (p x ncomp) matrices being the coefficients of predictors in each component produced for each class in the multinomial model by the sparse PLS, see Durif et al. (2018) for details. |
X.score.full | a ((n x (nclass-1)) x ncomp) matrix being the observations coordinates or scores in the new component basis produced by the SPLS step (sparse PLS) in the linearized multinomial model, see Durif et al. (2018). Each column t.k of X.score.full is a SPLS component. |
X.weight.full | a (p x ncomp) matrix being the coefficients of predictors in each component produced by sparse PLS in the linearized multinomial model, see Durif et al. (2018). Each column w.k of X.weight.full gives the weights defining the corresponding component t.k. |
lambda.ridge | the Ridge hyper-parameter used to fit the model. |
lambda.l1 | the sparse hyper-parameter used to fit the model. |
ncomp | the number of components used to fit the model. |
V | the (ntrain x ntrain) matrix used to weight the metric in the sparse PLS step. |
proba | the (ntrain) vector of estimated probabilities for the observations in Xtrain. |
proba.test | the (ntest) vector of predicted probabilities for the new observations in Xtest. |
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
spls, logit.spls, multinom.spls.cv
## Not run:
### load plsgenomics library
library(plsgenomics)
### generating data
n <- 100
p <- 100
nclass <- 3
sample1 <- sample.multinom(n, p, nb.class=nclass, kstar=20, lstar=2,
                           beta.min=0.25, beta.max=0.75, mean.H=0.2,
                           sigma.H=10, sigma.F=5)
X <- sample1$X
Y <- sample1$Y
### splitting between learning and testing set
index.train <- sort(sample(1:n, size=round(0.7*n)))
index.test <- (1:n)[-index.train]
Xtrain <- X[index.train,]
Ytrain <- Y[index.train,]
Xtest <- X[index.test,]
Ytest <- Y[index.test,]
### fitting the model, and predicting new observations
model1 <- multinom.spls(Xtrain=Xtrain, Ytrain=Ytrain, lambda.ridge=2,
                        lambda.l1=0.5, ncomp=2, Xtest=Xtest, adapt=TRUE,
                        maxIter=100, svd.decompose=TRUE)
str(model1)
### prediction error rate
sum(model1$hatYtest!=Ytest) / length(index.test)
## End(Not run)
The function multinom.spls.cv chooses the optimal values for the hyper-parameters of the multinom.spls procedure, by minimizing the averaged prediction error over the hyper-parameter grid, using the multinomial-SPLS algorithm of Durif et al. (2018).
multinom.spls.cv(X, Y, lambda.ridge.range, lambda.l1.range, ncomp.range,
  adapt = TRUE, maxIter = 100, svd.decompose = TRUE, return.grid = FALSE,
  ncores = 1, nfolds = 10, nrun = 1, center.X = TRUE, scale.X = FALSE,
  weighted.center = TRUE, seed = NULL, verbose = TRUE)
X | a (n x p) data matrix of predictors. |
Y | a (n) vector of class labels. |
lambda.ridge.range | a vector of positive real values, the candidate values for the ridge hyper-parameter. |
lambda.l1.range | a vector of positive real values in [0,1], the candidate values for the sparsity hyper-parameter. |
ncomp.range | a vector of positive integers, the candidate numbers of components. |
adapt | a boolean value, indicating whether the sparse PLS selection step should be adaptive or not (see details). |
maxIter | a positive integer, the maximal number of iterations in the RIRLS algorithm (see details). |
svd.decompose | a boolean parameter. |
return.grid | a boolean value indicating whether the grid of hyper-parameter values with the corresponding mean prediction error rate over the folds should be returned or not. |
ncores | a positive integer, indicating the number of cores that the cross-validation is allowed to use for parallel computation (see details). |
nfolds | a positive integer indicating the number of folds in the K-fold cross-validation procedure, default is 10. |
nrun | a positive integer indicating how many times the K-fold cross-validation procedure should be repeated, default is 1. |
center.X | a boolean value indicating whether the data matrix X should be centered or not. |
scale.X | a boolean value indicating whether the data matrix X should be scaled or not. |
weighted.center | a boolean value indicating whether the centering should take into account the weighted l2 metric or not in the SPLS step. |
seed | a positive integer value (default is NULL). If non NULL, the seed for pseudo-random number generation is set accordingly. |
verbose | a boolean parameter indicating the verbosity. |
The columns of the data matrix X need not be standardized, since standardization is performed by the function multinom.spls.cv as a preliminary step.
The procedure is described in Durif et al. (2018). The K-fold cross-validation can be summarized as follows: the training set is partitioned into K folds; for each candidate value of the hyper-parameters, the model is fitted K times, each time on K-1 folds, and the prediction error rate is computed on the held-out fold. The procedure then returns the optimal hyper-parameter values, i.e. those that minimize the prediction error rate averaged over all the folds.
This procedure uses mclapply from the parallel package, available on GNU/Linux and macOS. Users of Microsoft Windows can refer to the README file in the package source for a way to use an mclapply-like function.
An object of class multinom.spls
with the following attributes
lambda.ridge.opt | the optimal value in lambda.ridge.range. |
lambda.l1.opt | the optimal value in lambda.l1.range. |
ncomp.opt | the optimal value in ncomp.range. |
conv.per | the overall percentage of models that converge during the cross-validation procedure. |
cv.grid | the grid of hyper-parameters and the corresponding prediction error rate averaged over the folds. |
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
multinom.spls, multinom.spls.stab
## Not run:
### load plsgenomics library
library(plsgenomics)
### generating data
n <- 100
p <- 100
nclass <- 3
sample1 <- sample.multinom(n=n, p=p, nb.class=nclass, kstar=10, lstar=2,
                           beta.min=0.25, beta.max=0.75, mean.H=0.2,
                           sigma.H=10, sigma.F=5)
X <- sample1$X
Y <- sample1$Y
### hyper-parameters values to test
lambda.l1.range <- seq(0.05,0.95,by=0.1) # between 0 and 1
ncomp.range <- 1:10
# log-linear range between 0.01 and 1000 for lambda.ridge.range
logspace <- function(d1, d2, n) exp(log(10)*seq(d1, d2, length.out=n))
lambda.ridge.range <- signif(logspace(d1 <- -2, d2 <- 3, n=21), digits=3)
### tuning the hyper-parameters
cv1 <- multinom.spls.cv(X=X, Y=Y, lambda.ridge.range=lambda.ridge.range,
                        lambda.l1.range=lambda.l1.range,
                        ncomp.range=ncomp.range, adapt=TRUE, maxIter=100,
                        svd.decompose=TRUE, return.grid=TRUE, ncores=1,
                        nfolds=10)
str(cv1)
## End(Not run)
The function multinom.spls.stab trains a multinomial-SPLS model for each candidate value (ncomp, lambda.l1, lambda.ridge) of the hyper-parameters on multiple sub-samplings of the data. The stability selection procedure selects the covariates that are selected by most of the models among the grid of hyper-parameters, following the procedure described in Durif et al. (2018). Candidate values for ncomp, lambda.l1 and lambda.ridge are given by the input arguments ncomp.range, lambda.l1.range and lambda.ridge.range respectively.
multinom.spls.stab(X, Y, lambda.ridge.range, lambda.l1.range, ncomp.range,
  adapt = TRUE, maxIter = 100, svd.decompose = TRUE, ncores = 1,
  nresamp = 100, center.X = TRUE, scale.X = FALSE, weighted.center = TRUE,
  seed = NULL, verbose = TRUE)
X | a (n x p) data matrix of predictors. |
Y | a (n) vector of class labels. |
lambda.ridge.range | a vector of positive real values, the candidate values for the ridge hyper-parameter. |
lambda.l1.range | a vector of positive real values in [0,1], the candidate values for the sparsity hyper-parameter. |
ncomp.range | a vector of positive integers, the candidate numbers of components. |
adapt | a boolean value, indicating whether the sparse PLS selection step should be adaptive or not (see details). |
maxIter | a positive integer, the maximal number of iterations in the RIRLS algorithm (see details). |
svd.decompose | a boolean parameter. |
ncores | a positive integer, indicating the number of cores that the procedure is allowed to use for parallel computation (see details). |
nresamp | the number of resamplings of the data used to estimate the probability of selection for each covariate, default is 100. |
center.X | a boolean value indicating whether the data matrix X should be centered or not. |
scale.X | a boolean value indicating whether the data matrix X should be scaled or not. |
weighted.center | a boolean value indicating whether the centering should take into account the weighted l2 metric or not in the SPLS step. |
seed | a positive integer value (default is NULL). If non NULL, the seed for pseudo-random number generation is set accordingly. |
verbose | a boolean parameter indicating the verbosity. |
The columns of the data matrix X need not be standardized, since standardization is performed by the function multinom.spls.stab as a preliminary step.
The procedure is described in Durif et al. (2018). The stability selection procedure can be summarized as follows (c.f. Meinshausen and Buhlmann, 2010).
(i) For each candidate value (ncomp, lambda.l1, lambda.ridge) of the hyper-parameters, a multinomial-spls model is trained on nresamp resamplings of the data. Then, for each triplet (ncomp, lambda.l1, lambda.ridge), the probability that a covariate (i.e. a column in X) is selected is computed among the resamplings. The estimated probabilities can be visualized as a heatmap with the function stability.selection.heatmap.
(ii) Eventually, the set of "stable selected" variables corresponds to the set of covariates that were selected by most of the trained models among the grid of hyper-parameter candidate values.
This function achieves the first step (i) of the stability selection procedure. The second step (ii) is achieved by the function stability.selection.
This procedure uses mclapply from the parallel package, available on GNU/Linux and MacOS. Users of Microsoft Windows can refer to the README file in the source to be able to use an mclapply-type function.
An object with the following attributes
q.Lambda |
A table with values of q.Lambda (c.f. Durif et al. (2018) for the notation), being the averaged number of covariates selected among the entire grid of hyper-parameter candidate values, for increasing size of the hyper-parameter grid. |
probs.lambda |
A table with the estimated probability of selection for each covariate depending on the candidate values for the hyper-parameters. |
p |
An integer value indicating the number of covariates in the model. |
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
Meinshausen, N., Buhlmann P. (2010). Stability Selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, no. 4, 417-473.
multinom.spls, stability.selection, stability.selection.heatmap
## Not run: 
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 100
nclass <- 3
sample1 <- sample.multinom(n, p, nb.class=nclass, kstar=20, lstar=2,
                           beta.min=0.25, beta.max=0.75,
                           mean.H=0.2, sigma.H=10, sigma.F=5)
X <- sample1$X
Y <- sample1$Y

### pertinent covariates id
sample1$sel

### hyper-parameters values to test
lambda.l1.range <- seq(0.05,0.95,by=0.1) # between 0 and 1
ncomp.range <- 1:10
# log-linear range between 0.01 and 1000 for lambda.ridge.range
logspace <- function( d1, d2, n) exp(log(10)*seq(d1, d2, length.out=n))
lambda.ridge.range <- signif(logspace(d1 <- -2, d2 <- 3, n=21), digits=3)

### tuning the hyper-parameters
stab1 <- multinom.spls.stab(X=X, Y=Y, lambda.ridge.range=lambda.ridge.range,
                            lambda.l1.range=lambda.l1.range,
                            ncomp.range=ncomp.range,
                            adapt=TRUE, maxIter=100, svd.decompose=TRUE,
                            ncores=1, nresamp=100)
str(stab1)

### heatmap of estimated probabilities
stability.selection.heatmap(stab1)

### selected covariates
stability.selection(stab1, piThreshold=0.6, rhoError=10)

## End(Not run)
The function pls.lda performs binary or multicategorical classification using the method described in Boulesteix (2004), which consists of PLS dimension reduction followed by linear discriminant analysis applied to the PLS components.
pls.lda(Xtrain, Ytrain, Xtest=NULL, ncomp, nruncv=0, alpha=2/3, priors=NULL)
Xtrain |
a (ntrain x p) data matrix containing the predictors for the training data set. Xtrain may be a matrix or a data frame. Each row is an observation and each column is a predictor variable. |
Ytrain |
a vector of length ntrain giving the classes of the ntrain observations. The classes must be coded as 1,...,K (K>=2). |
Xtest |
a (ntest x p) data matrix containing the predictors for the test
data set. |
ncomp |
if |
nruncv |
the number of cross-validation iterations to be performed for the choice of
the number of latent components. If |
alpha |
the proportion of observations to be included in the training set at each cross-validation iteration. |
priors |
The class priors to be used for linear discriminant analysis. If unspecified, the class proportions in the training set are used. |
The function pls.lda
proceeds as follows to predict the class of the
observations from the test data set.
First, the SIMPLS algorithm is run on Xtrain
and Ytrain
to
determine the new PLS components based on the training observations only.
The new PLS components are then computed for the test
data set. Classification is performed by applying classical linear
discriminant analysis (LDA) to the new components. Of course, the LDA
classifier is built using the training observations only.
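To make the two stages concrete, here is a minimal sketch that reproduces the pipeline by hand with pls.regression (SIMPLS) and lda from the MASS package; the dummy coding of Ytrain below is an illustrative assumption, not the internal code of pls.lda:

# manual version of the two-stage procedure described above (illustrative)
library(plsgenomics)
library(MASS)
data(leukemia)
Xtrain <- leukemia$X[-(1:3),]
Ytrain <- leukemia$Y[-(1:3)]
Xtest <- leukemia$X[1:3,]
# step 1: SIMPLS on the training data, with the classes dummy-coded
Ydummy <- sapply(sort(unique(Ytrain)), function(k) as.numeric(Ytrain == k))
pls.out <- pls.regression(Xtrain=Xtrain, Ytrain=Ydummy, ncomp=2)
# training scores, and test scores computed with the training column means
Ttrain <- pls.out$T
Ttest <- scale(Xtest, center=pls.out$meanX, scale=FALSE) %*% pls.out$R
# step 2: classical LDA on the PLS components, fitted on the training scores
lda.out <- lda(x=Ttrain, grouping=Ytrain)
predict(lda.out, newdata=Ttest)$class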
A list with the following components:
predclass |
the vector containing the predicted classes of the ntest observations from
|
ncomp |
the number of latent components used for classification. |
pls.out |
an object containing the results from the call of the |
lda.out |
an object containing the results from the call of the |
pred.lda.out |
an object containing the results from the call of the |
Anne-Laure Boulesteix (https://www.ibe.med.uni-muenchen.de/mitarbeiter/professoren/boulesteix/index.html)
A. L. Boulesteix (2004). PLS dimension reduction for classification with microarray data, Statistical Applications in Genetics and Molecular Biology 3, Issue 1, Article 33.
A. L. Boulesteix, K. Strimmer (2007). Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics 7:32-44.
S. de Jong (1993). SIMPLS: an alternative approach to partial least squares regression, Chemometrics Intell. Lab. Syst. 18, 251–263.
pls.regression, variable.selection, pls.lda.cv.
# load plsgenomics library
library(plsgenomics)

# load leukemia data
data(leukemia)

# Classify observations 1,2,3 (test set) using observations 4 to 38
# (training set), with 2 PLS components
pls.lda(Xtrain=leukemia$X[-(1:3),], Ytrain=leukemia$Y[-(1:3)],
        Xtest=leukemia$X[1:3,], ncomp=2, nruncv=0)

# Classify observations 1,2,3 (test set) using observations 4 to 38
# (training set), with the best number of components as determined by
# cross-validation
pls.lda(Xtrain=leukemia$X[-(1:3),], Ytrain=leukemia$Y[-(1:3)],
        Xtest=leukemia$X[1:3,], ncomp=1:4, nruncv=20)
The function pls.lda.cv
determines the best number of latent components to be used for
classification with PLS dimension reduction and linear discriminant analysis as described in
Boulesteix (2004).
pls.lda.cv(Xtrain, Ytrain, ncomp, nruncv=20, alpha=2/3, priors=NULL)
Xtrain |
a (ntrain x p) data matrix containing the predictors for the training data set.
|
Ytrain |
a vector of length ntrain giving the classes of the ntrain observations. The classes must be coded as 1,...,K (K>=2). |
ncomp |
the vector of integers from which the best number of latent
components has to be chosen by cross-validation. If |
nruncv |
the number of cross-validation iterations to be performed for the choice of the number of latent components. |
alpha |
the proportion of observations to be included in the training set at each cross-validation iteration. |
priors |
The class priors to be used for linear discriminant analysis. If unspecified, the class proportions in the training set are used. |
The cross-validation procedure described in Boulesteix (2004) is used to determine the best number of latent components to be used for classification. At each cross-validation run, Xtrain is split into a pseudo training set and a pseudo test set, and the classification error rate is determined for each number of latent components. Finally, the function pls.lda.cv returns the number of latent components for which the mean classification error rate over the nruncv partitions is minimal.
The number of latent components to be used for classification.
Anne-Laure Boulesteix (https://www.ibe.med.uni-muenchen.de/mitarbeiter/professoren/boulesteix/index.html)
A. L. Boulesteix (2004). PLS dimension reduction for classification with microarray data, Statistical Applications in Genetics and Molecular Biology 3, Issue 1, Article 33.
A. L. Boulesteix, K. Strimmer (2007). Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics 7:32-44.
S. de Jong (1993). SIMPLS: an alternative approach to partial least squares regression, Chemometrics Intell. Lab. Syst. 18, 251–263.
## Not run: 
## between 5~15 seconds
# load plsgenomics library
library(plsgenomics)

# load leukemia data
data(leukemia)

# Determine the best number of components to be used for classification
# using the cross-validation procedure
# choose the best number from 2,3,4
pls.lda.cv(Xtrain=leukemia$X, Ytrain=leukemia$Y, ncomp=2:4, nruncv=20)
# choose the best number from 1,2,3
pls.lda.cv(Xtrain=leukemia$X, Ytrain=leukemia$Y, ncomp=3, nruncv=20)

## End(Not run)
The function pls.regression performs PLS multivariate regression (with several response variables and several predictor variables) using de Jong's SIMPLS algorithm. This function is an adaptation of R. Wehrens' code from the package pls.pcr.
pls.regression(Xtrain, Ytrain, Xtest=NULL, ncomp=NULL, unit.weights=TRUE)
Xtrain |
a (ntrain x p) data matrix of predictors. |
Ytrain |
a (ntrain x m) data matrix of responses. |
Xtest |
a (ntest x p) matrix containing the predictors for the test data
set. |
ncomp |
the number of latent components to be used for regression. If
|
unit.weights |
if |
The columns of the data matrices Xtrain and Ytrain need not be centered to have mean zero, since centering is performed by the function pls.regression as a preliminary step before the SIMPLS algorithm is run.
In the original definition of SIMPLS by de Jong (1993), the weight vectors have length 1. If the weight vectors are standardized to have length 1, they satisfy a simple optimality criterion (de Jong, 1993). However, it is also usual (and computationally efficient) to standardize the latent components to have length 1.
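A quick empirical check of the two conventions is possible on the Ecoli data; reading the norms off the returned R (weight vectors) and T (latent components) matrices is an assumption based on the Value section below, so the expected values in the comments should be verified rather than taken as documented guarantees:

# inspecting which matrix carries the unit norms under each setting (illustrative)
library(plsgenomics)
data(Ecoli)
fit.w <- pls.regression(Xtrain=Ecoli$CONNECdata, Ytrain=Ecoli$GEdata,
                        ncomp=3, unit.weights=TRUE)
fit.t <- pls.regression(Xtrain=Ecoli$CONNECdata, Ytrain=Ecoli$GEdata,
                        ncomp=3, unit.weights=FALSE)
apply(fit.w$R, 2, function(v) sqrt(sum(v^2)))  # expected ~1: unit weight vectors
apply(fit.t$T, 2, function(v) sqrt(sum(v^2)))  # expected ~1: unit latent components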
In contrast to the original version found in the package pls.pcr, the prediction for the observations from Xtest is performed after centering the columns of Xtest by subtracting the column means calculated from Xtrain.
A list with the following components:
B |
the (p x m x length( |
Ypred |
the (ntest x m x length( |
P |
the (p x max( |
Q |
the (m x max( |
T |
the (ntrain x max( |
R |
the (p x max( |
meanX |
the p-vector containing the means of the columns of |
Anne-Laure Boulesteix (https://www.ibe.med.uni-muenchen.de/mitarbeiter/professoren/boulesteix/index.html) and Korbinian Strimmer (https://strimmerlab.github.io/korbinian.html).
Adapted in part from pls.pcr code by R. Wehrens (in a former version of the 'pls' package https://CRAN.R-project.org/package=pls).
S. de Jong (1993). SIMPLS: an alternative approach to partial least squares regression, Chemometrics Intell. Lab. Syst. 18, 251–263.
C. J. F. ter Braak and S. de Jong (1993). The objective function of partial least squares regression, Journal of Chemometrics 12, 41–54.
pls.lda, TFA.estimate, pls.regression.cv.
# load plsgenomics library
library(plsgenomics)

# load the Ecoli data
data(Ecoli)

# perform pls regression
# with unit latent components
pls.regression(Xtrain=Ecoli$CONNECdata, Ytrain=Ecoli$GEdata,
               Xtest=Ecoli$CONNECdata, ncomp=1:3, unit.weights=FALSE)
# with unit weight vectors
pls.regression(Xtrain=Ecoli$CONNECdata, Ytrain=Ecoli$GEdata,
               Xtest=Ecoli$CONNECdata, ncomp=1:3, unit.weights=TRUE)
The function pls.regression.cv
determines the best number of latent components to be used
for PLS regression using the cross-validation approach described in Boulesteix and Strimmer (2005).
pls.regression.cv(Xtrain, Ytrain, ncomp, nruncv=20, alpha=2/3)
Xtrain |
a (ntrain x p) data matrix containing the predictors for the training data set.
|
Ytrain |
a (ntrain x m) data matrix of responses. |
ncomp |
the vector of integers from which the best number of latent
components has to be chosen by cross-validation. If |
nruncv |
the number of cross-validation iterations to be performed for the choice of the number of latent components. |
alpha |
the proportion of observations to be included in the training set at each cross-validation iteration. |
The cross-validation procedure described in Boulesteix and Strimmer (2005) is used to determine the best number of latent components to be used for PLS regression. At each cross-validation run, Xtrain is split into a pseudo training set and a pseudo test set, and the squared error is determined for each number of latent components. Finally, the function pls.regression.cv returns the number of latent components for which the mean squared error over the nruncv partitions is minimal.
The number of latent components to be used in PLS regression, as determined by cross-validation.
Anne-Laure Boulesteix (https://www.ibe.med.uni-muenchen.de/mitarbeiter/professoren/boulesteix/index.html) and Korbinian Strimmer (https://strimmerlab.github.io/korbinian.html).
A. L. Boulesteix and K. Strimmer (2005). Predicting Transcription Factor Activities from Combined Analysis of Microarray and ChIP Data: A Partial Least Squares Approach.
A. L. Boulesteix, K. Strimmer (2007). Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics 7:32-44.
S. de Jong (1993). SIMPLS: an alternative approach to partial least squares regression, Chemometrics Intell. Lab. Syst. 18, 251–263.
pls.regression, TFA.estimate, pls.lda.cv.
## Not run: 
## between 5~15 seconds
# load plsgenomics library
library(plsgenomics)

# load Ecoli data
data(Ecoli)

# determine the best number of components for PLS regression using the
# cross-validation approach
# choose the best number from 1,2,3,4
pls.regression.cv(Xtrain=Ecoli$CONNECdata, Ytrain=Ecoli$GEdata, ncomp=4, nruncv=20)
# choose the best number from 2,3
pls.regression.cv(Xtrain=Ecoli$CONNECdata, Ytrain=Ecoli$GEdata, ncomp=c(2,3), nruncv=20)

## End(Not run)
These functions are provided for compatibility with older versions of the 'plsgenomics' package. They may eventually be removed completely.
m.rirls.spls(...)
... |
Parameters to be passed to the modern version of the function |
rirls.spls |
is replaced by logit.spls
|
rirls.spls.tune |
is replaced by logit.spls.cv
|
rirls.spls.stab |
is replaced by logit.spls.stab
|
m.rirls.spls |
is replaced by multinom.spls
|
m.rirls.spls.tune |
is replaced by multinom.spls.cv
|
m.rirls.spls.stab |
is replaced by multinom.spls.stab
|
spls.adapt |
is replaced by spls
|
spls.adapt.tune |
is replaced by spls.cv
|
The function preprocess performs preprocessing of microarray data.
preprocess(Xtrain, Xtest=NULL, Threshold=c(100,16000), Filtering=c(5,500), log10.scale=TRUE, row.stand=TRUE)
Xtrain |
a (ntrain x p) data matrix of predictors. |
Xtest |
a (ntest x p) matrix containing the predictors for the test data
set. |
Threshold |
a vector of length 2 containing the values (threshmin, threshmax) for
thresholding data in preprocess. Values below threshmin are raised to threshmin and
values above threshmax are lowered to threshmax. If |
Filtering |
a vector of length 2 containing the values (FiltMin,FiltMax) for filtering genes
in preprocess. Genes with max/min <= FiltMin and (max-min) <= FiltMax are excluded.
If |
log10.scale |
a logical value equal to TRUE if a log10-transformation has to be done. |
row.stand |
a logical value equal to TRUE if a row-wise standardisation has to be done. |
The pre-processing steps recommended by Dudoit et al. (2002) are performed. The default values are those suited to the Colon data.
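As a rough illustration of these steps, here is a base-R sketch of the thresholding, filtering, log-transformation and row standardisation on the Colon data; it reflects one reading of Dudoit et al. (2002) and is not the exact code of preprocess (which, in particular, handles Xtrain and Xtest jointly):

# illustrative base-R version of the pre-processing steps described above
library(plsgenomics)
data(Colon)
X <- Colon$X
threshmin <- 100; threshmax <- 16000   # Threshold=c(100,16000)
FiltMin <- 5; FiltMax <- 500           # Filtering=c(5,500)
# thresholding: floor values at threshmin and cap them at threshmax
X <- pmin(pmax(X, threshmin), threshmax)
# filtering: exclude genes with max/min <= FiltMin and (max - min) <= FiltMax
mx <- apply(X, 2, max)
mn <- apply(X, 2, min)
keep <- (mx/mn > FiltMin) | (mx - mn > FiltMax)
X <- X[, keep]
# log10 transformation
X <- log10(X)
# row-wise standardisation (each sample centered and scaled)
X <- t(scale(t(X)))
dim(X)[2]   # number of genes retained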
A list with the following components:
pXtrain |
the (ntrain x p') matrix containing the preprocessed train data. |
pXtest |
the (ntest x p') matrix containing the preprocessed test data. |
Sophie Lambert-Lacroix (http://membres-timc.imag.fr/Sophie.Lambert/) and Julie Peyre (https://membres-ljk.imag.fr/Julie.Peyre/).
Dudoit, S. and Fridlyand, J. and Speed, T. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, 97, 77–87.
# load plsgenomics library
library(plsgenomics)

# load Colon data
data(Colon)
IndexLearn <- c(sample(which(Colon$Y==2),27), sample(which(Colon$Y==1),14))
Xtrain <- Colon$X[IndexLearn,]
Ytrain <- Colon$Y[IndexLearn]
Xtest <- Colon$X[-IndexLearn,]

# preprocess data
resP <- preprocess(Xtrain=Xtrain, Xtest=Xtest,
                   Threshold=c(100,16000), Filtering=c(5,500),
                   log10.scale=TRUE, row.stand=TRUE)

# how many genes after preprocess ?
dim(resP$pXtrain)[2]
The function rpls performs prediction using the RPLS algorithm of Fort and Lambert-Lacroix (2005).
rpls(Ytrain, Xtrain, Lambda, ncomp, Xtest=NULL, NbIterMax=50)
Xtrain |
a (ntrain x p) data matrix of predictors. |
Ytrain |
a ntrain vector of responses. |
Xtest |
a (ntest x p) matrix containing the predictors for the test data
set. |
Lambda |
a positive real value, the ridge regularization parameter. |
ncomp |
a positive integer, the number of PLS components. |
NbIterMax |
a positive integer. |
The columns of the data matrices Xtrain and Xtest need not be standardized, since standardization is performed by the function rpls as a preliminary step before the algorithm is run.

The procedure described in Fort and Lambert-Lacroix (2005) is used to determine the latent components to be used for classification, and when Xtest is not NULL, the procedure predicts the labels for these new predictor variables.
A list with the following components:
Coefficients |
the (p+1) vector containing the coefficients weighting the design matrix. |
hatY |
the ntrain vector containing the estimated {0,1}-valued labels for the
observations from |
hatYtest |
the ntest vector containing the predicted {0,1}-valued labels for the
observations from |
proba |
the ntrain vector containing the estimated probabilities for the
observations from |
proba.test |
the ntest vector containing the predicted probabilities for the
observations from |
DeletedCol |
the vector containing the column number of |
hatYtest_k |
If |
Sophie Lambert-Lacroix (http://membres-timc.imag.fr/Sophie.Lambert/).
G. Fort and S. Lambert-Lacroix (2005). Classification using Partial Least Squares with Penalized Logistic Regression, Bioinformatics, vol 21, n 8, 1104-1111.
# load plsgenomics library
library(plsgenomics)

# load Colon data
data(Colon)
IndexLearn <- c(sample(which(Colon$Y==2),12), sample(which(Colon$Y==1),8))

# preprocess data
res <- preprocess(Xtrain=Colon$X[IndexLearn,], Xtest=Colon$X[-IndexLearn,],
                  Threshold=c(100,16000), Filtering=c(5,500),
                  log10.scale=TRUE, row.stand=TRUE)
# the results are given in res$pXtrain and res$pXtest

# perform prediction by RPLS
resrpls <- rpls(Ytrain=Colon$Y[IndexLearn]-1, Xtrain=res$pXtrain,
                Lambda=0.6, ncomp=1, Xtest=res$pXtest)
resrpls$hatY
sum(resrpls$hatYtest != (Colon$Y[-IndexLearn]-1))

# prediction for another sample
Xnew <- res$pXtest[1,]
# compute the linear predictor for each class except class 0
eta <- c(1,Xnew) %*% resrpls$Coefficients
Ypred <- which.max(c(0,eta))
Ypred+1
The function rpls.cv determines the best ridge regularization parameter and the best number of PLS components to be used for classification with the RPLS algorithm of Fort and Lambert-Lacroix (2005).
rpls.cv(Ytrain, Xtrain, LambdaRange, ncompMax, NbIterMax=50, ncores=1)
Xtrain |
a (ntrain x p) data matrix of predictors. |
Ytrain |
a ntrain vector of responses. |
LambdaRange |
the vector of positive real value from which the best ridge regularization parameter has to be chosen by cross-validation. |
ncompMax |
a positive integer. The best number of components is chosen from
1,..., |
NbIterMax |
a positive integer. |
ncores |
a positive integer. The number of cores to be used for parallel computing (if different from 1) |
A cross-validation procedure is used to determine the best ridge regularization parameter and number of PLS components to be used for classification with RPLS for binary data (for categorical data see mrpls and mrpls.cv). At each cross-validation run, Xtrain is split into a pseudo training set (ntrain-1 samples) and a pseudo test set (1 sample), and the classification error rate is determined for each value of the ridge regularization parameter and number of components. Finally, the function rpls.cv returns the values of the ridge regularization parameter and number of components for which the mean classification error rate is minimal.
A list with the following components:
Lambda |
the optimal regularization parameter. |
ncomp |
the optimal number of PLS components. |
Sophie Lambert-Lacroix (http://membres-timc.imag.fr/Sophie.Lambert/).
G. Fort and S. Lambert-Lacroix (2005). Classification using Partial Least Squares with Penalized Logistic Regression, Bioinformatics, vol 21, n 8, 1104-1111.
## Not run: 
## between 5~15 seconds
# load plsgenomics library
library(plsgenomics)

# load Colon data
data(Colon)
IndexLearn <- c(sample(which(Colon$Y==2),12), sample(which(Colon$Y==1),8))

# preprocess data
res <- preprocess(Xtrain=Colon$X[IndexLearn,], Xtest=Colon$X[-IndexLearn,],
                  Threshold=c(100,16000), Filtering=c(5,500),
                  log10.scale=TRUE, row.stand=TRUE)
# the results are given in res$pXtrain and res$pXtest

# determine optimum ncomp and Lambda
nl <- rpls.cv(Ytrain=Colon$Y[IndexLearn]-1, Xtrain=res$pXtrain,
              LambdaRange=c(0.1,1), ncompMax=3)

# perform prediction by RPLS
resrpls <- rpls(Ytrain=Colon$Y[IndexLearn]-1, Xtrain=res$pXtrain,
                Lambda=nl$Lambda, ncomp=nl$ncomp, Xtest=res$pXtest)
sum(resrpls$hatYtest != (Colon$Y[-IndexLearn]-1))

## End(Not run)
The function sample.bin generates a random sample of n observations, composed of p predictors collected in the (n x p) matrix X, and a binary response collected in a vector Y of length n, through a logistic model: the response Y is generated as a Bernoulli random variable with parameter logit^{-1}(XB), where the coefficients B are sparse. In addition, the covariate matrix X is composed of correlated blocks of predictors.
sample.bin(
  n, p, kstar, lstar, beta.min, beta.max,
  mean.H = 0, sigma.H = 1, sigma.F = 1, seed = NULL
)
n |
the number of observations in the sample. |
p |
the number of covariates in the sample. |
kstar |
the number of underlying latent variables used to generate
the covariate matrix |
lstar |
the number of blocks in the covariate matrix |
beta.min |
the inf bound for non null coefficients (see details). |
beta.max |
the sup bound for non null coefficients (see details). |
mean.H |
the mean of latent variables used to generate |
sigma.H |
the standard deviation of latent variables used to
generate |
sigma.F |
the standard deviation of the noise added to latent
variables used to generate |
seed |
a positive integer; if non-NULL, it fixes the seed (with the
command |
The set (1:p) of predictors is partitioned into kstar blocks. Each block k (k=1,...,kstar) depends on a latent variable H.k; the H.k are independent and identically distributed following a Gaussian distribution N(mean.H, sigma.H^2). Each column X.j of the matrix X is generated as H.k + F.j for j in block k, where the F.j are independent and identically distributed Gaussian noise N(0, sigma.F^2).

The coefficients B are generated at random between beta.min and beta.max on lstar randomly chosen blocks, and are null otherwise. The variables with non-null coefficients are then relevant to explain the response, whereas the ones with null coefficients are not.

The response is generated by drawing one observation from each of n Bernoulli random variables with parameters logit^{-1}(XB).

The details of the procedure are developed in Durif et al. (2018).
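A compact, illustrative reimplementation of this generative scheme may help clarify the roles of the parameters; the uniform draw for the non-null coefficients and the block assignment below are assumptions, not the package's exact code:

# illustrative sketch of the generative model described above
set.seed(42)
n <- 100; p <- 1000
kstar <- 20; lstar <- 2
beta.min <- 0.25; beta.max <- 0.75
mean.H <- 0.2; sigma.H <- 10; sigma.F <- 5
block <- rep(1:kstar, length.out=p)                     # block of each predictor
H <- matrix(rnorm(n*kstar, mean.H, sigma.H), n, kstar)  # latent variables H.k
X <- H[, block] + matrix(rnorm(n*p, 0, sigma.F), n, p)  # X.j = H.k + F.j
B <- numeric(p)
sel.blocks <- sample(1:kstar, lstar)                    # blocks carrying signal
sel <- which(block %in% sel.blocks)
B[sel] <- runif(length(sel), beta.min, beta.max)        # non-null coefficients (assumed uniform)
proba <- as.vector(1 / (1 + exp(-(X %*% B))))           # logit^{-1}(XB)
Y <- rbinom(n, size=1, prob=proba)                      # Bernoulli response
table(Y)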
An object with the following attributes:
X |
the (n x p) covariate matrix, containing the |
Y |
the (n) vector of Y observations. |
proba |
the n vector of Bernoulli parameters used to generate the
response, in particular |
sel |
the index in (1:p) of covariates with non null coefficients
in |
nosel |
the index in (1:p) of covariates with null coefficients
in |
B |
the (p) vector of coefficients. |
block.partition |
a (p) vector indicating the block of each predictor in (1:kstar). |
p |
the number of covariates in the sample. |
kstar |
the number of underlying latent variables used to generate
the covariate matrix |
lstar |
the number of blocks in the covariate matrix |
p0 |
the number of predictors with non null coefficients in |
block.sel |
a (lstar) vector indicating the index in (1:kstar) of
blocks with predictors having non null coefficient in |
beta.min |
the inf bound for non null coefficients (see details). |
beta.max |
the sup bound for non null coefficients (see details). |
mean.H |
the mean of latent variables used to generate |
sigma.H |
the standard deviation of latent variables used to
generate |
sigma.F |
the standard deviation of the noise added to latent
variables used to generate |
seed |
a positive integer; if non-NULL, it fixes the seed
(with the command |
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 1000
sample1 <- sample.bin(n=n, p=p, kstar=20, lstar=2,
                      beta.min=0.25, beta.max=0.75,
                      mean.H=0.2, sigma.H=10, sigma.F=5)
str(sample1)
The function sample.cont generates a random sample with p predictors X, a response Y, and n observations, through a linear model Y=XB+E, where the noise E is Gaussian, the coefficients B are sparse, and the design matrix X is composed of correlated blocks of predictors.
sample.cont(n, p, kstar, lstar, beta.min, beta.max, mean.H=0, sigma.H, sigma.F, sigma.E, seed=NULL)
n |
the number of observations in the sample. |
p |
the number of covariates in the sample. |
kstar |
the number of underlying latent variables used to generate the design matrix
|
lstar |
the number of blocks in the design matrix |
beta.min |
the inf bound for non null coefficients (see details). |
beta.max |
the sup bound for non null coefficients (see details). |
mean.H |
the mean of latent variables used to generate |
sigma.H |
the standard deviation of latent variables used to generate |
sigma.F |
the standard deviation of the noise added to latent variables used to
generate |
sigma.E |
the standard deviation of the noise in the linear model
|
seed |
a positive integer; if non-NULL, it fixes the seed (with the command
|
The set (1:p) of predictors is partitioned into kstar blocks. Each block k (k=1,...,kstar) depends on a latent variable H.k; the H.k are independent and identically distributed following a Gaussian distribution N(mean.H, sigma.H^2). Each column X.j of the matrix X is generated as H.k + F.j for j in block k, where the F.j are independent and identically distributed Gaussian noise N(0, sigma.F^2).

The coefficients B are generated at random between beta.min and beta.max on lstar randomly chosen blocks, and are null otherwise. The variables with non-null coefficients are then relevant to explain the response, whereas the ones with null coefficients are not.

The response is generated as Y = X %*% B + E, where E is Gaussian noise N(0, sigma.E^2).

The details of the procedure are developed in Durif et al. (2018).
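Only the last step differs from the binary case: the response follows the linear model with Gaussian noise. A self-contained sketch, under the same assumptions as the sample.bin sketch above:

# illustrative sketch of the linear response step of sample.cont
set.seed(1)
n <- 100; p <- 1000; kstar <- 20; lstar <- 2
block <- rep(1:kstar, length.out=p)
H <- matrix(rnorm(n*kstar, mean=0.2, sd=10), n, kstar)
X <- H[, block] + matrix(rnorm(n*p, sd=5), n, p)   # block-structured design
B <- numeric(p)
sel <- which(block %in% sample(1:kstar, lstar))
B[sel] <- runif(length(sel), 0.25, 0.75)           # sparse coefficients (assumed uniform)
E <- rnorm(n, sd=5)                                # Gaussian noise, sd = sigma.E
Y <- as.vector(X %*% B + E)                        # Y = XB + E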
A list with the following components:
X |
the (n x p) design matrix, containing the |
Y |
the (n) vector of Y observations. |
residuals |
the (n) vector corresponding to the noise |
sel |
the index in (1:p) of covariates with non null coefficients in |
nosel |
the index in (1:p) of covariates with null coefficients in |
B |
the (p) vector of coefficients. |
block.partition |
a (p) vector indicating the block of each predictor in (1:kstar). |
p |
the number of covariates in the sample. |
kstar |
the number of underlying latent variables used to generate the design matrix
|
lstar |
the number of blocks in the design matrix |
p0 |
the number of predictors with non null coefficients in |
block.sel |
a (lstar) vector indicating the index in (1:kstar) of blocks with predictors
having non null coefficient in |
beta.min |
the inf bound for non null coefficients (see details). |
beta.max |
the sup bound for non null coefficients (see details). |
mean.H |
the mean of latent variables used to generate |
sigma.H |
the standard deviation of latent variables used to generate |
sigma.F |
the standard deviation of the noise added to latent variables used to
generate |
sigma.E |
the standard deviation of the noise in the linear model. |
seed |
a positive integer; if non-NULL, it fixes the seed (with the command
|
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 1000
sample1 <- sample.cont(n=n, p=p, kstar=20, lstar=2,
                       beta.min=0.25, beta.max=0.75,
                       mean.H=0.2, sigma.H=10, sigma.F=5, sigma.E=5)
str(sample1)
The function sample.multinom generates a random sample of n observations, composed of p predictors collected in the (n x p) matrix X, and a multi-class response (with nb.class groups) collected in a vector Y of length n, through a multinomial logit model whose class probabilities depend on the linear predictors XB, where the coefficients B are sparse. In addition, the covariate matrix X is composed of correlated blocks of predictors.
sample.multinom(
  n, p, nb.class = 2, kstar, lstar, beta.min, beta.max,
  mean.H = 0, sigma.H = 1, sigma.F = 1, seed = NULL
)
n |
the number of observations in the sample. |
p |
the number of covariates in the sample. |
nb.class |
the number of groups in the data. |
kstar |
the number of underlying latent variables used to generate
the covariate matrix |
lstar |
the number of blocks in the covariate matrix |
beta.min |
the inf bound for non null coefficients (see details). |
beta.max |
the sup bound for non null coefficients (see details). |
mean.H |
the mean of latent variables used to generate |
sigma.H |
the standard deviation of latent variables used to
generate |
sigma.F |
the standard deviation of the noise added to latent
variables used to generate |
seed |
a positive integer; if non-NULL, it fixes the seed (with the
command |
The set (1:p) of predictors is partitioned into kstar blocks. Each block k (k=1,...,kstar) depends on a latent variable H.k; the H.k are independent and identically distributed following a Gaussian distribution N(mean.H, sigma.H^2). Each column X.j of the matrix X is generated as H.k + F.j for j in block k, where the F.j are independent and identically distributed Gaussian noise N(0, sigma.F^2).

The coefficients B are generated at random between beta.min and beta.max on lstar randomly chosen blocks, and are null otherwise. The variables with non-null coefficients are then relevant to explain the response, whereas the ones with null coefficients are not.

The response is generated by drawing, for each of the n observations, a class label with probabilities derived from the linear predictors XB (multinomial logit model).

The details of the procedure are developed in Durif et al. (2018).
An object with the following attributes:
X |
the (n x p) covariate matrix, containing the |
Y |
the (n) vector of Y observations. |
proba |
the n vector of Bernoulli parameters used to generate the
response, in particular |
sel |
the index in (1:p) of covariates with non null coefficients
in |
nosel |
the index in (1:p) of covariates with null coefficients
in |
B |
the (p) vector of coefficients. |
block.partition |
a (p) vector indicating the block of each predictor in (1:kstar). |
p |
the number of covariates in the sample. |
kstar |
the number of underlying latent variables used to generate
the covariate matrix |
lstar |
the number of blocks in the covariate matrix |
p0 |
the number of predictors with non null coefficients in |
block.sel |
a (lstar) vector indicating the index in (1:kstar) of
blocks with predictors having non null coefficient in |
beta.min |
the inf bound for non null coefficients (see details). |
beta.max |
the sup bound for non null coefficients (see details). |
mean.H |
the mean of latent variables used to generate |
sigma.H |
the standard deviation of latent variables used to
generate |
sigma.F |
the standard deviation of the noise added to latent
variables used to generate |
seed |
a positive integer; if non-NULL, it fixes the seed
(with the command |
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 1000
nclass <- 3
sample1 <- sample.multinom(n=n, p=p, nb.class=nclass, kstar=20, lstar=2,
                           beta.min=0.25, beta.max=0.75,
                           mean.H=0.2, sigma.H=10, sigma.F=5)
str(sample1)
The function spls performs compression and variable selection in the context of linear regression (with possible prediction), using the adaptive sparse PLS algorithm of Durif et al. (2018).
spls(
  Xtrain, Ytrain, lambda.l1, ncomp, weight.mat = NULL, Xtest = NULL,
  adapt = TRUE, center.X = TRUE, center.Y = TRUE,
  scale.X = TRUE, scale.Y = TRUE, weighted.center = FALSE
)
Xtrain |
a (ntrain x p) data matrix of predictor values.
|
Ytrain |
a (ntrain) vector of (continuous) responses. |
lambda.l1 |
a positive real value, in [0,1]. |
ncomp |
a positive integer. |
weight.mat |
a (ntrain x ntrain) matrix used to weight the l2 metric
in the observation space, it can be the covariance inverse of the Ytrain
observations in a heteroskedastic context. If NULL, the l2 metric is the
standard one, corresponding to homoskedastic model ( |
Xtest |
a (ntest x p) matrix containing the predictor values for the
test data set. |
adapt |
a boolean value, indicating whether the sparse PLS selection step should be adaptive or not (see details). |
center.X |
a boolean value indicating whether the data matrices
|
center.Y |
a boolean value indicating whether the response values
|
scale.X |
a boolean value indicating whether the data matrices
|
scale.Y |
a boolean value indicating whether the response values
|
weighted.center |
a boolean value indicating whether the centering should take into account the weighted l2 metric or not (if TRUE, it requires that weighted.mat is non NULL). |
The columns of the data matrices Xtrain and Xtest need not be standardized, since standardization can be performed by the function spls as a preliminary step.
The procedure described in Durif et al. (2018) is used to compute
latent sparse components that are used in a regression model.
In addition, when a matrix Xtest
is supplied, the procedure
predicts the response associated to these new values of the predictors.
An object of class spls
with the following attributes
Xtrain |
the ntrain x p predictor matrix. |
Ytrain |
the response observations. |
sXtrain |
the predictor matrix, centered and scaled (if requested). |
sYtrain |
the response, centered and scaled (if requested). |
betahat |
the linear coefficients in model
|
betahat.nc |
the (p+1) vector containing the coefficients and intercept
for the non centered and non scaled model
|
meanXtrain |
the (p) vector of Xtrain column means, used for centering (if requested). |
sigmaXtrain |
the (p) vector of Xtrain column standard deviations, used for scaling (if requested). |
meanYtrain |
the mean of Ytrain, used for centering (if requested). |
sigmaYtrain |
the standard deviation of Ytrain, used for scaling (if requested). |
X.score |
a (n x ncomp) matrix being the observations coordinates or
scores in the new component basis produced by the compression step
(sparse PLS). Each column t.k of |
X.score.low |
a (n x ncomp) matrix being the PLS components only computed with the selected predictors. |
X.loading |
the (ncomp x p) matrix of coefficients in regression of
Xtrain over the new components |
Y.loading |
the (ncomp) vector of coefficients in regression of Ytrain
over the SPLS components |
X.weight |
a (p x ncomp) matrix being the coefficients of predictors
in each components produced by sparse PLS. Each column w.k of
|
residuals |
the (ntrain) vector of residuals in the model
|
residuals.nc |
the (ntrain) vector of residuals in the non centered
and non scaled model
|
hatY |
the (ntrain) vector containing the estimated response values
on the train set of centered and scaled (if requested) predictors
|
hatY.nc |
the (ntrain) vector containing the estimated response values
on the train set of non centered and non scaled predictors |
hatYtest |
the (ntest) vector containing the predicted values
for the response on the centered and scaled test set |
hatYtest.nc |
the (ntest) vector containing the predicted values
for the response on the non centered and non scaled test set |
A |
the active set of predictors selected by the procedure. |
betamat |
a (ncomp) list of coefficient vector betahat in the model
with |
new2As |
a (ncomp) list of subset of |
lambda.l1 |
the sparse hyper-parameter used to fit the model. |
ncomp |
the number of components used to fit the model. |
V |
the (ntrain x ntrain) matrix used to weight the metric in the sparse PLS step. |
adapt |
a boolean value, indicating whether the sparse PLS selection step was adaptive or not. |
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
Adapted in part from spls code by H. Chun, D. Chung and S.Keles (https://CRAN.R-project.org/package=spls).
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
Chun, H., & Keles, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society. Series B (Methodological), 72(1), 3-25. doi:10.1111/j.1467-9868.2009.00723.x
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 100
sample1 <- sample.cont(n=n, p=p, kstar=10, lstar=2,
                       beta.min=0.25, beta.max=0.75,
                       mean.H=0.2, sigma.H=10, sigma.F=5, sigma.E=5)
X <- sample1$X
Y <- sample1$Y

### splitting between learning and testing set
index.train <- sort(sample(1:n, size=round(0.7*n)))
index.test <- (1:n)[-index.train]
Xtrain <- X[index.train,]
Ytrain <- Y[index.train,]
Xtest <- X[index.test,]
Ytest <- Y[index.test,]

### fitting the model, and predicting new observations
model1 <- spls(Xtrain=Xtrain, Ytrain=Ytrain, lambda.l1=0.5, ncomp=2,
               weight.mat=NULL, Xtest=Xtest, adapt=TRUE,
               center.X=TRUE, center.Y=TRUE,
               scale.X=TRUE, scale.Y=TRUE,
               weighted.center=FALSE)
str(model1)

### plotting the estimates versus the real values for the non centered response
plot(model1$Ytrain, model1$hatY.nc, xlab="real Ytrain", ylab="Ytrain estimates")
points(-1000:1000, -1000:1000, type="l")

### plotting residuals versus centered response values
plot(model1$sYtrain, model1$residuals, xlab="sYtrain", ylab="residuals")

### plotting the predictor coefficients
plot(model1$betahat.nc, xlab="variable index", ylab="coeff")

### mean squared error of prediction on the test sample
sYtest <- as.matrix(scale(Ytest, center=model1$meanYtrain, scale=model1$sigmaYtrain))
sum((model1$hatYtest - sYtest)^2) / length(index.test)

### plotting predicted values versus (centered) real response values
### on the test set
plot(model1$hatYtest, sYtest, xlab="predicted values", ylab="real Ytest")
points(-1000:1000, -1000:1000, type="l")
The function spls.cv chooses the optimal values for the hyper-parameters of the spls procedure, by minimizing the mean squared error of prediction over the hyper-parameter grid, using the adaptive SPLS algorithm of Durif et al. (2018).
spls.cv(
  X, Y, lambda.l1.range, ncomp.range, weight.mat = NULL, adapt = TRUE,
  center.X = TRUE, center.Y = TRUE, scale.X = TRUE, scale.Y = TRUE,
  weighted.center = FALSE, return.grid = FALSE, ncores = 1,
  nfolds = 10, nrun = 1, verbose = FALSE
)
X |
a (n x p) data matrix of predictors. |
Y |
a (n) vector of (continuous) responses. |
lambda.l1.range |
a vector of positive real values, in [0,1].
|
ncomp.range |
a vector of positive integers. |
weight.mat |
a (ntrain x ntrain) matrix used to weight the l2 metric
in the observation space, it can be the covariance inverse of the Ytrain
observations in a heteroskedastic context. If NULL, the l2 metric is the
standard one, corresponding to homoskedastic model ( |
adapt |
a boolean value, indicating whether the sparse PLS selection step should be adaptive or not (see details). |
center.X |
a boolean value indicating whether the data matrices
|
center.Y |
a boolean value indicating whether the response values
|
scale.X |
a boolean value indicating whether the data matrices
|
scale.Y |
a boolean value indicating whether the response values
|
weighted.center |
a boolean value indicating whether the centering should take into account the weighted l2 metric or not (if TRUE, it requires that weighted.mat is non NULL). |
return.grid |
a boolean value indicating whether the grid of hyper-parameter values with the corresponding mean prediction error rate over the folds should be returned or not. |
ncores |
a positive integer, indicating the number of cores that the cross-validation is allowed to use for parallel computation (see details). |
nfolds |
a positive integer indicating the number of folds in the
K-folds cross-validation procedure, |
nrun |
a positive integer indicating how many times the K-folds cross-validation procedure should be repeated, default is 1. |
verbose |
a boolean value indicating verbosity. |
The columns of the data matrix X need not be standardized, since standardization can be performed by the function spls.cv as a preliminary step.
The procedure is described in Durif et al. (2018). The K-fold cross-validation can be summarized as follows: the train set is partitioned into K folds; for each value of the hyper-parameters, the model is fit K times, each time using one fold to compute the prediction error and fitting the model on the remaining observations. The cross-validation procedure returns the optimal hyper-parameter values, meaning the ones that minimize the mean squared error of prediction averaged over all the folds.
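To make the scheme concrete, here is a hand-rolled evaluation of a single (lambda.l1, ncomp) cell of the grid; spls.cv repeats this over the whole grid (and possibly nrun times) in parallel, so the sketch below is illustrative rather than its exact code:

# one cell of the K-fold cross-validation grid, by hand (illustrative)
library(plsgenomics)
set.seed(1)
sample1 <- sample.cont(n=100, p=100, kstar=10, lstar=2, beta.min=0.25,
                       beta.max=0.75, mean.H=0.2, sigma.H=10, sigma.F=5,
                       sigma.E=5)
X <- sample1$X
Y <- as.vector(sample1$Y)
K <- 10
fold <- sample(rep(1:K, length.out=nrow(X)))  # random partition into K folds
msep <- sapply(1:K, function(k) {
    # fit on all folds but k, predict fold k
    fit <- spls(Xtrain=X[fold != k,], Ytrain=Y[fold != k],
                lambda.l1=0.5, ncomp=2, Xtest=X[fold == k, , drop=FALSE])
    sYtest <- scale(Y[fold == k], center=fit$meanYtrain, scale=fit$sigmaYtrain)
    mean((fit$hatYtest - sYtest)^2)
})
mean(msep)  # mean squared error of prediction for this (lambda.l1, ncomp) pair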
This procedure uses mclapply from the parallel package, available on GNU/Linux and MacOS. Users of Microsoft Windows can refer to the README file in the source to be able to use an mclapply-type function.
An object with the following attributes
lambda.l1.opt |
the optimal value in |
ncomp.opt |
the optimal value in |
cv.grid |
the grid of hyper-parameters and corresponding prediction
error rate over the folds.
|
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
## Not run: 
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 100
sample1 <- sample.cont(n=n, p=p, kstar=10, lstar=2,
                       beta.min=0.25, beta.max=0.75,
                       mean.H=0.2, sigma.H=10, sigma.F=5, sigma.E=5)
X <- sample1$X
Y <- sample1$Y

### hyper-parameters values to test
lambda.l1.range <- seq(0.05,0.95,by=0.1) # between 0 and 1
ncomp.range <- 1:10

### tuning the hyper-parameters
cv1 <- spls.cv(X=X, Y=Y, lambda.l1.range=lambda.l1.range,
               ncomp.range=ncomp.range, weight.mat=NULL, adapt=TRUE,
               center.X=TRUE, center.Y=TRUE, scale.X=TRUE, scale.Y=TRUE,
               weighted.center=FALSE, return.grid=TRUE,
               ncores=1, nfolds=10, nrun=1)
str(cv1)

### optimal values
cv1$lambda.l1.opt
cv1$ncomp.opt

## End(Not run)
The function spls.stab trains a sparse PLS model for each candidate value (ncomp, lambda.l1) of the hyper-parameters, on multiple sub-samplings of the data. The stability selection procedure selects the covariates that are selected by most of the models among the grid of hyper-parameters, following the procedure described in Durif et al. (2018). Candidate values for ncomp and lambda.l1 are respectively given by the input arguments ncomp.range and lambda.l1.range.
spls.stab(
  X, Y, lambda.l1.range, ncomp.range, weight.mat = NULL, adapt = TRUE,
  center.X = TRUE, center.Y = TRUE, scale.X = TRUE, scale.Y = TRUE,
  weighted.center = FALSE, ncores = 1, nresamp = 100,
  seed = NULL, verbose = TRUE
)
X |
a (n x p) data matrix of predictors. |
Y |
a (n) vector of (continuous) responses. |
lambda.l1.range |
a vector of positive real values, in [0,1].
|
ncomp.range |
a vector of positive integers. |
weight.mat |
a (ntrain x ntrain) matrix used to weight the l2 metric
in the observation space, it can be the covariance inverse of the Ytrain
observations in a heteroskedastic context. If NULL, the l2 metric is the
standard one, corresponding to homoskedastic model ( |
adapt |
a boolean value, indicating whether the sparse PLS selection step should be adaptive or not (see details). |
center.X |
a boolean value indicating whether the data matrices
|
center.Y |
a boolean value indicating whether the response values
|
scale.X |
a boolean value indicating whether the data matrices
|
scale.Y |
a boolean value indicating whether the response values
|
weighted.center |
a boolean value indicating whether the centering should take into account the weighted l2 metric or not (if TRUE, it requires that weighted.mat is non NULL). |
ncores |
a positve integer, indicating the number of cores that the cross-validation is allowed to use for parallel computation (see details). |
nresamp |
number of resamplings of the data to estimate the probability of selection for each covariate, default is 100. |
seed |
a positive integer value (default is NULL). If non NULL, the seed for pseudo-random number generation is set accordingly. |
verbose |
a boolean parameter indicating the verbosity. |
The columns of the data matrix X need not be standardized, since standardization is performed by the function spls.stab as a preliminary step.
The procedure is described in Durif et al. (2018). The stability selection procedure can be summarized as follows (c.f. Meinshausen and Buhlmann, 2010).
(i) For each candidate value (ncomp, lambda.l1) of the hyper-parameters, a sparse PLS model is trained on nresamp resamplings of the data. Then, for each pair (ncomp, lambda.l1), the probability that a covariate (i.e. a column in X) is selected is computed among the resamplings.
(ii) Eventually, the set of "stable selected" variables corresponds to the set of covariates that were selected by most of the trained models among the grid of hyper-parameter candidate values.
This function achieves the first step (i) of the stability selection
procedure. The second step (ii) is achieved by the function
stability.selection
.
This procedure uses mclapply from the parallel package, available on GNU/Linux and MacOS. Users of Microsoft Windows can refer to the README file in the source to be able to use an mclapply-type function.
An object with the following attributes:
q.Lambda |
A table with values of q.Lambda (cf. Durif et al., 2018, for the notation), i.e. the average number of covariates selected over the entire grid of hyper-parameter candidate values, for increasing sizes of the hyper-parameter grid. |
probs.lambda |
A table with the estimated probability of selection for each covariate, depending on the hyper-parameter candidate values. |
p |
An integer indicating the number of covariates in the model. |
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
Meinshausen, N., Buhlmann P. (2010). Stability Selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, no. 4, 417-473.
spls
, stability.selection
,
stability.selection.heatmap
## Not run: 
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 100
sample1 <- sample.cont(n=n, p=p, kstar=10, lstar=2, beta.min=0.25, 
                       beta.max=0.75, mean.H=0.2, sigma.H=10, 
                       sigma.F=5, sigma.E=5)
X <- sample1$X
Y <- sample1$Y

### hyper-parameters values to test
lambda.l1.range <- seq(0.05, 0.95, by=0.1) # between 0 and 1
ncomp.range <- 1:10

### tuning the hyper-parameters
stab1 <- spls.stab(X=X, Y=Y, lambda.l1.range=lambda.l1.range, 
                   ncomp.range=ncomp.range, adapt=TRUE, 
                   ncores=1, nresamp=100)
str(stab1)

### heatmap of estimated probabilities
stability.selection.heatmap(stab1)

### selected covariates
stability.selection(stab1, piThreshold=0.6, rhoError=10)

## End(Not run)
Gene expression data (2308 genes for 83 samples) from the microarray experiments of the Small Round Blue Cell Tumors (SRBCT) of childhood cancer study of Khan et al. (2001).
data(SRBCT)
This data set contains 83 samples with 2308 genes: 29 cases of Ewing sarcoma (EWS), coded 1, 11 cases of Burkitt lymphoma (BL), coded 2, 18 cases of neuroblastoma (NB), coded 3, and 25 cases of rhabdomyosarcoma (RMS), coded 4. A total of 63 training samples and 25 test samples were provided in Khan et al. (2001). Five of the test samples are non-SRBCT and are not considered here. The training sample indexes correspond to 1:65 and the test sample indexes (without the non-SRBCT samples) correspond to 66:83.
A list with the following elements:
X |
a (83 x 2308) matrix giving the expression levels of 2308 genes for the 83 SRBCT patients. Each row corresponds to a patient, each column to a gene. |
Y |
a numeric vector of length 83 giving the cancer class of each patient. |
gene.names |
a matrix containing the names of the 2308 genes for the gene
expression matrix |
The data are described in Khan et al. (2001). The data was originally
collected from http://research.nhgri.nih.gov/microarray/Supplement/
but the URL is not working anymore and we could not find a
replacement link in 2024.
Khan, J. and Wei, J. S. and Ringner, M. and Saal, L. H. and Ladanyi, M. and Westermann, F. and Berthold, F. and Schwab, M. and Antonescu, C. R. and Peterson, C. and Meltzer, P. S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nature Medicine, 7, 673–679.
# load plsgenomics library
library(plsgenomics)

# load data set
data(SRBCT)

# how many samples and how many genes?
dim(SRBCT$X)

# how many samples of class 1, 2, 3 and 4, respectively?
sum(SRBCT$Y==1)
sum(SRBCT$Y==2)
sum(SRBCT$Y==3)
sum(SRBCT$Y==4)
The function stability.selection
returns the list of covariates selected
by the stability selection procedure described in
Durif et al. (2018). In particular, it selects the covariates that are
selected by most of the sparse PLS, logit-SPLS or multinomial-SPLS models
when exploring the grid of hyper-parameter candidate values.
stability.selection(stab.out, piThreshold = 0.9, rhoError = 10)
stab.out |
the output of the functions |
piThreshold |
a value in (0,1], corresponding to the threshold probability used to select covariates (cf. Durif et al., 2018). |
rhoError |
a positive value used to restrict the grid of hyper-parameter candidate values (c.f. Durif et al., 2018). |
The procedure is described in Durif et al. (2018). The stability selection procedure can be summarized as follows (cf. Meinshausen and Buhlmann, 2010).
(i) For each candidate value of the hyper-parameters, a model is trained
on nresamp
resamplings of the data. Then, for each candidate value of
the hyper-parameters, the probability that a covariate
(i.e. a column in X
) is selected is computed across the resamplings.
The estimated probabilities can be visualized as a heatmap with the
function stability.selection.heatmap
.
(ii) Eventually, the set of "stable selected" variables corresponds to the
set of covariates that were selected by most of the trained models across
the grid of hyper-parameter candidate values, based on a threshold
probability piThreshold
and a restriction of the grid of hyper-parameters based
on rhoError
(cf. Durif et al., 2018, for details).
This function achieves the second step (ii) of the stability selection
procedure. The first step (i) is achieved by the functions
spls.stab
, logit.spls.stab
or multinom.spls.stab
.
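As a conceptual sketch of step (ii) (illustrative only, not the package internals), assume a simulated matrix probs of estimated selection probabilities, with one row per hyper-parameter candidate and one column per covariate, and a logical vector kept.grid standing in for the rhoError restriction of the grid; the stable covariates are those whose maximal probability on the reduced grid exceeds piThreshold:

n.grid <- 50
p <- 20
set.seed(1)
probs <- matrix(runif(n.grid * p), n.grid, p)
kept.grid <- runif(n.grid) < 0.5   # stand-in for the rhoError restriction
piThreshold <- 0.9
## maximal selection probability of each covariate on the reduced grid
max.probs <- apply(probs[kept.grid, , drop=FALSE], 2, max)
## indices of the "stable selected" covariates
which(max.probs > piThreshold)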
An object with the following attributes:
selected.predictors |
The list of names of the selected covariates. |
max.probs |
The corresponding estimated probabilities of selection for each covariate, i.e. the maximal values on the reduced grid of hyper-parameters. |
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
Durif, G., Modolo, L., Michaelsson, J., Mold, J.E., Lambert-Lacroix, S., Picard, F., 2018. High dimensional classification with combined adaptive sparse PLS and logistic regression. Bioinformatics 34, 485–493. doi:10.1093/bioinformatics/btx571. Available at http://arxiv.org/abs/1502.05933.
Meinshausen, N., Buhlmann P. (2010). Stability Selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, no. 4, 417-473.
spls.stab
, logit.spls.stab
,
multinom.spls.stab
, stability.selection.heatmap
## Not run: 
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 100
sample1 <- sample.cont(n=n, p=p, kstar=10, lstar=2, beta.min=0.25, 
                       beta.max=0.75, mean.H=0.2, sigma.H=10, 
                       sigma.F=5, sigma.E=5)
X <- sample1$X
Y <- sample1$Y

### hyper-parameters values to test
lambda.l1.range <- seq(0.05, 0.95, by=0.1) # between 0 and 1
ncomp.range <- 1:10

### tuning the hyper-parameters
stab1 <- spls.stab(X=X, Y=Y, lambda.l1.range=lambda.l1.range, 
                   ncomp.range=ncomp.range, weight.mat=NULL, 
                   adapt=FALSE, center.X=TRUE, center.Y=TRUE, 
                   scale.X=TRUE, scale.Y=TRUE, weighted.center=FALSE, 
                   ncores=1, nresamp=100)
str(stab1)

### selected covariates
stability.selection(stab1, piThreshold=0.6, rhoError=10)

## End(Not run)
The function stability.selection.heatmap
allows one to visualize
the estimated probability of selection for each covariate, depending on the
hyper-parameter values, in the SPLS, logit-SPLS or multinomial-SPLS models.
These estimated probabilities are used in the stability selection procedure
described in Durif et al. (2018).
stability.selection.heatmap(stab.out, ...)
stab.out |
the output of the functions |
... |
any argument that can be passed to the functions
|
The procedure is described in Durif et al. (2018). The stability selection procedure can be summarized as follows (cf. Meinshausen and Buhlmann, 2010).
(i) For each candidate value of the hyper-parameters, a model is trained
on nresamp
resamplings of the data. Then, for each candidate value of
the hyper-parameters, the probability that a covariate
(i.e. a column in X
) is selected is computed across the resamplings.
The estimated probabilities can be visualized as a heatmap with the
function stability.selection.heatmap
.
(ii) Eventually, the set of "stable selected" variables corresponds to the
set of covariates that were selected by most of the trained models across
the grid of hyper-parameter candidate values, based on a threshold
probability piThreshold
and a restriction of the grid of hyper-parameters based
on rhoError
(cf. Durif et al., 2018, for details).
This function visualizes the probabilities estimated at the first
step (i) of the stability selection by the functions spls.stab
,
logit.spls.stab
or multinom.spls.stab
.
This function relies internally on the function matrix.heatmap
.
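For illustration only (with simulated probabilities rather than an actual stab.out object), any numeric matrix can be displayed with matrix.heatmap:

library(plsgenomics)
set.seed(1)
## toy (hyper-parameter grid x covariate) matrix of simulated probabilities
probs <- matrix(runif(10 * 20), nrow=10, ncol=20)
matrix.heatmap(probs)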
No return value; the heatmap is plotted in the current graphics device.
Ghislain Durif (https://gdurif.perso.math.cnrs.fr/).
logit.spls
, stability.selection
,
stability.selection.heatmap
## Not run: 
### load plsgenomics library
library(plsgenomics)

### generating data
n <- 100
p <- 100
sample1 <- sample.cont(n=n, p=p, kstar=10, lstar=2, beta.min=0.25, 
                       beta.max=0.75, mean.H=0.2, sigma.H=10, 
                       sigma.F=5, sigma.E=5)
X <- sample1$X
Y <- sample1$Y

### hyper-parameters values to test
lambda.l1.range <- seq(0.05, 0.95, by=0.1) # between 0 and 1
ncomp.range <- 1:10

### tuning the hyper-parameters
stab1 <- spls.stab(X=X, Y=Y, lambda.l1.range=lambda.l1.range, 
                   ncomp.range=ncomp.range, weight.mat=NULL, 
                   adapt=FALSE, center.X=TRUE, center.Y=TRUE, 
                   scale.X=TRUE, scale.Y=TRUE, weighted.center=FALSE, 
                   ncores=1, nresamp=100)
str(stab1)

### heatmap of estimated probabilities
stability.selection.heatmap(stab1)

## End(Not run)
The function TFA.estimate
estimates the transcription factor activities from gene
expression data and ChIP data using the PLS multivariate regression approach described
in Boulesteix and Strimmer (2005).
TFA.estimate(CONNECdata, GEdata, ncomp=NULL, nruncv=0, alpha=2/3, unit.weights=TRUE)
CONNECdata |
a (n x p) matrix containing the ChIP data for the n genes and the
p predictors. The n genes must be the same as the n genes of |
GEdata |
a (n x m) matrix containing the gene expression levels of the n
considered genes for m samples. Each row of |
ncomp |
if |
nruncv |
the number of cross-validation iterations to be performed for the choice of
the number of latent components. If |
alpha |
the proportion of genes to be included in the training set for the cross-validation procedure. |
unit.weights |
If |
The gene expression data as well as the ChIP data are assumed to have been
properly normalized. However, they do not have to be centered or scaled, since
centering and scaling are performed by the function TFA.estimate
as a
preliminary step.
The matrix CONNECdata
containing the ChIP data for the n genes and p transcription
factors might be replaced by any 'connectivity' matrix whose entries give the strength
of the interactions between the genes and transcription factors. For instance, a connectivity
matrix obtained by aggregating qualitative information from various genomic databases
might be used as argument in place of ChIP data.
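As a toy sketch of this substitution (simulated data, not from the original documentation), a qualitative connectivity matrix with entries in {-1, 0, 1}, in the same spirit as Ecoli$CONNECdata, can be passed directly in place of ChIP measurements:

library(plsgenomics)
set.seed(1)
n <- 50   # genes
p <- 5    # transcription factors
m <- 10   # samples
## signed qualitative connectivity matrix (mostly zeros)
conn <- matrix(sample(c(-1, 0, 1), n * p, replace=TRUE,
                      prob=c(0.1, 0.8, 0.1)), n, p)
## toy normalized gene expression data
ge <- matrix(rnorm(n * m), n, m)
TFA.estimate(conn, ge, ncomp=2, nruncv=0)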
A list with the following components:
TFA |
a (p x m) matrix containing the estimated transcription factor activities for the p transcription factors and the m samples. |
metafactor |
a (m x |
ncomp |
the number of latent components used in the PLS regression. |
Anne-Laure Boulesteix (https://www.ibe.med.uni-muenchen.de/mitarbeiter/professoren/boulesteix/index.html) and Korbinian Strimmer (https://strimmerlab.github.io/korbinian.html).
A. L. Boulesteix and K. Strimmer (2005). Predicting Transcription Factor Activities from Combined Analysis of Microarray and ChIP Data: A Partial Least Squares Approach. Theoretical Biology and Medical Modelling 2:23.
A. L. Boulesteix, K. Strimmer (2007). Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics 7:32-44.
S. de Jong (1993). SIMPLS: an alternative approach to partial least squares regression, Chemometrics Intell. Lab. Syst. 18, 251–263.
pls.regression
, pls.regression.cv
.
# load plsgenomics library
library(plsgenomics)

# load Ecoli data
data(Ecoli)

# estimate TFAs based on 3 latent components
TFA.estimate(Ecoli$CONNECdata, Ecoli$GEdata, ncomp=3, nruncv=0)

# estimate TFAs and determine the best number of latent components simultaneously
TFA.estimate(Ecoli$CONNECdata, Ecoli$GEdata, ncomp=1:5, nruncv=20)
The function variable.selection
performs variable selection for binary classification.
variable.selection(X, Y, nvar=NULL)
X |
a (n x p) data matrix of predictors. X may be a matrix or a data frame. Each row corresponds to an observation and each column corresponds to a predictor variable. |
Y |
a vector of length n giving the classes of the n observations. The two classes must be coded as 1 and 2. |
nvar |
the number of variables to be returned. If |
The function variable.selection
orders the variables according to
the absolute value of the weight defining the first PLS
component. This ordering is equivalent to the ordering obtained with the
F-statistic or with the t-test assuming equal variances (Boulesteix, 2004).
For computational reasons, the function variable.selection
does not run
the PLS algorithm itself, but the resulting ordering of the variables is
exactly equivalent to the ordering obtained from the PLS weights output by
pls.regression
.
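This equivalence can be sketched as a quick sanity check (tie-handling may differ slightly), comparing the ordering by absolute equal-variance t-statistic with the ordering returned by variable.selection:

library(plsgenomics)
data(leukemia)
X <- leukemia$X
Y <- leukemia$Y
## absolute two-sample t-statistic (equal variances) for each gene
tstat <- apply(X, 2, function(x) {
    t.test(x[Y == 1], x[Y == 2], var.equal = TRUE)$statistic
})
order.t <- order(abs(tstat), decreasing = TRUE)
order.pls <- variable.selection(X, Y)
## the two orderings should coincide (up to ties)
all.equal(order.t, order.pls)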
A vector of length nvar
(or of length p if nvar=NULL
) containing the indices of
the variables to be selected. The variables are ordered from best to worst.
Anne-Laure Boulesteix (https://www.ibe.med.uni-muenchen.de/mitarbeiter/professoren/boulesteix/index.html)
A. L. Boulesteix (2004). PLS dimension reduction for classification with microarray data, Statistical Applications in Genetics and Molecular Biology 3, Issue 1, Article 33.
A. L. Boulesteix, K. Strimmer (2007). Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Briefings in Bioinformatics 7:32-44.
S. de Jong (1993). SIMPLS: an alternative approach to partial least squares regression, Chemometrics Intell. Lab. Syst. 18, 251–263.
# load plsgenomics library
library(plsgenomics)

# generate X and Y (4 observations and 3 variables)
X <- matrix(c(4,3,3,4,1,0,6,7,3,5,5,9), 4, 3, byrow=FALSE)
Y <- c(1,1,2,2)

# select the 2 best variables
variable.selection(X, Y, nvar=2)

# order the 3 variables
variable.selection(X, Y)

# load the leukemia data
data(leukemia)

# select the 50 best variables from the leukemia data
variable.selection(leukemia$X, leukemia$Y, nvar=50)