Cross validation for L2E sparse regression with distance penalization

CV_L2E_sparse_dist performs k-fold cross-validation for robust sparse regression under the L2 criterion with distance penalty

CV_L2E_sparse_dist(
  y,
  X,
  beta0,
  tau0,
  kSeq,
  rhoSeq,
  nfolds = 5,
  seed = 1234,
  method = "median",
  max_iter = 100,
  tol = 1e-04,
  trace = TRUE
)

Arguments

y	Response vector
X	Design matrix
beta0	Initial vector of regression coefficients, can be omitted
tau0	Initial precision estimate, can be omitted
kSeq	A sequence of tuning parameter k, the number of nonzero entries in the estimated coefficients
rhoSeq	A sequence of tuning parameter rho, can be omitted
nfolds	The number of cross-validation folds. Default is 5.
seed	Users can set the seed of the random number generator to obtain reproducible results.
method	Median or mean to compute the objective
max_iter	Maximum number of iterations
tol	Relative tolerance
trace	Whether to trace the progress of the cross-validation

Value

Returns a list object containing the mean and standard error of the cross-validation error (vectors) -- CVE and CVSE -- for each value of k, the index of the k value with the minimum CVE and the k value itself (scalars), the index of the k value with the 1SE CVE and the k value itself (scalars), the sequence of rho and k used in the regression (vectors), and a vector listing which fold each element of y was assigned to

Examples

## Completes in 15 seconds

set.seed(12345)
n <- 100
tau <- 1
f <- matrix(c(rep(2,5), rep(0,45)), ncol = 1)
X <- X0 <- matrix(rnorm(n*50), nrow = n)
y <- y0 <- X0 %*% f + (1/tau)*rnorm(n)

## Clean Data
k <- c(6,5,4)
cv <- CV_L2E_sparse_dist(y=y, X=X, kSeq=k, nfolds=2, seed=1234)
#> Starting CV fold #1
#> Starting CV fold #2
(k_min <- cv$k.min) ## selected number of nonzero entries
#> [1] 5

sol <- L2E_sparse_dist(y=y, X=X, kSeq=k_min)
#>    user  system elapsed 
#>   0.756   0.000   0.755 
r <- y - X %*% sol$Beta
ix <- which(abs(r) > 3/sol$Tau)
l2e_fit <- X %*% sol$Beta

plot(y, l2e_fit, ylab='Predicted values', pch=16, cex=0.8)
points(y[ix], l2e_fit[ix], pch=16, col='blue', cex=0.8)

## Contaminated Data
i <- 1:5
y[i] <- 2 + y0[i]
X[i,] <- 2 + X0[i,]

cv <- CV_L2E_sparse_dist(y=y, X=X, kSeq=k, nfolds=2, seed=1234)
#> Starting CV fold #1
#> Starting CV fold #2
(k_min <- cv$k.min) ## selected number of nonzero entries
#> [1] 5

sol <- L2E_sparse_dist(y=y, X=X, kSeq=k_min)
#>    user  system elapsed 
#>   0.791   0.000   0.791 
r <- y - X %*% sol$Beta
ix <- which(abs(r) > 3/sol$Tau)
l2e_fit <- X %*% sol$Beta

plot(y, l2e_fit, ylab='Predicted values', pch=16, cex=0.8)
points(y[ix], l2e_fit[ix], pch=16, col='blue', cex=0.8)