
[RsR] Singular covariance in plot.lmrob

Thanks Christian for reminding me of this issue. It was discussed in 
Banff last year, and it may in principle happen any time you have a 
categorical explanatory variable in your model, as the design matrix 
becomes sparse and sub-sampling search algorithms tend to produce too 
many singular subsamples of size p+1.
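
To make the point above concrete, here is a minimal sketch (in Python, standard library only) that estimates how often a random subsample contains a constant dummy column. A constant dummy column is sufficient for the subsample's covariance matrix to be singular, so the estimate is only a lower bound on the rate of singular subsamples. The function name, the level probabilities, and the simulated data are all assumptions for illustration; this is not the search actually used by the MCD code.

```python
import random

def constant_dummy_rate(n_obs, level_probs, subsample_size, n_trials, seed=0):
    """Estimate how often a random subsample has a constant dummy column.

    A constant column makes the subsample's covariance matrix singular,
    so the returned rate is a lower bound on the rate of singular subsamples.
    """
    rng = random.Random(seed)
    levels = list(range(len(level_probs)))
    # One factor level per observation (simulated, hypothetical data).
    factor = rng.choices(levels, weights=level_probs, k=n_obs)
    bad = 0
    for _ in range(n_trials):
        idx = rng.sample(range(n_obs), subsample_size)
        sub = [factor[i] for i in idx]
        # Treatment coding: level 0 is the baseline, so only levels 1.. get
        # a dummy column. That column is constant if the level is absent
        # from the subsample or if every subsampled observation has it.
        counts = [sub.count(j) for j in levels[1:]]
        if any(c == 0 or c == subsample_size for c in counts):
            bad += 1
    return bad / n_trials

# Example: 392 observations, subsamples of size p+1 = 6.
rare = constant_dummy_rate(392, [0.97, 0.03], 6, 2000, seed=1)
balanced = constant_dummy_rate(392, [0.5, 0.5], 6, 2000, seed=1)
print(rare, balanced)
```

With a rare factor level most subsamples miss that level entirely, while a balanced two-level factor causes far fewer degenerate draws.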

I am not sure that this can be fixed by lowering the BP in the current 
MCD algorithm. Note how your example fails with a message that 14 (out 
of 392) obs. are on a lower-dimensional hyperplane. Shouldn't we be 
considering samples of size ~ 200? I believe this error message may be 
more related to the random subsampling search than the BP of the target 
estimator. Maybe Valentin can help me understand what is happening here.

For the linear regression case, I would argue the following: since 
Mahalanobis distances can be hard to interpret for categorical 
variables, one possibility would be to simply remove these "factor" 
variables when calculating the distances for the plot. Sometimes, 
however, the user may have already "coded" the factors into columns of 0's 
and 1's (instead of using proper factor variables in the formula), which 
would be a more difficult case to protect against.
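
As a rough illustration of why the hand-coded case is awkward, here is a sketch (Python, standard library only) of one possible heuristic: keep only the design-matrix columns that are not purely 0/1 coded before computing the distances. The function name is hypothetical, and the heuristic is an assumption of mine, not what plot.lmrob does; it would also drop an intercept column of 1's and any genuine binary numeric covariate, which is part of why this case is hard to protect against.

```python
def non_dummy_columns(rows):
    """Return indices of columns that are not purely 0/1 coded.

    `rows` is a list of equal-length numeric rows (a design matrix).
    """
    n_cols = len(rows[0])
    keep = []
    for j in range(n_cols):
        values = {row[j] for row in rows}
        # A column whose values all lie in {0, 1} is treated as a
        # (possibly hand-coded) dummy and excluded.
        if not values <= {0, 1}:
            keep.append(j)
    return keep

# Example: intercept, one continuous covariate, one hand-coded dummy.
design = [
    [1, 2.5, 0],
    [1, 3.1, 1],
    [1, 0.7, 0],
]
print(non_dummy_columns(design))
```

Only the continuous covariate would then enter the Mahalanobis distance calculation.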

For the more general multivariate location/scatter problem, I believe 
the default "failing" behaviour of the MCD algorithm may need to be 
revisited, since, as you mention, one may still want to get a (singular) 
covariance matrix estimator when half the data are lying on a 
lower-dimensional hyperplane. While we've had this conversation in the 
past, we never reached much of a consensus. Maybe it is time to try again.

Matias
Christian Hennig wrote: