Skip to content

[patch] Support many columns in model.matrix

3 messages · Karl Millar, Martin Maechler

#
Generating a model matrix with very large numbers of columns overflows
the stack and/or runs very slowly, due to the implementation of
TrimRepeats().

This patch modifies it to use Rf_duplicated() to find the duplicates.
This makes the running time linear in the number of columns and
eliminates the recursive function calls.

Thanks
-------------- next part --------------
A non-text attachment was scrubbed...
Name: stats_model.patch
Type: text/x-patch
Size: 2182 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20160226/b54f1cc2/attachment.bin>
2 days later
#
> Generating a model matrix with very large numbers of
    > columns overflows the stack and/or runs very slowly, due
    > to the implementation of TrimRepeats().

    > This patch modifies it to use Rf_duplicated() to find the
    > duplicates.  This makes the running time linear in the
    > number of columns and eliminates the recursive function
    > calls.

Thank you, Karl.
I've committed this (very slightly modified) to R-devel,

(also after looking for a an example that runs on a non-huge
 computer and shows the difference) :

nF <- 11 ; set.seed(1)
lff <- setNames(replicate(nF, as.factor(rpois(128, 1/4)), simplify=FALSE), letters[1:nF])
str(dd <- as.data.frame(lff)); prod(sapply(dd, nlevels))
## 'data.frame':	128 obs. of  11 variables:
##  $ a: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 2 2 1 1 1 ...
##  $ b: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 2 1 1 1 ...
##  $ c: Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 2 1 1 ...
##  $ d: Factor w/ 3 levels "0","1","2": 1 1 2 2 1 2 1 1 2 1 ...
##  $ e: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 2 1 ...
##  $ f: Factor w/ 2 levels "0","1": 2 1 2 1 2 1 1 2 1 2 ...
##  $ g: Factor w/ 4 levels "0","1","2","3": 2 1 1 2 1 3 1 1 1 1 ...
##  $ h: Factor w/ 4 levels "0","1","2","4": 1 1 1 1 2 1 1 1 1 1 ...
##  $ i: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 2 ...
##  $ j: Factor w/ 3 levels "0","1","2": 1 2 3 1 1 1 1 1 1 1 ...
##  $ k: Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## 
## [1] 139968

system.time(mff <- model.matrix(~ . ^ 11, dd, contrasts = list(a = "contr.helmert")))
##  user  system elapsed 
## 0.255   0.033   0.287  --- *with* the patch on my desktop (16 GB)
## 1.489   0.031   1.522  --- for R-patched (i.e. w/o the patch)
[1]    128 139968
154791504 bytes

---

BTW: These example would gain tremendously if I finally got
     around to provide

   model.matrix(........, sparse = TRUE)

which would then produce a Matrix-package sparse matrix.

Even for this somewhat small case, a sparse matrix is a factor
of 13.5 x smaller :
[1] 13.47043

I'm happy to collaborate with you on adding such a (C level)
interface to sparse matrices for this case.

Martin Maechler
#
Thanks.

Couldn't you implement model.matrix(..., sparse = TRUE)  with a small
amount of R code similar to MatrixModels::model.Matrix ?

On Mon, Feb 29, 2016 at 10:01 AM, Martin Maechler
<maechler at stat.math.ethz.ch> wrote: