More seriously, the approach to speeding up model fitting that has been most successful to date is to speed up the BLAS
(Basic Linear Algebra Subroutines), especially the Level-3 BLAS. The bulk of the computation in the Matrix package takes
place in either Lapack (for dense matrices) or CHOLMOD (for sparse matrices) code and those are based on calls to the
Levels 1, 2 and 3 BLAS. The Atlas package and K. Goto's BLAS are designed to obtain the highest level of performance
possible from the CPU on these routines. I think the easiest way of incorporating the power of the GPU into the model
fitting process would be to port the BLAS to the GPU. I also imagine that someone somewhere has already started on that.