LM: Least Squares on Large Datasets OR why lm() is designed the way it is
Hi,

I have always wondered why S-Plus/R cannot fit a linear model to an arbitrarily large data set, given that, I thought, it should be pretty straightforward. Some time ago I came across a reference to the LM package, http://www.econ.uiuc.edu/~anovo/LM.html, by Roger Koenker and Alvaro Novo. So I thought here it is at last, but to my surprise this project hasn't made it into the recommended packages and its development seems to have stopped. I take this as strong evidence that there is a conceptual problem in doing this sort of thing, and I thought it would be very educational for me to understand it.

Here is how I would structure the lm object; please feel free to point out mistakes. Suppose we want to analyze lm(Y ~ X), where Y is a vector and X is a matrix.

1. Under the classical assumptions of normality and independence of the residuals, all information about the model is encapsulated in the covariance matrix of [Y,X] and the observation count, i.e. length(Y) (plus the column means of [Y,X] if the model has an intercept). This includes the variances of the coefficients, their significance levels, the ability to compute predictions, etc. Moreover, all sub-models, i.e. regressions on any subset of the columns of X, are also readily computable, as is ANOVA. Given this, I'd store the covariance matrix of [Y,X] and the count on the lm object and write the summary.lm, anova.lm, step and stepAIC functions in terms of these members only. I guess this is the idea behind the LM package.

2. There is a whole lot of tests designed to check the classical assumptions of normality of the residuals, detect influential points, etc. Obviously these cannot be carried out without the residuals. So the lm object should provide a slot for the residuals, but whether the residuals are in fact computed should not affect the functions mentioned in the previous paragraph.

I would appreciate any comments on this "design".
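To make point 1 concrete, here is a minimal sketch of what I have in mind. It uses simulated data and assumes a model with an intercept; all variable names are just for illustration and nothing here is taken from the LM package. Note that for the intercept the sketch also keeps the column means of [Y,X], since the covariance matrix alone does not determine it.

```r
# Sketch: recover OLS estimates from summary statistics only.
# Simulated data; every name below is illustrative.
set.seed(1)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
Y <- drop(1 + X %*% c(2, -0.5) + rnorm(n))

# The only things kept from the data: means, covariance of [Y, X], and n.
Z <- cbind(Y = Y, X)
m <- colMeans(Z)
S <- cov(Z)                      # uses divisor n - 1
p <- ncol(X)

Sxx <- S[-1, -1, drop = FALSE]   # covariance among the predictors
Sxy <- S[-1, 1]                  # covariance of predictors with Y

beta  <- solve(Sxx, Sxy)                 # slope estimates
alpha <- m[1] - sum(m[-1] * beta)        # intercept (needs the means)

# Residual sum of squares and coefficient standard errors
rss     <- (n - 1) * (S[1, 1] - sum(Sxy * beta))
sigma2  <- rss / (n - p - 1)
se_beta <- sqrt(diag(sigma2 * solve(Sxx) / (n - 1)))

# Compare with lm() on the full data
fit <- lm(Y ~ X)
all.equal(unname(c(alpha, beta)), unname(coef(fit)))
all.equal(unname(se_beta),
          unname(summary(fit)$coefficients[-1, "Std. Error"]))
```

A sub-model on a subset of the columns of X would use the corresponding rows and columns of S in the same way, without going back to the data.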
Thanks,
Vadim