I would like to fit a multinomial logistic regression to a large
data set, but I do not know how. I have thought of only a few
possibilities, and I am writing to seek advice on them and on how to
deepen or expand my search.
On smaller data sets, I have successfully loaded the data and issued
commands such as:
length(levels(factor(data$response)))
[1] 6 # implies polychotomy
library(nnet)
result <- multinom(data$response ~ 1 + data$var1 + data$var2 + ...)
# (I am interested in at most ten
# parameters; usually less than six)
For a 60-MB comma-separated-values text-format data file (with a few
hundred thousand records), object.size(data) returns roughly 86 MB. Now
I am considering loading a 7-GB data file (with about 30 million
records). (In the near future, I may be interested in loading a 50-GB
data file, but right now I am still trying things out on smaller sets.)
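As a rough check on whether the 7-GB file could be held in memory at all, the figures above can be scaled linearly, and the file can instead be read in pieces that keep only the needed columns. The following is a minimal sketch under assumptions: the temporary file, the column names response, var1, var2 and unused, and the chunk size of 250 rows are all made up for illustration, and the linear scaling ignores differences in column types.

```r
# Scale the observed in-memory size (86 MB for a 60-MB file) to a 7-GB file.
est_gb <- 86 / 60 * 7          # about 10 GB, more than the 8 GB of RAM

# A small CSV stands in for the large file (hypothetical columns).
set.seed(1)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(response = sample(1:6, 1000, replace = TRUE),
                     var1 = rnorm(1000), var2 = rnorm(1000),
                     unused = "x"),
          tmp, row.names = FALSE)

# Read the header once, then pull fixed-size chunks from the open
# connection; colClasses = "NULL" drops a column while reading.
con <- file(tmp, open = "r")
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
classes <- c("integer", "numeric", "numeric", "NULL")
rows_seen <- 0
counts <- integer(6)           # running tally of the response levels
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = 250,
             col.names = header, colClasses = classes),
    error = function(e) NULL)  # read.csv signals an error at end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  rows_seen <- rows_seen + nrow(chunk)
  counts <- counts + tabulate(chunk$response, nbins = 6)
}
close(con)
```

Only one chunk is in memory at a time, so any statistic that can be accumulated (here a tally of the response levels) can be computed without loading the whole file.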
What should I do?
1. I recall some discussion from August 2006 about the use of the biglm
package. (Subject: lean and mean lm/glm?) This seems potentially very
useful, but it's not clear to me how to fit a multinomial response. Can
I get bigglm to fit polychotomous data?
2. Earlier, I thought I ran across an example (perhaps in V&R's MASS4 or
Harrell's Regression Modeling Strategies) showing how to use glm and an
appropriate family specification to perform a multinomial logistic
regression, but now I cannot find the example. This is what had to be
done before the multinom() function became available, and it still
works, but I need a reference or example --- can anyone point me to it?
I suspect part of my problem is that I do not understand the
documentation on 'family': I'm not sure what the 'object' argument is.
It is defined as:
"object: the function family accesses the family objects which are
stored within objects created by modelling functions (e.g., glm)."
My impression is that glm() returns a glm object. I'm not sure what to
write there.
If the example doesn't exist, my brain may have [wishfully] inserted the
"multinomial" into my memory. It's clear that glm can be used for
[ordinary/binomial] logistic regression.
3. I have skimmed Chen & Ripley's papers on computing near the data, but
suspect that I will need to do quite a lot of work (read: careful
reading, hand holding, and development) to adapt their solution.
4. I have briefly browsed the documentation on setting larger memory
size flags, but suspect that this is not a scalable route. My desktop
WinXP PC has 2 GB of RAM; a Linux computer I prefer has 8 GB, and I
suspect both copies of R were compiled as 32-bit (but I don't know how
to verify this).
box$ uname -a
Linux box 2.4.21-32.0.1.ELsmp #1 SMP Tue May 17 17:52:23 EDT 2005 i686
i686 i386 GNU/Linux
box$ R --max-vsize='4G'
WARNING: --max-vsize=4G=4'M': too large and ignored
5. If all else fails, I can sample the data and check the sample for an
appropriate distribution.
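On the 'family' question in item 2, a small sketch may help; it uses simulated data and made-up variable names. In a call to glm(), 'family' is an argument that chooses the error distribution and link. The generic function family() is a different thing: its 'object' argument is an already fitted model, and it returns the family that model used. None of the families listed in ?family is multinomial.

```r
# Simulated binary data (hypothetical; not from the mortgage file).
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.5 + d$x))

# Here 'family' is an argument to glm(): binomial with a logit link.
fit <- glm(y ~ x, family = binomial(link = "logit"), data = d)

# Here 'object' is the fitted model; family() extracts what was used.
fam <- family(fit)
fam$family   # "binomial"
fam$link     # "logit"
class(fit)   # c("glm", "lm")
```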
Richard
212-933-3305 / richard.c.yeh at bankofamerica.com
NOTICE TO RECIPIENTS: Any information contained in or attached to this message is intended solely for the use of the intended recipient(s). If you are not the intended recipient of this transmittal, you are hereby notified that you received this transmittal in error, and we request that you please delete and destroy all copies and attachments in your possession, notify the sender that you have received this communication in error, and note that any review or dissemination of, or the taking of any action in reliance on, this communication is expressly prohibited.
Banc of America Securities LLC ("BAS") does not accept time-sensitive, action-oriented messages or transaction orders, including orders to purchase or sell securities, via e-mail.
Regular internet e-mail transmission cannot be guaranteed to be secure or error-free. Therefore, we do not represent that this information is complete or accurate, and it should not be relied upon as such. If you prefer to communicate with BAS using secure (i.e., encrypted) e-mail transmission, please notify the sender. Otherwise, you will be deemed to have consented to communicate with BAS via regular internet e-mail transmission. Please note that BAS reserves the right to intercept, monitor, and retain all e-mail messages (including secure e-mail messages) sent to or from its systems as permitted by applicable law.
----------------------------------------------------------------------
IRS Circular 230 Disclosure:
Bank of America Corporation and its affiliates, including BAS, ("Bank of America") do not provide tax advice. Accordingly, any statements contained herein as to tax matters were neither written nor intended by the sender or Bank of America to be used and cannot be used by any taxpayer for the purpose of avoiding tax penalties that may be imposed on such taxpayer. If any person uses or refers to any such tax statement in promoting, marketing or recommending a partnership or other entity, investment plan or arrangement to any taxpayer, then the statement expressed above is being delivered to support the promotion or marketing of the transaction or matter addressed, and the recipient should seek advice based on its particular circumstances from an independent tax advisor.
multinom(nnet) analogy for biglm package?
4 messages · Yeh, Richard C, Brian Ripley
OK, well, seeing Thomas Lumley's post earlier today, I figured out the answer to #4:
gc()
           used  (Mb) gc trigger  (Mb)  max used  (Mb)
Ncells  1115191  29.8    3469679  92.7  13981968 373.4
Vcells 14796791 112.9   79783730 608.8 124640525 951.0
c <- rnorm(1e9)
Error in rnorm(1e+09) : cannot allocate vector of length 1000000000
I am using:
R version 2.4.0 Patched (2006-10-03 r39576)
212-933-3305 / richard.c.yeh at bankofamerica.com
-----Original Message-----
suspect both copies of R were compiled as 32-bit (but I don't know how
to verify this).
Here is the direct way:
.Machine$sizeof.pointer
[1] 8
on a 64-bit system. You can also figure it out from the size of the Ncells, clearly 28 bytes in your example.
You seem to believe a multinomial logistic regression is a GLM: it is not.
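Both checks mentioned above can be tried directly. The following is a short sketch, with the gc() figures copied from the earlier message; the variable names are made up. According to ?gc, a cons cell usually takes 28 bytes on a 32-bit build of R and 56 bytes on a 64-bit build, so dividing the Ncells megabytes by the number of cells in use reveals the build.

```r
# Pointer size in bytes: 4 on a 32-bit build of R, 8 on a 64-bit build.
bits <- 8 * .Machine$sizeof.pointer

# Figures copied from the gc() output quoted in this thread.
ncells_used <- 1115191
ncells_mb <- 29.8

# Bytes per cons cell; about 28 here, which points to a 32-bit build.
bytes_per_cell <- ncells_mb * 1024^2 / ncells_used
round(bytes_per_cell)
```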
On Thu, 21 Dec 2006, Yeh, Richard C wrote:
OK, well, seeing Thomas Lumley's post earlier today, I figured out the answer to #4:
gc()
           used  (Mb) gc trigger  (Mb)  max used  (Mb)
Ncells  1115191  29.8    3469679  92.7  13981968 373.4
Vcells 14796791 112.9   79783730 608.8 124640525 951.0
c <- rnorm(1e9)
Error in rnorm(1e+09) : cannot allocate vector of length 1000000000
I am using:
R version 2.4.0 Patched (2006-10-03 r39576)
212-933-3305 / richard.c.yeh at bankofamerica.com
-----Original Message-----
suspect both copies of R were compiled as 32-bit (but I don't know how
to verify this).
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Dear Prof. Ripley,
Many thanks for your reply, especially during the holiday season!
From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
You seem to believe a multinomial logistic regression is a GLM: it is
not.
With one line, you've struck the heart of one of my problems.
As a follow-up question, would you (or anyone else) care to comment on
the feasibility of using R to perform multinomial logistic regression on
a large data set?
The scale of the problem is similar to the one you and Fei Chen treated:
instead of insurance policies, I am considering residential mortgages.
I read that you and Fei Chen used a generalized linear model, which I
understand you are saying does not apply to my approach. But it's not
obvious to me why I cannot just guess some coefficients, partition the
large data file into digestible excerpts, score each excerpt with a
log-likelihood function, combine the scores and gradients, and iterate
the guess. Is that totally different from what you described in
"Statistical Computing and Databases" (Proc. DSC 2003)?
Thanks again for helping me as I consider how to approach my problem!
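The scheme asked about above (guess coefficients, score each excerpt, combine the scores and gradients, iterate) can be sketched in plain R for a baseline-category multinomial logit. This is only an illustration under assumptions, not the method of the DSC 2003 paper: the data are simulated, the in-memory list chunks stands in for excerpts read from the large file, and the function names chunk_score and total_score are made up.

```r
# Log-likelihood and gradient of a baseline-category multinomial logit
# for one chunk. beta is a p x (K-1) matrix; class 1 is the baseline.
chunk_score <- function(beta, X, y, K) {
  eta <- cbind(0, X %*% beta)                # n x K linear predictors
  m <- apply(eta, 1, max)                    # row maxima, for stability
  lse <- m + log(rowSums(exp(eta - m)))      # row-wise log-sum-exp
  P <- exp(eta - lse)                        # fitted probabilities
  Y <- diag(K)[y, , drop = FALSE]            # n x K indicator matrix
  list(loglik = sum(eta[cbind(seq_along(y), y)] - lse),
       grad = crossprod(X, (Y - P)[, -1, drop = FALSE]))
}

# Sum the per-chunk contributions; only one chunk is touched at a time.
total_score <- function(par, chunks, p, K) {
  beta <- matrix(par, p, K - 1)
  out <- list(loglik = 0, grad = matrix(0, p, K - 1))
  for (ch in chunks) {
    s <- chunk_score(beta, ch$X, ch$y, K)
    out$loglik <- out$loglik + s$loglik
    out$grad <- out$grad + s$grad
  }
  out
}

# Simulated data standing in for the large file (hypothetical values).
set.seed(42)
n <- 3000; p <- 3; K <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))
beta_true <- matrix(c(0.5, 1, -1, -0.5, 0.5, 1), p, K - 1)
eta <- cbind(0, X %*% beta_true)
prob <- exp(eta) / rowSums(exp(eta))
y <- apply(prob, 1, function(pr) sample.int(K, 1, prob = pr))

# Six excerpts of 500 rows each.
idx <- split(seq_len(n), rep(1:6, length.out = n))
chunks <- lapply(idx, function(i) list(X = X[i, , drop = FALSE], y = y[i]))

# Iterate the guess: minimise the negative summed log-likelihood.
fit <- optim(rep(0, p * (K - 1)),
             fn = function(b) -total_score(b, chunks, p, K)$loglik,
             gr = function(b) -as.vector(total_score(b, chunks, p, K)$grad),
             method = "BFGS")
beta_hat <- matrix(fit$par, p, K - 1)
```

Every evaluation of the objective or its gradient is a full pass over all excerpts, so with chunks read from disk the running time is governed by the number of passes the optimizer needs.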
Richard
212-933-3305 / richard.c.yeh at bankofamerica.com