I would like to use R to perform a logistic regression with about 30,000 independent variables. That's right, thirty thousand. Most will be irrelevant: the intent is to use the regression to identify the few that actually matter.

Among other things, this calls for giving glm() a colossal "y ~ ..." formula with thirty thousand summed terms on its right hand side. I build up the formula as a string and then call as.formula() to convert it. Unfortunately, the conversion fails: the parser reports that it has overflowed its stack. :-(

Is there any way to pull this off in R? Can anyone suggest alternatives to glm(), or to R itself, that might be capable of handling a problem of this size? Or am I insane to even be considering an analysis like this?

Thanks!

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe" (in the "body", not the subject!)
To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
building a formula for glm() with 30,000 independent variables
13 messages · Ben Liblit, Kenneth Cabrera, Brian Ripley +6 more
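As an aside on the mechanical problem in the opening message: the as.formula() stack overflow can be sidestepped entirely, because glm() accepts the shorthand `y ~ .`, which expands to "all other columns of the data frame" without ever building a 30,000-term formula string. A minimal sketch (hypothetical names: data frame `d` holding the 0/1 response `y` and the predictor columns, scaled down so it runs):

```r
## Sketch: avoid building a huge formula string.
## Assumes the data live in a data frame `d` whose column `y` is the
## 0/1 response and whose remaining columns are the predictors.
set.seed(1)
d <- data.frame(y = rbinom(50, 1, 0.5),
                matrix(rnorm(50 * 10), nrow = 50))  # 10 toy predictors

## "y ~ ." means "y against everything else in d" -- no giant string needed.
fit <- glm(y ~ ., data = d, family = binomial)
```

This fixes only the parser problem; as the replies below make clear, fitting with far more predictors than observations still fails statistically.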
Just a silly question: if you have 30,000 independent variables, how many records do you have? Have you considered projection pursuit? Best regards
Ben Liblit wrote:
[Ben's original message quoted in full]
Kenneth Roy Cabrera Torres
Celular +57 (315) 405 9339
Well, the theory of perceptrons says you will find perfect discrimination with high probability even if there is no structure unless n is well in excess of 2p. So you do have 100,000 units? If so you have many gigabytes of data and no R implementation I know of will do this for you. Also, the QR decomposition would take a very long time. You could call glm.fit directly if you could form the design matrix somehow but I doubt if this would run in an acceptable time.
On Sun, 10 Nov 2002, Ben Liblit wrote:
[Ben's original message quoted in full]
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
On Sun, 10 Nov 2002 06:28:51 -0800
Ben Liblit <liblit at eecs.berkeley.edu> wrote:
[Ben's original message quoted in full]

It would be worth doing a simulation first to see if ANY statistical properties of the resulting estimates or P-values work as advertised with your setup. I would expect severe biases, lack of preservation of type I error, and low probability of selecting the "correct" variables. The "few that actually matter" will likely be those whose estimates are made with the most error.
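A minimal version of the simulation Frank suggests, with the sizes scaled down so it runs quickly (40 and 60 standing in for 3,300 and 30,000; everything here is pure noise):

```r
## Pure-noise check: with more predictors than observations, even random
## predictors "explain" the response perfectly, so any variables flagged
## as important are artifacts.
set.seed(1)
n <- 40; p <- 60                      # toy stand-ins for n = 3,300, p = 30,000
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)                # response is unrelated to X by design

fit <- glm(y ~ X, family = binomial)  # expect convergence / separation warnings
range(fitted(fit))                    # fitted probabilities pile up near 0 and 1
```

The warnings about fitted probabilities of 0 or 1 are the perfect-separation symptom that Brian Ripley's perceptron argument predicts.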
Frank E Harrell Jr
Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem.
Dept. of Health Evaluation Sciences
U. Virginia School of Medicine
http://hesweb1.med.virginia.edu/biostat
Brian D. Ripley <ripley at stats.ox.ac.uk> wrote:
the theory of perceptrons says you will find perfect discrimination with high probability even if there is no structure unless n is well in excess of 2p. So you do have 100,000 units?
I do not. Not by a long shot. Hmmm. Looks like the problems with R are not as significant as the problems with my own understanding of how to model my data in a sound manner. Thanks for the warning, Brian. I'll have to give this some more thought.
Kenneth Cabrera wrote:
If you have 30,000 independent variables, how many records do you have?
Not nearly enough, as Brian Ripley has pointed out.
Have you consider projection pursuit?
I hadn't, but perhaps I should. Can you tell me more about what this is, or point me at a good tutorial? [I'm a real novice at statistical analysis.]
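For what it's worth, projection pursuit regression is available in R itself as ppr() (in the modreg package at the time of this thread, later folded into the standard stats package). A toy sketch, with hypothetical data, just to show the shape of a call:

```r
## Toy projection pursuit regression with ppr(): the model seeks a small
## number of linear projections of x that predict y well.
## (ppr() lived in package modreg circa 2002; it is in stats in modern R.)
set.seed(1)
x <- matrix(rnorm(200 * 5), 200, 5)
y <- x[, 1]^2 + rnorm(200, sd = 0.1)  # signal lies along one direction

fit <- ppr(y ~ x, nterms = 2)         # fit two ridge terms
summary(fit)                          # shows the projection directions found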
If you really have enough cases, isn't it first of all a technical challenge ;-) (and what about the statistical sense?). My suggestion: try Weka (www.cs.waikato.ac.nz/ml/weka/) on a very big Linux machine, together with its "database for large applications" support; perhaps the Attribute Evaluator used as a meta-classifier (logistic regression is possible, too!) would be an interesting tool for you. I don't know whether Weka can actually handle this — it may crash as well — but in the past there were some threads (search the mailing list archive) about data-mining jobs with ~2 GB of data. Good luck!

regards, christian

----- Original Message -----
From: "Ben Liblit" <liblit at eecs.berkeley.edu>
To: <r-help at stat.math.ethz.ch>
Sent: Sunday, November 10, 2002 3:28 PM
Subject: [R] building a formula for glm() with 30,000 independent variables
[Ben's original message quoted in full]
On Sun, 10 Nov 2002, Ben Liblit wrote:
[Ben's original message quoted in full]
You could call glm.fit directly. It takes a design matrix rather than a formula. You'll still have some trouble with inverting the 30,000 x 30,000 information matrix, and of course it will be singular if you don't have 30,000 observations. Since there are 2^30000 possible submodels, finding the good ones isn't going to be easy either. My guess is that you want to do something else, but I don't know what.

-thomas
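Thomas's glm.fit route, sketched on toy dimensions (hypothetical names throughout; note that glm.fit expects the design matrix to carry its own intercept column, since no formula machinery adds one):

```r
## Calling glm.fit directly with a design matrix, bypassing formulas.
set.seed(2)
n <- 500; p <- 20                          # toy sizes; the real p would dwarf n
X <- cbind(Intercept = 1,                  # intercept column added by hand
           matrix(rnorm(n * p), n, p))
beta <- c(-1, 2, rep(0, p - 1))            # only the first predictor matters
y <- rbinom(n, 1, plogis(X %*% beta))      # simulate a 0/1 response

fit <- glm.fit(X, y, family = binomial())  # no formula, no parser limits
head(fit$coefficients)
```

This removes the formula parser from the picture entirely; the cost is that you lose the formula conveniences and, as Thomas notes, none of the rank-deficiency or model-search problems go away.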
If the design matrix for such a model is dense then this sounds quite hopeless, but if (as I suspect) it is quite sparse, then glm could be adapted along the lines of the slm.fit.csr code in the SparseM package submitted a few weeks ago. I've run penalized l1 estimation problems of this size, say n = 120,000 by p = 30,000, on my now antiquated Ultra 2 with half a gig of memory. The trick of course is that only the nonzero elements of X are stored, and there are only about 500,000 of these in my applications.

Roger Koenker, email roger at ysidro.econ.uiuc.edu
Department of Economics, University of Illinois, Champaign, IL 61820
url: http://www.econ.uiuc.edu, vox: 217-333-4558, fax: 217-244-6678
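The storage saving Roger describes can be sketched with SparseM's compressed sparse row format (the IRLS adaptation of glm he mentions is not shown here; the matrix below is a toy stand-in for the real counter data):

```r
## Sparse storage with SparseM: only the nonzero entries of X are kept.
library(SparseM)

X <- matrix(0, 1000, 200)              # mostly-zero toy design matrix
X[sample(length(X), 500)] <- 1         # ~500 nonzero entries, as in Roger's data

Xs <- as.matrix.csr(X)                 # compressed sparse row representation
object.size(Xs)                        # far smaller than object.size(X)
```

Sparse counter data like Ben's (most instrumented predicates are never observed in most runs) is exactly the case where this representation pays off.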
On Sun, 10 Nov 2002 ripley at stats.ox.ac.uk wrote:
[Brian Ripley's reply and Ben's original message, quoted in full]
You have not really given enough background to enable much help to be given. I have several ideas, but how sensible they might be depends on your situation. Several people have asked how many cases you have, and so do I.

It seems to me unlikely that you have 30,000 predictors that are all qualitatively different. I imagine that you must have a family or families of related predictors that are themselves described by numerical parameters. Knowledge of any structure on the predictors may suggest strategies for choosing representative predictors.

A crude measure of the strength of a predictor X might be obtained by calculating a two-sample t statistic for the X values of the 0 and 1 responses. Once a small set of strong predictors has been found you might like to fit the model based on them and calculate the deviance residuals. Then you could treat these residuals as responses and consider variable selection in *linear* models predicting them.

Murray Jorgensen
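Murray's crude screening step, sketched on hypothetical objects (a numeric predictor matrix `X` and a 0/1 response `y`; constant, all-zero counter columns would need to be dropped first, since t.test errors on essentially constant data):

```r
## Two-sample t statistic per predictor: compare each column's values in
## crashing (y == 1) vs non-crashing (y == 0) runs, then rank by |t|.
screen.t <- function(X, y) {
  apply(X, 2, function(col) t.test(col[y == 1], col[y == 0])$statistic)
}

tstats    <- screen.t(X, y)
strongest <- order(abs(tstats), decreasing = TRUE)[1:20]  # top 20 candidates
```

The handful of surviving predictors could then be fed to an ordinary glm() fit, as Murray suggests.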
Ben Liblit wrote:
[Ben's original message quoted in full]
Dr Murray Jorgensen, http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz, Fax 7 838 4155
Phone +64 7 838 4773 wk, +64 7 849 6486 home, Mobile 021 395 862
Murray Jorgensen wrote:
You have not really given enough background to enable much help to be given.
In a way, that was intentional. I was hoping that my problem was merely a matter of proper R usage. But several of you have politely pointed out that my underlying thinking about the statistics itself is flawed. With 30,000 predictors and an order of magnitude *fewer* observations, I should expect to find a bogus but perfectly predictive model even if everything were random noise.
Knowledge of any structure on the predictors may suggest strategies for choosing representative predictors.
Understood. Since my statistics background is so weak, perhaps it would be wise at this point to explain what exactly I am trying to accomplish, and thereby better leverage this list's expertise.

The "predictors" here are randomly sampled observations of the behavior of a running program. We decide in advance what things to observe. For example, we might decide to check whether a particular pointer on a particular line is null. That would give us two counters: one telling us how many times it was seen to be null, and one telling us how many times it was seen to be not null.

A similar sort of instrumentation would be to guess that a pair of program variables might be related in some way. At random intervals we check their values and increment one of three counters depending on whether the first is less than, equal to, or greater than the second. So that would give us a trio of related predictors.

We don't update these counters every time a given line of code executes, though: we randomly sample perhaps 1/100 or 1/1000. The samples are fair in the sense that each sampling opportunity is taken or skipped randomly and independently from each other opportunity.

The "dependent outcome" is whether the program ultimately crashes or exits successfully. The goal is to identify those program behaviors which are strongly predictive of an eventual crash. For example, if the program has a single buffer overrun bug, we might discover that the "(index > limit) on line 196" counter is nonzero every time we crash, but is zero for most runs that do not crash. (Most, but not all. Sometimes you can overrun a buffer but not crash. "Getting lucky" is part of what we're trying to express in our model.)

In my current experiment, I have about 10,000 pairs of program variables being compared with a "less", "equal", and "greater" counter for each. Thus, 30,000 predictors. Almost all of these should be irrelevant. And they certainly are not independent of each other.
Looking along the other axis, I've got about 3300 distinct program runs, of which roughly one fifth crash. I have complete and perfect counter information for all of these runs, which I can easily postprocess to simulate sampled counters with any desired sampling density.

I'm getting the distinct impression that a standard logistic regression with 30,000 predictors is *not* a practical approach. What should I be using instead? I'm frustrated by the fact that while the problem seems conceptually simple enough, I just don't have the statistics background required to know how to solve it correctly. If any of you have any suggestions, I certainly welcome them.

Thank you, one and all. You've been quite generous with your advice already, and I certainly do appreciate it.
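The postprocessing Ben describes — thinning complete counts down to a chosen sampling density — matches binomial thinning exactly, since each sampling opportunity is kept or skipped independently. A sketch, assuming a hypothetical matrix `counts` of complete counter values (runs by predictors):

```r
## Binomial thinning: each of the `count` opportunities survives
## independently with probability `density`, which is exactly the fair
## sampling scheme Ben describes.
sample.counters <- function(counts, density = 1/100) {
  thinned <- rbinom(length(counts), size = as.vector(counts), prob = density)
  matrix(thinned, nrow = nrow(counts), dimnames = dimnames(counts))
}

## e.g. sparse <- sample.counters(counts, density = 1/1000)
```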
Moderator / list boss: should this be taken off list? I am interested in the current thinking on this issue (of the acceptability of methodological questions), since I myself recently posted a statistical/methodological question. imho, a certain amount of this sort of posting is a "good thing" - it enlivens the list, and implementation and theoretical matters are often intertwined - but then I have to declare an interest .. I'd like to be able to post interesting or challenging problems<g>. So, perhaps some general guidance as to the admissibility of such enquiries could be (re)posted.

re Ben Liblit's problem: I don't quite understand how you get 30,000 variables out of 10,000 code instrumentation points, but no matter... 10K, 30K, all the same difference. You have a simple 0/1 dependent variable (runs OK, crashes) which has a mean of about 0.2. That's OK .. it means that the crash is not such a rare event that you need to consider differential subsampling, not immediately anyway.

Why you would want to fit an "additive" model in this context is beyond me .. do you believe that variable x contributes something and variable z something else, and that if you add those effects together you are likely to get a better prediction (of a crash)? I can't imagine it. Consider a simple circuit in which there are several switches along several paths and a lamp which either glows or it does not (success or failure). You MIGHT be able to build a model that implies that switch X contributes a certain amount to "glowingness", and that model MIGHT fit the data OK, but other than as a guide to further exploration it is not much use.

Surely most crashes have a single cause (or perhaps a chain of causes - some interaction effects), or there is a multiplicity of disparate causes. If your focus is on fault isolation - and assuming you only have this post hoc dataset to work with, rather than having control over the process - then fitting additive models does not imho make a lot of sense.
So why not "screen" your predictors for "significance" in the first instance? This is essentially what recursive partitioning procedures do .. look at rpart. Use a simple t test or some such and throw out all those that appear to have no influence .. if you end up with more than a handful of variables after this weeding-out stage, then I suggest that you either have a very buggy program or the process is inherently too complex to analyze.

Worried about throwing out interaction effects if you discard all variables that appear non-influential at the first cut? Don't be. The likelihood of an interaction effect being substantive when the main effects are not is quite small.
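John's rpart suggestion, sketched on hypothetical objects (a data frame `runs` with a 0/1 `crash` column and one column per counter):

```r
## Recursive partitioning screens predictors implicitly: the tree only
## ever splits on variables that carry signal about crashing, so the
## splitting variables form a short list of candidate culprits.
library(rpart)

fit <- rpart(crash ~ ., data = runs, method = "class")
printcp(fit)   # complexity table: which counters the tree actually used
plot(fit); text(fit)
```

Unlike the additive logistic model, a tree also captures the "chain of causes" interactions John mentions, since each split is conditional on the splits above it.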
[Ben's reply to Murray Jorgensen, quoted in full]
John Aitchison
John Aitchison <jaitchis at hwy.com.au> wrote:
I am interested in the current thinking on this issue (of the acceptability of methodological questions)
I had the same concern, which is why my first posting tried to avoid giving many details about the analysis itself and focused instead on R usage questions. But I guess the prospect of 30K variables got folks' attention, mostly of the "why in the world are you doing that" variety. My methodology was so clearly absurd that many list members couldn't resist getting involved. :-)
Why you would want to fit an "additive" model in this context is beyond me .. do you believe that x variable contributes something and z variable something else and that if you add those effects together you are likely to get a better prediction (of a crash)?.
Your question is well put. Certainly my system is more of a single-unknown-cause scenario, rather than multiple-contributing-causes. The *more* bad things you do, the more *likely* you are to crash, but ultimately it's only one of those bad things that kills you in any given run. I have to plead ignorance here. I set out after a logistic regression because that's what the statisticians I consulted told me to use. I wish I knew enough to be a more critical consumer of such advice. But I'm a compiler guy, not a statistician, and as such I may be too easily led astray in this domain.
So why not "screen" your predictors for "significance" in the first instance? [...] Use a simple t test or some such and throw out all those that appear to have no influence
Others on this list have offered similar advice, and I am presently working on implementing precisely this approach. Thank you for echoing this advice, and even more so for posing questions to challenge my assumptions and thereby educate me in the best Socratic tradition. Cheers!