Subject: Logistic Regression with 200K features in R?

Dear List,
I'm quite new to R and want to do logistic regression on a data set with 200K
features and around 150 training examples. I'm aware that I should probably use
Naive Bayes instead, but I have a more general question about R's ability to
handle very high-dimensional data.
Please consider the following R code, where "mygenestrain.tab" is a 150 by
200000 matrix:
traindata <- read.table('mygenestrain.tab');
mylogit <- glm(V1 ~ ., data = traindata, family = "binomial");
When executing this code I get the following error:
Error in terms.formula(formula, data = data) :
allocMatrix: too many elements specified
Calls: glm ... model.frame -> model.frame.default -> terms -> terms.formula
Execution halted
Is this because R can't handle 200K features or am I doing something
completely wrong here?
Thanks a lot for your help!
Best regards,
Romeo
On 12.12.2013 09:00, Romeo Kienzler wrote:
> Is this because R can't handle 200K features or am I doing something
> completely wrong here?

It is simply because you can't do a regression with more predictors than
observations.

Cheers,
Eik Vettorazzi
Department of Medical Biometry and Epidemiology
University Medical Center Hamburg-Eppendorf
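Not part of the original thread, but a sketch of one common alternative for the
p >> n setting the replies describe: penalized (lasso) logistic regression with
the glmnet package, which takes the design matrix directly instead of building
a 200000-term model matrix from a formula. The file name and column layout
(response in column V1, coded 0/1) are taken from the original post; everything
else here is an assumption, not a recommendation from the thread.

# A minimal sketch, assuming the response is in column 1 (V1) and coded 0/1,
# and that the remaining 200000 columns are numeric predictors.
library(glmnet)

traindata <- read.table("mygenestrain.tab")
x <- as.matrix(traindata[, -1])   # predictors: all columns except the first
y <- traindata[, 1]               # response: first column (V1)

# Cross-validated lasso-penalized logistic regression; no formula/terms object
# is built, so the terms.formula() error from glm() does not arise.
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.min")     # coefficients at the lambda with smallest CV error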
On 12/12/2013 12:12 PM, Eik Vettorazzi wrote:
> It is simply because you can't do a regression with more predictors than
> observations.

OK, so 200K predictors and 10M observations would work?

Romeo
On 12.12.2013 12:34, Romeo Kienzler wrote:
> OK, so 200K predictors and 10M observations would work?

I thought so (with all the limitations due to collinearity and so on), but
actually there is a limit on the maximum size of an array that is independent
of your memory size and is due to the way arrays are indexed: you can't create
an object with more than 2^31 - 1 = 2147483647 elements.

https://stat.ethz.ch/pipermail/r-help/2007-June/133238.html

Cheers,
Eik
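Not part of the original thread, but a quick check of the numbers being
discussed. This is plain arithmetic; no large objects are created.

.Machine$integer.max        # 2147483647, i.e. 2^31 - 1
200000 * 10e6               # 2e12 elements for a 200000 x 10,000,000 matrix
200000 * 10e6 > 2^31 - 1    # TRUE: far beyond the old per-object element limit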
Dear Eik,

Thank you so much for your help!

Best regards,
Romeo
On 13-12-12 6:51 AM, Eik Vettorazzi wrote:
> There is a limit on the maximum size of an array that is independent of your
> memory size and is due to the way arrays are indexed: you can't create an
> object with more than 2^31 - 1 = 2147483647 elements.
That post is from 2007. The limits were raised considerably when R 3.0.0 was
released; the limit is now 2^48 for disk-based operations and 2^52 for working
in memory.

Duncan Murdoch
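Not part of the original thread, but a small check one can run in R >= 3.0.0 to
see the raised limit in action. Note that this really allocates a vector just
past the old 2^31 - 1 element limit (about 2 GB, since raw elements take one
byte each), so only run it on a machine with memory to spare.

n <- 2^31 + 10    # just past the old per-object limit
x <- raw(n)       # one byte per element (~2 GB); this failed before R 3.0.0
length(x)         # 2147483658, returned as a double rather than an integer
rm(x)             # free the memory again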
On 12.12.2013 13:00, Duncan Murdoch wrote:
> The limits were raised considerably when R 3.0.0 was released; the limit is
> now 2^48 for disk-based operations and 2^52 for working in memory.

Thanks, Duncan, for this clarification. A double-precision matrix with 2e11
elements (as the OP wanted) would need about 1.5 TB of memory; that's more than
a standard (Windows 64-bit) computer can handle.

Cheers,
Eik
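Not part of the original thread, but the back-of-the-envelope arithmetic behind
these figures, done in R. A double takes 8 bytes; note that a full 200,000 by
10,000,000 matrix would have 2e12 elements, ten times the figure quoted above.

2e11 * 8 / 2^40               # ~1.46 TiB for a 2e11-element double matrix
200000 * 10e6 * 8 / 2^40      # ~14.6 TiB for the full 200,000 x 10,000,000 matrix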
On 12/12/2013 7:08 AM, Eik Vettorazzi wrote:
> A double-precision matrix with 2e11 elements (as the OP wanted) would need
> about 1.5 TB of memory; that's more than a standard (Windows 64-bit) computer
> can handle.

According to Microsoft's "Memory Limits" web page (currently at
http://msdn.microsoft.com/en-us/library/windows/desktop/aa366778%28v=vs.85%29.aspx#memory_limits,
but these things tend to move around), the limit is 8 TB for virtual memory.
(The same page lists a variety of smaller physical memory limits, depending on
the Windows version, but R doesn't need physical memory; virtual is good
enough.) R would be very slow if it were working with objects bigger than
physical memory, but it could conceivably work.

Duncan Murdoch