
Help/information required

2 messages · Ajay Arvind Rao, David Winsemius

#
Hi,

We are using the open-source license of R to analyze data at our organization. The system configuration is as follows:

*        System configuration:

o   Operating System - Windows 7 Enterprise SP1, 64-bit (Desktop)

o   RAM - 8 GB

o   Processor - i5-6500 @ 3.2 GHz

*        R version:

o   RStudio 1.0.136

o   R 3.4.0

While trying to merge two datasets, we received the following resource error message on running this code:
Code: merg_data <- merge(x=Data_1Junto30Jun,y=flight_code,by.x="EB_FLNO1",by.y="EB_FLNO1",all.x = TRUE)
Error Message: Error: cannot allocate vector of size 5.8 Gb

Later we tried running the code differently, but the error remained:
Code: merg_data <- sqldf("Select * from Data_1Junto30Jun as a inner join flight_code as b on a.EB_FLNO1=b.EB_FLNO1")
Error Message: Error: cannot allocate vector of size 200.0 Mb

We upgraded the RAM to 8 GB a couple of months back. Can you let us know what options there are to resolve the above issue without increasing the RAM further? The sizes of the datasets are as follows:

*        Data_1Junto30Jun (513476 obs of 32 variables). Data size - 172033368 bytes / 172 MB

*        flight_code (478105 obs of 2 variables). Data size - 3836304 bytes / 4 MB


Help with determining system requirements:
Is there a way to determine the minimum system requirements (hardware and software) based on the size of the data, the way the data is loaded into R (directly from a server or from a flat file), and the type of analysis to be run?
We have not been able to find any specific information on this and are estimating the requirements by trial and error. Any information on this front would be helpful.


Thanks,
Ajay Rao
Strategy & Planning Group (SPG)
GMR Hyderabad International Airport Ltd.
Landline No: +91 40 66604051 / +91-9723465186
Email ID: AjayArvind.Rao at gmrgroup.in


#
There are some packages for working with data "out of memory": see bigmemory and the other "big*" packages, and also the data.table package, which has many satisfied users. There are also several packages for handling data through database connections; that would probably be the preferred method for your use case.
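As a minimal sketch (with made-up stand-in data; only the column name EB_FLNO1 is taken from your post), here is the same left join done with data.table, which is generally much more memory-efficient than base merge():

```r
library(data.table)

## made-up stand-ins for the two data sets in the original post
Data_1Junto30Jun <- data.table(EB_FLNO1 = c("AI101", "AI102", "AI103"),
                               pax      = c(150, 180, 120))
flight_code      <- data.table(EB_FLNO1 = c("AI101", "AI103"),
                               code     = c("DOM", "INT"))

setkey(Data_1Junto30Jun, EB_FLNO1)
setkey(flight_code, EB_FLNO1)

## X[Y] is a keyed join: equivalent to merge(..., all.x = TRUE)
merg_data <- flight_code[Data_1Junto30Jun]
merg_data
```

If you would rather stay with sqldf, note that passing dbname = tempfile() tells it to build the join in an on-disk SQLite database rather than in memory.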

R objects are almost always copied when an assignment is made, which means you need, at a minimum, twice as much free memory (in _contiguous_ chunks) as the object itself. You will often be fragmenting that memory with other code and other out-of-R processes. Windows was in the past notorious for poor memory management; I don't know whether Windows 7 continued that tradition or whether later versions avoid the problem.
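As a rough illustration (exact numbers vary by platform), you can inspect how much memory a single object occupies and what R has currently allocated:

```r
## two numeric columns of 500,000 doubles each: roughly 8 MB of data
x <- data.frame(a = rnorm(5e5), b = rnorm(5e5))

print(object.size(x), units = "MB")  # size of this one object
gc()  # runs a garbage collection and reports R's current memory use
```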

A data frame consumes about 8 bytes per row for each numeric (double) column. Character vectors are stored in a global string cache, so the memory they consume depends on the degree of duplication of entries; factors store integer codes plus a table of levels. Duplication also affects merge operations: a merge is effectively a Cartesian product on the matching keys, so if you merge two data frames whose keys contain many duplicates you will often get a message such as: "Error: cannot allocate vector of size 5.8 Gb"
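A toy demonstration (made-up data) of that Cartesian blow-up: three duplicate keys on one side matching two on the other yield 3 * 2 = 6 rows, not 3:

```r
left  <- data.frame(key = rep("A", 3), x = 1:3)
right <- data.frame(key = rep("A", 2), y = 1:2)

m <- merge(left, right, by = "key")
nrow(m)  # 6: every left row pairs with every matching right row
```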

The second error you cite suggests that much of your 8 GB of RAM has been fragmented.

Most of this information should be available by searching R-help or RSeek.
The source of the data makes no difference, but I cannot comment on the type of analysis because that part of the question is too vague (aside from mentioning the Cartesian multiplication of merge results, which often trips up new users of database technology).
That it was not found suggests an impoverished search strategy; see, for example:

https://stackoverflow.com/search?q=%5Br%5D+memory+limitations

https://stackoverflow.com/search?q=%5Br%5D+memory+limitations+windows

http://markmail.org/search/?q=list%3Aorg.r-project.r-help+memory+limitations+windows