About size of data frames

You know the old phrase that it is not the size that matters so much as how you use it?

The issue here is that lots of (sometimes temporary) copies and variations of your data may come into being as you manipulate it. R is often fairly good at not making full copies of parts that can be shared until one of them is modified. But something as simple as saving the rows that match a query into one data.frame and then saving the ones that don't into another ends up taking about twice as much room, at least until you rm() the original and it gets garbage collected.
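As a rough illustration of that doubling, here is a minimal sketch in base R. The data frame `readings` and its columns are made up for the example, and the sizes reported will vary by machine and R version.

```r
# Hypothetical data: one million rows of fake measurements.
set.seed(1)
readings <- data.frame(id = seq_len(1e6), value = rnorm(1e6))

# Split into rows that match a query and rows that do not.
# Each piece is a new object holding its own copy of the rows.
hits   <- readings[readings$value > 0, ]
misses <- readings[!(readings$value > 0), ]

# Compare the size of the original with the two pieces together.
size_original <- as.numeric(object.size(readings))
size_split    <- as.numeric(object.size(hits)) + as.numeric(object.size(misses))
cat("original (MB):", round(size_original / 1024^2, 1), "\n")
cat("pieces   (MB):", round(size_split / 1024^2, 1), "\n")

# Once the pieces exist, drop the original and ask R to reclaim memory.
rm(readings)
invisible(gc())
```

Until the rm() call, the original and both pieces are all held in memory at once.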

Sometimes you can aggressively remove things as needed, but in the various forms of pipeline it is hard to know when intermediate results go away. Many operations can be used in ways that create a new column rather than overwriting an existing one. The data can grow quite a bit -- especially if, like me, you do odd things and store complex objects, like the results of a statistical calculation or a ggplot object ready to be printed, in a new column.
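To give a feel for how much a column of complex objects can add, here is a small sketch that stores fitted lm() models in a list column. The data frame `d`, its `group` column and the simulated data are all invented for the example.

```r
set.seed(1)

# A tiny data frame: three rows, one plain column.
d <- data.frame(group = c("a", "b", "c"))
size_before <- as.numeric(object.size(d))

# Store a fitted model per row in a new list column.
d$fit <- lapply(seq_len(nrow(d)), function(i) {
  x <- rnorm(100)
  y <- 2 * x + rnorm(100)
  lm(y ~ x)
})
size_after <- as.numeric(object.size(d))

cat("before (bytes):", size_before, "\n")
cat("after  (bytes):", size_after, "\n")
```

Each model object carries its own copy of the data it was fitted on, among other things, so three rows can end up far larger than they look.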

For very large data, plenty of people have created extensions as packages that may be worthwhile, but using a machine with enough memory may be reasonable enough. Otherwise, there are slow ploys in which you perform one or a few steps, write the results to a file, free most of the memory used, read the results back in, and so on.
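One way the write-out-and-read-back ploy might look, using base R's saveRDS() and readRDS(). The steps and the names `step1` and `step2` are placeholders, and the data is far too small to need this; it only shows the shape of the idea.

```r
# A temporary file to hold the intermediate result.
checkpoint <- tempfile(fileext = ".rds")

# Step 1: do some work, then park the result on disk.
step1 <- data.frame(x = 1:1000)
step1$y <- step1$x^2
saveRDS(step1, checkpoint)

# Free the memory before going on.
rm(step1)
invisible(gc())

# Step 2: read the result back only when it is needed again.
step2 <- readRDS(checkpoint)
step2$z <- sqrt(step2$y)

unlink(checkpoint)  # tidy up the temporary file
```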

And, for some purposes, it may make sense to just choose a reasonable and representative subset of the data to work with. What is reasonable depends on you and your situation. Can you get a reasonable snapshot of events by choosing one random measurement per day, for example?
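For instance, picking one random measurement per day could be sketched in base R as below. The half-hourly `readings` data is simulated, and whether such a sample is representative is for you to judge.

```r
set.seed(42)

# Hypothetical data: ten days of half-hourly readings.
times <- seq(as.POSIXct("2025-01-01 00:00", tz = "UTC"),
             by = "30 min", length.out = 48 * 10)
readings <- data.frame(time = times, value = rnorm(length(times)))
readings$day <- as.Date(readings$time, tz = "UTC")

# For each day, pick the row number of one reading at random.
rows_by_day <- split(seq_len(nrow(readings)), readings$day)
pick <- vapply(rows_by_day,
               function(i) i[sample.int(length(i), 1L)],
               integer(1))

# Keep only the chosen rows: one per day.
daily <- readings[unname(pick), ]
print(nrow(daily))
```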

Whatever you decide, I suggest you plan on dealing gracefully with failure. Make a copy of your data and use that in your experiment so that any changes the program makes along the way do not create problems if the program fails for a reason like running out of memory. In that light, it might also be best to run it on a newly rebooted machine with as few other applications running as possible.
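A small sketch of that idea, working on a copy of a data file and catching a failure with tryCatch(). The file names are temporary files made up for the example and the failure is simulated with stop().

```r
# Hypothetical input file standing in for the real data.
data_file <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:5), data_file, row.names = FALSE)

# Work on a copy so the original file is never touched.
work_file <- tempfile(fileext = ".csv")
file.copy(data_file, work_file)

result <- tryCatch({
  d <- read.csv(work_file)
  d$x <- d$x * 2
  write.csv(d, work_file, row.names = FALSE)  # changes land in the copy only
  stop("simulated failure, e.g. running out of memory")
}, error = function(e) {
  message("Run failed: ", conditionMessage(e))
  NULL
})

# The original file is unchanged and can be used for a fresh attempt.
original <- read.csv(data_file)
print(original$x)
```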

And, sometimes a solution is not to do it in R at all. Consider attaching to a database that can hold gigantic amounts of data and using R with a package that lets you run lots of queries on the database remotely. Only import the minimum needed onto your machine that must be done in R and put these results out of the way if they use lots of memory.
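As one possible shape for this, here is a sketch that assumes the DBI and RSQLite packages are installed and uses an in-memory SQLite database as a stand-in for a real remote server. The table and column names are invented. The point is only that the aggregation happens in the database and R receives the small result.

```r
summary_df <- NULL

# Assumption: DBI and RSQLite are available; skip quietly if they are not.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {

  # An in-memory database standing in for a large remote one.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  # Hypothetical table of sensor readings.
  DBI::dbWriteTable(con, "readings",
                    data.frame(sensor = rep(c("s1", "s2"), each = 3),
                               value  = c(1, 2, 3, 10, 20, 30)))

  # Let the database do the work; import only the summary.
  summary_df <- DBI::dbGetQuery(
    con,
    "SELECT sensor, AVG(value) AS mean_value
       FROM readings
      GROUP BY sensor
      ORDER BY sensor"
  )

  DBI::dbDisconnect(con)
  print(summary_df)
}
```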

Just some thoughts. Good Luck.


-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Bert Gunter
Sent: Thursday, August 14, 2025 2:56 PM
To: Rui Barradas <ruipbarradas at sapo.pt>
Cc: r-help at r-project.org; Stefano Sofia <stefano.sofia at regione.marche.it>
Subject: Re: [R] About size of data frames

Rui, et al.:
"real results probably depend on the functions
you want to apply to the data."

Indeed!
I would presume that one would want to analyze such data as time series of
some sort, for which I think long form is inherently "more sensible". If
so, I also would think that you would want columns for data, sensor*,
season, and day, as Rui suggested.  However, note that this presumes no
missing data, which is usually wrong. To handle this, within each day the
rows would need to be in order of the hour the data was recorded (I assume
twice per hour) with a missing code when data was missing.
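A minimal sketch of that layout for one sensor on one day, assuming half-hourly slots and using NA as the missing code. The data frame `observed` and the two dropped slots are invented for illustration.

```r
# All 48 half-hour slots of one (hypothetical) day.
slots <- seq(as.POSIXct("2025-08-14 00:00", tz = "UTC"),
             by = "30 min", length.out = 48)

# Observed data with two slots missing (the 5th and the 20th).
observed <- data.frame(time = slots[-c(5, 20)], value = seq_len(46))

# Join onto the complete grid: missing slots get NA, rows stay in time order.
full <- merge(data.frame(time = slots), observed, by = "time", all.x = TRUE)
full <- full[order(full$time), ]

print(sum(is.na(full$value)))
```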

*As an aside, whether <data, season, day> data is in the form of a single
data frame with an additional sensor ID column or a list of 70 frames, one
for each sensor,  is not really much of an issue these days, where
gigabytes and gigaflops are cheap and available, as it is trivial to
convert from one form to another as needed.
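For what it is worth, the conversion in both directions can be sketched in base R with split() and rbind(). The three-sensor data frame `long` is made up for the example.

```r
# Hypothetical long-form data: one frame with a sensor ID column.
long <- data.frame(sensor = rep(c("s1", "s2", "s3"), each = 4),
                   day    = rep(1:4, times = 3),
                   value  = seq_len(12))

# Single frame -> list of frames, one per sensor.
by_sensor <- split(long, long$sensor)

# List of frames -> single frame again.
back <- do.call(rbind, by_sensor)
rownames(back) <- NULL

print(names(by_sensor))
print(dim(back))
```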

Feel free to disagree -- I am just amplifying Rui's comment above; "what I
would presume" and "what I think" doesn't matter. What matters is Stefano's
response to his comment.

Cheers,
Bert
On Thu, Aug 14, 2025 at 10:54 AM Rui Barradas <ruipbarradas at sapo.pt> wrote:

            
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.