
About size of data frames

8 messages · Duncan Murdoch, Rui Barradas, Bert Gunter +4 more

#
Dear R-list users,

let me ask you a very general question about performance of big data frames.

I deal with semi-hourly meteorological data of about 70 sensors during 28 winter seasons.


It means that for each sensor I have 48 data points per day and 181 days per winter season (182 in leap years): 48 * 181 * 28 = 234,576

234,576 * 70 = 16,420,320

I am wondering whether to store everything in a single long-format data frame with a few columns, in a wide data frame with one column per sensor, or in a list of 70 per-sensor data frames. I personally would prefer the first choice, because it would be easier for me to deal with a single data frame and few columns.


Thank you for your usual help

Stefano


         (oo)
--oOO--( )--OOo--------------------------------------
Stefano Sofia MSc, PhD
Civil Protection Department - Marche Region - Italy
Meteo Section
Snow Section
Via Colle Ameno 5
60126 Torrette di Ancona, Ancona (AN)
Uff: +39 071 806 7743
E-mail: stefano.sofia at regione.marche.it
---Oo---------oO----------------------------------------

________________________________

IMPORTANT NOTICE: This e-mail message is intended to be received only by persons entitled to receive the confidential information it may contain. E-mail messages to clients of Regione Marche may contain information that is confidential and legally privileged. Please do not read, copy, forward, or store this message unless you are an intended recipient of it. If you have received this message in error, please forward it to the sender and delete it completely from your computer system. Pursuant to art. 2.4 of Annex 1 to DGR n. 74/2021, please note that in cases of necessity and urgency the reply to this e-mail message may be viewed by persons other than the addressee.
#
On 2025-08-14 7:27 a.m., Stefano Sofia via R-help wrote:
It really depends on what computations you're doing.  As a general rule, 
column operations are faster than row operations.  (Also as a general 
rule, arrays are faster than dataframes, but are much more limited in 
what they can hold:  all entries must be the same type, which probably 
won't work for your data.)
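
As a quick, hypothetical illustration of both rules (not from Duncan's message; timings depend heavily on machine and data):

```r
# Same numeric data held as a data frame and as a matrix (array).
m  <- matrix(rnorm(1e6), nrow = 1e4, ncol = 100)
df <- as.data.frame(m)

system.time(colMeans(df))        # column operation on a data frame: fast
system.time(apply(df, 1, mean))  # row operation on a data frame: much slower
system.time(rowMeans(m))         # same row operation on a matrix: fast again
```

`rowMeans()` on the matrix can stay vectorised precisely because every entry is the same type, which is the limitation mentioned above.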

So I'd guess your 3 column solution would likely be best.

Duncan Murdoch
#
On 8/14/2025 12:27 PM, Stefano Sofia via R-help wrote:
Hello,

First of all, 48 * 181 * 28 = 243,264, not 234,576.
And 243264 * 70 = 17,028,480.

As for the question, why don't you try it with smaller data sets?
In the test below I tested with the sizes you posted, and the
many-columns version (wide format) is fastest, then the list of data
frames, then the 4-column version (long format).
4 columns because it's sensor, day, season and data.
And the wide-format df is only 72 columns wide: one for day, one for
season and one for each sensor.

The test computes mean values aggregated by day and season. When the 
data is in the long format it must also include the sensor, so there is 
an extra aggregation column.

The test is very simple, real results probably depend on the functions 
you want to apply to the data.



# create the test data
makeDataLong <- function(sensor, x) {
   x[["data"]] <- rnorm(nrow(x))   # one simulated value per row
   cbind.data.frame(sensor, x)
}

makeDataWide <- function(sensor, x) {
   x[[sensor]] <- rnorm(nrow(x))
   x
}

set.seed(2025)

n_per_day <- 48
n_days <- 181
n_seasons <- 28
n_sensors <- 70

day <- rep(1:n_days, each = n_per_day)
season <- 1:n_seasons
sensor_names <- sprintf("sensor_%02d", 1:n_sensors)
df1 <- expand.grid(day = day, season = season, KEEP.OUT.ATTRS = FALSE)

df_list <- lapply(1:n_sensors, makeDataLong, x = df1)
names(df_list) <- sensor_names
df_long <- lapply(1:n_sensors, makeDataLong, x = df1) |>
   do.call(rbind, args = _)
df_wide <- df1
for(s in sensor_names) {
   df_wide <- makeDataWide(s, df_wide)
}


# test functions
f <- function(x) aggregate(data ~ season + day, data = x, mean)
g <- function(x) aggregate(data ~ sensor + season + day, data = x, mean)
h <- function(x) aggregate(. ~ season + day, x, mean)

# timings
bench::mark(
   list_base = lapply(df_list, f),
   long_base = g(df_long),
   wide_base = h(df_wide),
   check = FALSE
)



Hope this helps,

Rui Barradas
#
Rui, et al.:
"real results probably depend on the functions
you want to apply to the data."

Indeed!
I would presume that one would want to analyze such data as time series of
some sort, for which I think long form is inherently "more sensible". If
so, I also would think that you would want columns for data, sensor*,
season, and day, as Rui suggested.  However, note that this presumes no
missing data, which is usually wrong. To handle this, within each day the
rows would need to be in order of the hour the data was recorded (I assume
twice per hour) with a missing code when data was missing.
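
One base-R way to make those missing records explicit, sketched on toy data (the column names `time` and `data` are just assumptions for illustration): build the complete half-hourly grid and left-join the observations against it, so each gap becomes an NA in its proper time slot.

```r
# Toy observations for one sensor; the 00:30 reading is missing.
obs <- data.frame(
  time = as.POSIXct("2025-01-01 00:00", tz = "UTC") + c(0, 3600, 5400),
  data = c(1.2, 1.5, 1.9)
)

# Complete half-hourly grid for the period, then a left join:
grid <- data.frame(time = seq(min(obs$time), max(obs$time), by = "30 min"))
full <- merge(grid, obs, by = "time", all.x = TRUE)  # NA where data is missing
```

After the merge, `full` has one row per half hour, in time order, with NA as the missing-value code.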

*As an aside, whether <data, season, day> data is in the form of a single
data frame with an additional sensor ID column or a list of 70 frames, one
for each sensor,  is not really much of an issue these days, where
gigabytes and gigaflops are cheap and available, as it is trivial to
convert from one form to another as needed.

Feel free to disagree -- I am just amplifying Rui's comment above; "what I
would presume" and "what I think" doesn't matter. What matters is Stefano's
response to his comment.

Cheers,
Bert
On Thu, Aug 14, 2025 at 10:54 AM Rui Barradas <ruipbarradas at sapo.pt> wrote:
#
You know the old phrase that it is not the size that matters so much as how you use it?

The issue here is that lots of (sometimes temporary) copies and variations on your data may come into being as you manipulate the data. Sometimes R is fairly good at not making full copies of parts that can be common until one is modified. But something as simple as saving the results that match a query into another data.frame and then saving the ones that don't into another, ends up taking about twice as much room, at least until you rm() the original and let it be gradually garbage collected.

Sometimes you can aggressively remove things as needed, but who knows when intermediate results go away in the various forms of pipeline. There are many operations you can use in ways that create a new column versus trying to overwrite an existing column. The data used can grow quite a bit -- especially if, like me, you do odd things and store complex objects, like the results of a statistical calculation or a ggplot object ready to be printed, in a new column.

For very large data, plenty of people have created extensions as packages that may be worthwhile, but using a machine with enough memory may be reasonable enough. Otherwise, there are slower ploys in which you perform one or a few steps, write the results into a file, free most of the memory used, read the file back in, and so on.

And, for some purposes, it may make sense to just choose a reasonable and representative subset of the data to work with. What is reasonable depends on you and your situation. Can you get a reasonable snapshot of events by choosing one random measurement per day, for example?

Whatever you decide, I suggest you plan on graciously dealing with failure. Make a copy of your data and use that in your experiment so that any changes the program makes along the way do not create problems if the program fails for a reason like running out of memory. In that light, it might also be best to run it on a newly rebooted machine where as many applications as possible are not running simultaneously.

And, sometimes a solution is not to do it in R at all. Consider attaching to a database that can hold gigantic amounts of data and using R with a package that lets you run lots of queries on the database remotely. Only import the minimum needed onto your machine that must be done in R and put these results out of the way if they use lots of memory.
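
As a minimal sketch of that approach, assuming the DBI and RSQLite packages are installed (an in-memory SQLite database stands in here for a real server):

```r
# Keep the bulk of the data in a database; pull only aggregates into R.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")   # stand-in for a real server
dbWriteTable(con, "meteo",
             data.frame(sensor = rep(c("s01", "s02"), each = 3),
                        data   = c(1, 2, 3, 4, 5, 6)))

# The aggregation runs inside the database; only 2 rows come back to R.
res <- dbGetQuery(con, "SELECT sensor, AVG(data) AS mean_data
                        FROM meteo GROUP BY sensor")
dbDisconnect(con)
```

The same pattern scales to millions of rows: the table lives on disk (or on a server), and R only ever holds the small query results.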

Just some thoughts. Good Luck.


#
"Sensor" is too generic to be helpful in offering guidance here.

1) A temperature sensor and a wind speed sensor are both sensors... but the code used to analyze them is usually quite different. It is not usually advantageous to stack different kinds of data (though sometimes you may want to facet different types of data into a trend plot from a single data frame, so ymmv). Sometimes even the same sensor applied in different environments (temperature in a house vs outside air temperature) needs to be kept in separate columns.

2) Some analyses assume a sequential-in-time structure... the long organization you are contemplating likely has repeating time sequences... you will end up looking at segments of the values column and it may be easier to do that in wide form. 

3) Other analyses are independent of time... e.g. temperature greater than 30degC... a single column containing only those sensors is ideal for that.
On August 14, 2025 11:55:34 AM PDT, Bert Gunter <bgunter.4567 at gmail.com> wrote:
#
Great question, and one that touches on both performance and usability in R. Here's a breakdown of the trade-offs and recommendations.

You're comparing three data-structure strategies for handling ~17 million observations:

- Single long data frame (~17M rows × 3-4 columns): simple to manage, easy to filter/group, tidyverse-friendly; may require more memory, and row-wise operations are slower.
- Wide data frame (~243K rows × 72 columns): fast column-wise operations, good for matrix-style analysis; harder to reshape/filter, less tidy.
- List of 70 data frames (each ~243K rows × 3 columns): modular, parallel processing possible; more complex to manage, harder to aggregate or compare.
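
For scale, a rough back-of-envelope estimate of the long format's base footprint (using the corrected row count of 243,264 per sensor from earlier in the thread; actual usage will be higher once copies are made):

```r
n <- 48 * 181 * 28 * 70            # 17,028,480 rows in total

# One double for the value plus integer day and season columns:
bytes <- n * (8 + 4 + 4)
round(bytes / 1024^2)              # about 260 MB before any copies

round(n * 4 / 1024^2)              # an integer sensor id adds about 65 MB
```

So the full long-format frame is comfortably within RAM on any modern machine, as long as intermediate copies are kept under control.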

Performance Considerations
- Memory Efficiency: A single long data frame is generally more memory-efficient than a list of data frames, especially if column types are consistent.
- Vectorization: R is optimized for vectorized operations. A long format works well with dplyr, data.table, and tidyverse tools.
- Parallelism: If you plan to process each sensor independently, a list of data frames could allow parallel computation using future, furrr, or parallel.
- Reshaping Costs: Wide formats are fast for matrix-style operations but can be cumbersome when filtering by time, sensor, or value.

I'd stick with the single long-format data frame:
- It aligns with tidy data principles.
- It's easier to filter, group, and summarize.
- It integrates seamlessly with packages like ggplot2, dplyr, and data.table.

If performance becomes an issue:
- Consider converting to a data.table object (setDT(df)), which is highly optimized for large datasets.
- Use indexing and keys for faster filtering.
- Use arrow::read_parquet() or fst::write_fst() for fast disk I/O if you need to save/load frequently.
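
A minimal sketch of the data.table route on toy data, assuming the data.table package is installed (the column names here are illustrative only):

```r
# Convert a long-format data frame to data.table and filter via a key.
library(data.table)

df <- data.frame(sensor = rep(c("s01", "s02"), each = 4),
                 season = rep(1:2, times = 4),
                 data   = rnorm(8))

setDT(df)                      # converts in place, without copying
setkey(df, sensor, season)     # sorted index for binary-search filtering

df[.("s01", 1)]                # all rows for sensor s01, season 1
df[, .(mean_data = mean(data)), by = .(sensor, season)]  # grouped summary
```

On tens of millions of rows, the keyed subset and grouped summary are typically much faster than the base-R equivalents.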

If you're doing seasonal analysis, consider adding a season column. That way, you can easily group by sensor, season, and day without needing to split the data.


3 days later
#
Thanks to all of you.

It's great to interact with you, your comments are opportunities to learn more not only about the specific posted question, but also about many other related topics.

Most of the comments agree on the single long-format data frame, and Jeff's synthesis has been particularly interesting.

I run R on a server, which is well maintained and most likely faster than my PC.


The main variables I am dealing with are snow-pack height and daily snowfall amount; in support of these two measurements there are many other meteorological parameters (such as wind direction, wind speed, air temperature, theta-e air temperature, surface snow-pack temperature, incident radiation, reflected radiation).

The frequency of the new sensors is getting higher and higher (at the moment it is 10 minutes, and in case of emergency it can switch to 5 minutes!), and I spent a lot of effort to "normalize" the data to half-hourly frequency.
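
For what it's worth, one base-R way to do that normalisation, sketched on toy 10-minute data (the column names are assumptions for illustration):

```r
# Twelve 10-minute readings covering two hours.
t10 <- as.POSIXct("2025-01-01 00:00", tz = "UTC") +
       seq(0, by = 600, length.out = 12)
x <- data.frame(time = t10, data = 1:12)

x$bin <- cut(x$time, breaks = "30 min")              # half-hour bin labels
half_hourly <- aggregate(data ~ bin, data = x, mean) # one mean per bin
half_hourly$data                                     # 2, 5, 8, 11
```

Each half-hour bin averages its three 10-minute readings; other summaries (max, sum) drop in the same way.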


I use this data for several different purposes, the most important are

- graphical comparisons for manual validation (these comparisons may take into account different sensors of a single meteorological station or the same sensor for several meteorological stations)

- studying some regressions that may turn out to be important

- climatological studies


A single data frame is easy to handle; this is what I've been doing so far.

Yes, in a few years' time my initial data frame will pass 20M rows, and size will remain a concern.


Thank you again for everything

Stefano

