About size of data frames

On 8/14/2025 12:27 PM, Stefano Sofia via R-help wrote:
Hello,

First of all, 48 * 181 * 28 = 243,264, not 234,576.
And 243264 * 70 = 17,028,480.
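
The arithmetic is easy to confirm at the R prompt. This is only a quick check; the variable names are mine, not from your post.

```r
# readings per day * days * seasons = rows for one sensor
rows_per_sensor <- 48 * 181 * 28
rows_per_sensor    # 243264

# times the number of sensors = rows of the full long format
total_rows <- rows_per_sensor * 70
total_rows         # 17028480
```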

As for the question, why don't you try it with smaller data sets?
In the test below I have used the sizes you posted, and the data frame
with many columns (wide format) is the fastest, then the list of df's,
then the 4-column data frame (long format).
It has 4 columns because they are sensor, day, season and data.
And the wide format df is only 72 columns wide: one for day, one for
season and one for each of the 70 sensors.

The test computes mean values aggregated by day and season. When the
data is in the long format the aggregation must also group by sensor,
so there is an extra aggregation column.
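
To make the difference concrete, here is a toy sketch with made-up values (2 sensors, 2 days, 1 season, 2 readings per day). The names toy_wide and toy_long are hypothetical and are not used in the test further down.

```r
# wide format: one column per sensor
toy_wide <- data.frame(
   day      = c(1, 1, 2, 2),
   season   = c(1, 1, 1, 1),
   sensor_1 = c(1, 3, 5, 7),
   sensor_2 = c(2, 4, 6, 8)
)

# the same values in long format: one row per sensor and reading
toy_long <- data.frame(
   sensor = rep(c("sensor_1", "sensor_2"), each = 4),
   day    = rep(toy_wide$day, times = 2),
   season = rep(toy_wide$season, times = 2),
   data   = c(toy_wide$sensor_1, toy_wide$sensor_2)
)

# wide: every non-grouping column is averaged, one result column per sensor
agg_wide <- aggregate(. ~ season + day, data = toy_wide, FUN = mean)

# long: sensor must be a grouping variable, otherwise the sensors are pooled
agg_long <- aggregate(data ~ sensor + season + day, data = toy_long, FUN = mean)
```

agg_wide has one row per season/day combination and one column per sensor, while agg_long has one row per sensor/season/day combination.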

The test is very simple; real results will probably depend on the
functions you want to apply to the data.



# create the test data
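# long format: add a column of random readings plus the sensor id
makeDataLong <- function(sensor, x) {
   # use the rows of the argument x, not of the global df1
   x[["data"]] <- rnorm(nrow(x))
   cbind.data.frame(sensor, x)
}

# wide format: add one column of random readings named after the sensor
makeDataWide <- function(sensor, x) {
   x[[sensor]] <- rnorm(nrow(x))
   x
}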

set.seed(2025)

n_per_day <- 48
n_days <- 181
n_seasons <- 28
n_sensors <- 70

day <- rep(1:n_days, each = n_per_day)
season <- 1:n_seasons
sensor_names <- sprintf("sensor_%02d", 1:n_sensors)
df1 <- expand.grid(day = day, season = season, KEEP.OUT.ATTRS = FALSE)

df_list <- lapply(1:n_sensors, makeDataLong, x = df1)
names(df_list) <- sensor_names
df_long <- lapply(1:n_sensors, makeDataLong, x = df1) |>
   do.call(rbind, args = _)
df_wide <- df1
for(s in sensor_names) {
   df_wide <- makeDataWide(s, df_wide)
}


# test functions
f <- function(x) aggregate(data ~ season + day, data = x, mean)
g <- function(x) aggregate(data ~ sensor + season + day, data = x, mean)
h <- function(x) aggregate(. ~ season + day, x, mean)

# timings (needs the 'bench' package installed)
bench::mark(
   list_base = lapply(df_list, f),
   long_base = g(df_long),
   wide_base = h(df_wide),
   check = FALSE
)
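
If you want to try it with smaller data first, something along these lines could be a starting point. It is only a sketch under my own assumptions: the reduced sizes and the names grid, small_wide, small_long, small_list and agg_one are made up for illustration, and it uses system.time() from base R instead of bench. Unlike the test above, the three layouts hold the same values here, so the results can be compared.

```r
# scaled-down sizes, assumed only for illustration
set.seed(2025)
n_per_day <- 4
n_days <- 10
n_seasons <- 3
n_sensors <- 5

grid <- expand.grid(
   day = rep(seq_len(n_days), each = n_per_day),
   season = seq_len(n_seasons),
   KEEP.OUT.ATTRS = FALSE
)
sensor_names <- sprintf("sensor_%02d", seq_len(n_sensors))

# wide: one column per sensor
small_wide <- grid
for (s in sensor_names) small_wide[[s]] <- rnorm(nrow(grid))

# long: stack the wide sensor columns, so all layouts hold the same values
small_long <- do.call(rbind, lapply(sensor_names, function(s) {
   data.frame(sensor = s, grid, data = small_wide[[s]])
}))

# list: one long data frame per sensor
small_list <- split(small_long, small_long$sensor)

agg_one <- function(x) aggregate(data ~ season + day, data = x, FUN = mean)

t_list <- system.time(res_list <- lapply(small_list, agg_one))
t_long <- system.time(
   res_long <- aggregate(data ~ sensor + season + day,
                         data = small_long, FUN = mean)
)
t_wide <- system.time(
   res_wide <- aggregate(. ~ season + day, data = small_wide, FUN = mean)
)

# elapsed seconds per layout
c(list = t_list[["elapsed"]],
  long = t_long[["elapsed"]],
  wide = t_wide[["elapsed"]])
```

At these sizes the timings will be close to zero, so increase the n_* values step by step until the differences between the layouts become visible.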



Hope this helps,

Rui Barradas