Hi R-helpers,
Does anyone know why adding which() makes the select call more
efficient than just using logical selection in a dataframe? Doesn't
which() technically add another conversion/function call on top of the
logical selection? Here is a reproducible example with a slight
difference in timing.
# Surrogate data - the timing here isn't interesting
urltext <- paste("https://drive.google.com/",
                 "uc?id=1AZ-s1EgZXs4M_XF3YYEaKjjMMvRQ7",
                 "-h8&export=download", sep="")
download.file(url=urltext, destfile="tempfile.csv") # download file first
dat <- read.csv("tempfile.csv", stringsAsFactors = FALSE, header=TRUE,
                nrows=2.5e6) # read the file; 'nrows' is a slight
                             # overestimate
dat <- dat[,1:3] # select just the first 3 columns
head(dat, 10) # print the first 10 rows
# Select using which() as the final step ~90ms total time on my MacBook Air
system.time(
  head(
    dat[which(dat$gender2=="other"), ]),
  gcFirst=TRUE)
# Select skipping which() ~130ms total time
system.time(
  head(
    dat[dat$gender2=="other", ]),
  gcFirst=TRUE)
Now I would think that the second one, without which(), would be more
efficient. However, every time I run these, the first version, with
which(), is faster by about 20ms of system time and 20ms of
user time. Does anyone know why this is?
Cheers!
Keith
which() vs. just logical selection in df
2 messages · kMan, Greg Snow
1 day later
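I would suggest using the microbenchmark package for the timing comparison. It runs each expression many times, which gives a more meaningful comparison than a single system.time() call. One possible reason for the difference is the number of missing values in your data (along with the number of columns): logical indexing returns an NA result for every NA in the index, while which() drops the NAs and keeps only the TRUE positions. Consider the difference in the following results: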
x <- c(1,2,NA)
x[x==1]
[1] 1 NA
x[which(x==1)]
[1] 1
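The same effect can be sketched on a data frame. This is only an illustration under stated assumptions: the data below are synthetic (the 'id' and 'score' columns, the sample size and the share of NAs are made up, and only the 'gender2' name comes from the original post), so the timings will not match the real file. It shows that logical selection adds one all-NA row per missing value, which is extra work that which() avoids and which may account for part of the timing difference. The microbenchmark call runs only if that package is installed.

```r
# Synthetic stand-in for the poster's data (hypothetical values, not the real file)
set.seed(1)
n <- 1e5
dat <- data.frame(
  id      = seq_len(n),
  gender2 = sample(c("female", "male", "other", NA), n, replace = TRUE),
  score   = rnorm(n),
  stringsAsFactors = FALSE
)

lgl <- dat$gender2 == "other"      # logical vector holding TRUE, FALSE and NA
by_logical <- dat[lgl, ]           # each NA in the index yields a row filled with NA
by_which   <- dat[which(lgl), ]    # which() keeps only the TRUE positions

n_na <- sum(is.na(lgl))
# The logical version carries one extra all-NA row per missing value
print(nrow(by_logical) - nrow(by_which) == n_na)

# Repeated timings; skipped when microbenchmark is not installed
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    logical_only = dat[dat$gender2 == "other", ],
    with_which   = dat[which(dat$gender2 == "other"), ],
    times = 20
  ))
}
```

Note that the two selections are not interchangeable when NAs are present: the version without which() returns a larger result, so the comparison measures different amounts of work rather than the cost of which() alone.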
On Sat, Oct 10, 2020 at 5:25 PM 1/k^c <kchamberln at gmail.com> wrote:
[...]
Gregory (Greg) L. Snow Ph.D. 538280 at gmail.com