Dear All,

I have the following for-loop code, which is intended to read in many Excel files (each with many columns and rows) in a directory, extract some rows and columns from each file, and then combine them into a data frame. The for loop does the work but is quite slow. How can I make it faster using the lapply function? Thanks in advance!
temp.df <- c()  # create an empty list to store the extracted result from each excel file

for (i in list.files()) {  # loop through each excel file in the directory
  # read_xlsx() from package "readxl"
  temp <- read_xlsx(i, sheet = 1, range = cell_cols(c(1, 30, 38:42)))
  # extract rows based on temp$Id
  temp <- temp[grep("^geneA$|^geneB$|^geneC$", temp$Id), ]
  # clean up column names
  names(temp) <- gsub("^.*] ", "", names(temp))
  # put the data frame in a list so it can be appended to the list
  temp.df <- append(temp.df, list(as.data.frame(temp)))
  # if this is the last excel file, combine all the rows in the list into
  # one data frame (they all have the same column names)
  if (i == list.files()[length(list.files())]) {
    temp.df.all <- do.call("rbind", temp.df)
    write_xlsx(temp.df.all, path = "output.xlsx")  # write_xlsx() from package writexl
  }
}
*Stephen*
how to make the code more efficient using lapply
4 messages · Stephen HonKit Wong, Eric Berger, MacQueen, Don +1 more
Hi Stephen,
I am not sure that the "for loop" is the source of slowness.
You seem to be doing a lot of unnecessary work each time through the loop.
E.g. there is no need to check if it's the last file; just move that section outside of the loop, and it will be executed when the loop finishes.
As it is, you are calling list.files() each time through the loop, which could be slow.
In any case here's a possible way to do it. Warning: untested!
f <- function(fn) {
  temp <- read_xlsx(fn, sheet = 1, range = cell_cols(c(1, 30, 38:42)))
  temp[temp$Id %in% c("geneA", "geneB", "geneC"), ]
}

myL <- lapply(X = list.files(), FUN = f)
temp.df.all <- do.call("rbind", myL)
names(temp.df.all) <- gsub("^.*] ", "", names(temp.df.all))
write_xlsx(temp.df.all, path = "output.xlsx")
HTH,
Eric
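The shape of that approach can be exercised without any Excel files at all. In the sketch below, read_one() is a made-up stand-in for read_xlsx(), just to show the lapply() plus do.call("rbind") pattern in a self-contained way:

```r
# read_one() is a hypothetical stand-in for read_xlsx(): it returns one
# "file's" data frame; nchar(fn) just gives each file a distinguishable value
read_one <- function(fn) {
  data.frame(Id = c("geneA", "geneX"), value = nchar(fn))
}

files <- c("a.xlsx", "bb.xlsx")

# one data frame per file, filtered to the rows of interest
myL <- lapply(files, function(fn) {
  temp <- read_one(fn)
  temp[temp$Id %in% c("geneA", "geneB", "geneC"), ]
})

# stack all the pieces once, after the loop
res <- do.call("rbind", myL)
nrow(res)  # 2: one matching geneA row per file
```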
Eric's approach seems reasonable to me, and I agree that it's probably not the use of a "for" loop that makes the original version slow. As Eric mentioned, there are lots of unnecessary things happening in the loop.
For example, list.files() was called twice inside the loop, which is unnecessary, and will definitely slow the loop down (though probably not by much). Call it outside the loop, save the results in a vector, and use the vector inside the loop.
Here's another way (also untested).
infiles <- list.files()
nfiles <- length(infiles)

## read the first file
dfall <- read_xlsx(infiles[1], sheet = 1, range = cell_cols(c(1, 30, 38:42)))
dfall <- dfall[dfall$Id %in% c("geneA", "geneB", "geneC"), ]

## I'm going to assume the colnames are all the same on input;
## if that's wrong, then they have to be fixed inside the loop

## read the remaining files, appending their contents each time
for (ifl in 2:nfiles) {
  temp  <- read_xlsx(infiles[ifl], sheet = 1, range = cell_cols(c(1, 30, 38:42)))
  dfall <- rbind(dfall, temp[temp$Id %in% c("geneA", "geneB", "geneC"), ])
}

## fix the colnames here
## write the output file here
In Eric's approach (which I have sometimes used), all of the input data frames are stored in myL.
This has some benefit, but strictly speaking it is not necessary (and the extra memory would not be a concern unless the files are huge).
In my alternative approach, the contents of each input file are discarded after they have been appended to the previous ones.
The data frame (dfall) is enlarged at each iteration. Many times I have seen recommendations against this.
The help page for read_xlsx says it returns a tibble. I've never had a need to learn about tibbles, but there may be some overhead in creating a tibble instead of a data frame that is slowing down the loop. There are other packages that read Excel files that create data frames. Maybe they would be faster. I don't see anything in the original question to indicate that tibbles are needed.
Potentially important: are any of your columns character strings in the Excel file? Are they being converted to factors by read_xlsx()? If so, I would suggest preventing that.
Going back to the original code, the correct way to initialize a list is not
temp.df<-c() # create an empty list to store the extracted result from each excel file inside for-loop
Instead, use
temp.df <- vector("list", nfiles)
However, in that case, one would not append new elements to the list inside the loop. One would assign them to existing elements using an element index.
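For instance (a self-contained sketch, with a made-up data frame standing in for each file's extracted rows):

```r
nfiles <- 3

# pre-allocate the list: one slot per file, filled by index
temp.df <- vector("list", nfiles)

for (i in seq_len(nfiles)) {
  # stand-in for the read_xlsx() + row-extraction step
  temp.df[[i]] <- data.frame(Id = c("geneA", "geneB"), value = c(i, i * 10))
}

# combine once, after the loop, instead of growing anything inside it
temp.df.all <- do.call("rbind", temp.df)
nrow(temp.df.all)  # 6: two rows from each of the three "files"
```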
Regarding this
temp.df<-append(temp.df, list(as.data.frame(temp))) # change the dataframe to list, so it can be append to list.
The original question described temp as a data frame, so using as.data.frame(temp) would do nothing, and probably makes the loop take a little longer. But if temp is a tibble, perhaps this step is needed (and if needed, takes a little bit of time, thus slowing the loop).
The expression
list(as.data.frame(temp))
does not convert temp to a list. It puts temp into the first element of a new list.
If temp.df is a list with some number of elements, then a simpler way to append a new element would be
temp.df <- c( temp.df, list( temp) )
append() is a (slightly?) more complex function than c(), so might be slower.
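A toy illustration of why the list() wrapper matters when appending:

```r
temp.df <- list()
temp <- data.frame(Id = "geneA", value = 1)

# list(temp) makes the whole data frame one new element of temp.df;
# without the wrapper, c() would splice in temp's columns individually
temp.df <- c(temp.df, list(temp))
temp.df <- c(temp.df, list(temp))

length(temp.df)              # 2 elements, each a complete data frame
is.data.frame(temp.df[[1]])  # TRUE
```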
If all of the data is numeric, or all of it is character, and the loop still takes too long with Eric's version or mine, there might be a speedup from converting to a matrix and using rbind() on matrices. Data frames have some overhead that matrices don't, especially if factors are involved.
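For example (a minimal sketch with made-up all-numeric data):

```r
df1 <- data.frame(a = 1:2, b = 3:4)
df2 <- data.frame(a = 5:6, b = 7:8)

# all-numeric data can be carried as matrices; the matrix method of rbind()
# avoids the per-column bookkeeping that rbind.data.frame() performs
m <- rbind(as.matrix(df1), as.matrix(df2))
dim(m)  # 4 2
```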
-Don
--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
Lab cell 925-724-7509
... and, of course, the original premise is false. apply()-type statements **are** loops at the interpreter level and are typically not appreciably faster -- sometimes even a bit slower -- than explicit for() loops. Their chief advantage is adherence to a functional programming paradigm, which for many of us means greater code clarity.

Cheers,
Bert

Bert Gunter
"The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
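Bert's claim above is easy to check directly; here is a toy timing comparison using system.time() (illustrative only -- timings vary by machine and R version):

```r
x <- as.list(1:100000)

# explicit for loop, writing into a pre-allocated list
t_for <- system.time({
  out1 <- vector("list", length(x))
  for (i in seq_along(x)) out1[[i]] <- x[[i]]^2
})

# the same work done via lapply()
t_lapply <- system.time({
  out2 <- lapply(x, function(v) v^2)
})

identical(out1, out2)  # TRUE; the two timings are usually comparable
```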
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.