Hello R-world,
Please help me sort out a little mess.
I have a data.frame in which I would like to set some values to NA for a
later imputation step.
I've come up with the following piece of code:
random.del <- function (x, n.keeprows, del.percent){
  n.items <- ncol(x)
  k <- n.items*(del.percent/100)
  x.del <- x
  for (i in (n.keeprows+1):nrow(x)){
    j <- sample(1:n.items, k)
    x.del[i,j] <- NA
  }
  return (x.del)
}
The problem is that random.del turns out to be slow on huge samples.
Is there a more efficient or more elegant way to do the same thing?
Thanks,
Sergey
--
View this message in context: http://r.789695.n4.nabble.com/How-to-erase-replace-certain-elements-in-the-data-frame-tp3470883p3470883.html
Sent from the R help mailing list archive at Nabble.com.
How to erase (replace) certain elements in the data.frame?
6 messages · sneaffer, Joshua Wiley, Thomas Levine
This should do the same thing
random.del <- function (x, n.keeprows, del.percent){
  del<-function(col){
    col[sample.int(length(col),length(col)*del.percent/100)]<-NA
    col
  }
  change<-n.keeprows:nrow(x)
  x[change,]<-lapply(x[change,],del)
  x
}
This is faster because it's vectorized.
[1] "Mine"
   user  system elapsed
  0.004   0.000   0.002
[1] "Yours"
   user  system elapsed
  1.172   0.020   1.193
Tom
On Sat, Apr 23, 2011 at 8:37 PM, sneaffer <sneaffer at mail.ru> wrote:
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hi Sergey,
This is not an answer to your exact question, but can you use a matrix?
If you can use a matrix instead of a data frame, you should get a
considerable performance boost. Even for very large matrices (at least
on my system), it is fast enough that I find it hard to believe it is a
bottleneck in the overall imputation process. For example, for a 1000
by 100 object as a data frame:
system.time(r0 <- random.del(mat, 100, 50))
   user  system elapsed
   1.09    0.02    1.12
and as a matrix:
system.time(r0 <- random.del(mat, 100, 50))
   user  system elapsed
   0.02    0.00    0.01
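
A rough way to reproduce this kind of comparison is sketched below. It assumes every column is numeric, so that a matrix can hold the data without coercing anything to character. The object names `dat` and `mat` and the sizes are illustrative, and the timings will differ from machine to machine.

```r
# Sergey's function, with the same behaviour as posted
random.del <- function(x, n.keeprows, del.percent) {
  n.items <- ncol(x)
  k <- n.items * (del.percent / 100)
  x.del <- x
  for (i in (n.keeprows + 1):nrow(x)) {
    j <- sample(1:n.items, k)
    x.del[i, j] <- NA
  }
  x.del
}

set.seed(1)
mat <- matrix(rnorm(1000 * 100), nrow = 1000)  # illustrative numeric data
dat <- as.data.frame(mat)                      # same values as a data frame

t.df  <- system.time(r.df  <- random.del(dat, 100, 50))
t.mat <- system.time(r.mat <- random.del(mat, 100, 50))
print(t.df)
print(t.mat)

# Convert back if later steps expect a data frame
r.back <- as.data.frame(r.mat)
```

If the data frame mixes numeric and character or factor columns, as.matrix() would turn everything into character, so this shortcut only applies to homogeneous data.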
Beyond that, for very large objects, this revision gives a slight
performance increase (around 5 seconds for a 1 million row by 100
column object on my system), which is small for matrices and
completely dwarfed by other bottlenecks for data frames, at the cost
of readability/flexibility:
rdel <- function (x, n.keeprows, del.percent){
  n.items <- ncol(x)
  k <- as.integer(n.items * del.percent / 100)
  cols <- 1:n.items
  lcols <- length(cols)
  for (i in (n.keeprows+1):nrow(x)){
    j <- cols[.Internal(sample(lcols, k, FALSE, NULL))]
    x[i,j] <- NA
  }
  return(x)
}
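
For matrices, the per-row assignment can also be avoided entirely by collecting all (row, column) pairs first and assigning NA once through a two-column index matrix. The following is only a sketch of that idea; the function name `rdel.index` is hypothetical and does not come from this thread.

```r
# Hypothetical helper: row-wise random deletion on a matrix with a
# single assignment at the end.
rdel.index <- function(x, n.keeprows, del.percent) {
  stopifnot(is.matrix(x), n.keeprows < nrow(x))
  k <- as.integer(ncol(x) * del.percent / 100)
  if (k == 0L) return(x)
  rows <- (n.keeprows + 1):nrow(x)
  # For each row to change, draw k distinct column numbers.
  # The result holds k column numbers per element of rows.
  cols <- vapply(rows, function(i) sample.int(ncol(x), k), integer(k))
  # as.vector() walks cols draw by draw, so repeating each row
  # number k times pairs it with its own k sampled columns.
  idx <- cbind(rep(rows, each = k), as.vector(cols))
  x[idx] <- NA
  x
}

set.seed(42)
m <- matrix(1, nrow = 20, ncol = 10)
r <- rdel.index(m, 5, 30)
rowSums(is.na(r))
```

Sampling is still done once per row, so every row below n.keeprows loses exactly k cells, as in the loop version. Only the assignment is done in one step.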
If you must use a data frame, you can gain some performance increase
(for a 10000 by 100 data frame, it takes about 30 seconds on my system
versus 40 for your original function) by using:
random.del2 <- function (x, n.keeprows, del.percent){
  n.items <- ncol(x)
  k <- n.items*(del.percent/100)
  for (i in (n.keeprows+1):nrow(x)){
    j <- sample(1:n.items, k)
    # call the data frame method directly; its result must be assigned back to x
    x <- `[<-.data.frame`(x, i, j, NA)
  }
  return(x)
}
which basically just saves R the trouble of figuring out which
assignment method to use. Of course the problem is that your function
becomes extremely specialized. If you pass anything to it but a data
frame, good things will not happen.
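
One detail to keep in mind when calling the method directly: like every replacement function in R, `[<-.data.frame` returns the modified data frame rather than changing its argument in place, so the result has to be assigned. A small illustration with made-up data (the names `d`, `ignored`, `d2` and `d3` exist only for this example):

```r
d <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# Direct call, result stored elsewhere: d itself is left untouched
ignored <- `[<-.data.frame`(d, 2, c(1, 3), NA)
any(is.na(d))

# Direct call with the result kept ...
d2 <- `[<-.data.frame`(d, 2, c(1, 3), NA)

# ... compared with the usual replacement syntax on a copy
d3 <- d
d3[2, c(1, 3)] <- NA
identical(d2, d3)
```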
Cheers,
Josh
Joshua Wiley Ph.D. Student, Health Psychology University of California, Los Angeles http://www.joshuawiley.com/
Thanks a lot, guys.
Thomas, your method is great, precisely the thing I've been looking for.
Oh dear, how I love R for those list comprehension tricks!
--
View this message in context: http://r.789695.n4.nabble.com/How-to-erase-replace-certain-elements-in-the-data-frame-tp3470883p3471380.html
Sent from the R help mailing list archive at Nabble.com.
On Sat, Apr 23, 2011 at 11:35 PM, Thomas Levine <thomas.levine at gmail.com> wrote:
> This should do the same thing
Did you actually test it? I get very different things.
> random.del <- function (x, n.keeprows, del.percent){
>   del<-function(col){
>     col[sample.int(length(col),length(col)*del.percent/100)]<-NA
>     col
>   }
>   change<-n.keeprows:nrow(x)
>   x[change,]<-lapply(x[change,],del)
but a data frame is a list of vectors column wise, while Sergey's function went row by row. However, using sample.int() is a much better idea than what I did with sample().
>   x
> }
> This is faster because it's vectorized.
but in such a way that you cannot guarantee the same number of cells
are missing from each row. Try:
rowSums(is.na(r.mine))  # r.mine: whatever your ("Mine") function returned
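
To make the difference concrete, here is a small sketch that applies a column-wise deletion (which is what lapply() over a data frame does) and then counts the missing cells per column and per row. The data frame `d` and the helper `del` are made up for illustration.

```r
set.seed(123)
d <- as.data.frame(matrix(rnorm(20 * 6), nrow = 20))

# Column-wise deletion: blank out a fixed share of each column
del <- function(col, del.percent = 50) {
  col[sample.int(length(col), length(col) * del.percent / 100)] <- NA
  col
}
by.col <- d
by.col[] <- lapply(d, del)   # lapply() visits the columns, not the rows

colSums(is.na(by.col))  # identical count in every column
rowSums(is.na(by.col))  # not constrained, so it can vary by row
```

The column totals are fixed by construction, while the row totals depend on where the column draws happen to overlap.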
Joshua Wiley Ph.D. Student, Health Psychology University of California, Los Angeles http://www.joshuawiley.com/
As Joshua said, mine was indeed different from yours. And it didn't
work on non-numeric data. But this one seems to work right:
random.del_vec <- function (x, n.keeprows, del.percent){
  del<-function(notkeep){
    k<-floor(length(notkeep)*del.percent/100)
    notkeep[sample.int(length(notkeep),k)]<-NA
    notkeep
  }
  change<-(n.keeprows+1):nrow(x)
  x[change,]<-t(apply(x[change,],1,del))
  x
}
On the other hand, maybe you really didn't want the stratification by row.
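
If stratification by row is not needed, one possible sketch is to treat all rows below n.keeprows as a single pool of cells and blank out a fixed share of that pool in one step. The function name `random.del_pool` is hypothetical, and the approach assumes that an uneven number of NAs per row is acceptable.

```r
# Hypothetical helper: delete del.percent of all cells below
# n.keeprows, without fixing the count per row.
random.del_pool <- function(x, n.keeprows, del.percent) {
  stopifnot(n.keeprows < nrow(x))
  change <- (n.keeprows + 1):nrow(x)
  n.change <- length(change)
  n.cells <- n.change * ncol(x)
  k <- floor(n.cells * del.percent / 100)
  # Draw k distinct cell positions from the pool, then translate
  # each position into a (row, column) pair.
  pos <- sample.int(n.cells, k)
  idx <- cbind(change[(pos - 1) %% n.change + 1],
               (pos - 1) %/% n.change + 1)
  x[idx] <- NA   # two-column index: one (row, column) pair per line
  x
}

set.seed(7)
d <- as.data.frame(matrix(1:60 + 0, nrow = 10))
r <- random.del_pool(d, 4, 25)
rowSums(is.na(r))
```

Because the positions are drawn without replacement from the whole pool, the total number of NAs is fixed while the per-row counts are left to chance.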
Tom
On Sun, Apr 24, 2011 at 8:31 AM, sneaffer <sneaffer at mail.ru> wrote: