Hi R users,
I really need help with subsetting data frames:
I have a large database of medical records and I want to be able to match
patterns from a list of search terms.
I've used this simplified data frame in a previous example:
db <- structure(list(ind = c("ind1", "ind2", "ind3", "ind4"), test1 = c(1,
2, 1.3, 3), test2 = c(56L, 27L, 58L, 2L), test3 = c(1.1, 28,
9, 1.2)), .Names = c("ind", "test1", "test2", "test3"), class =
"data.frame", row.names = c(NA,
-4L))
terms_include <- c("1","2","3")
terms_exclude <- c("1.1","1.2","1.3")
So in this example I want to keep every row that contains a term from
terms_include, as long as no term from terms_exclude occurs in the same row
of the data frame.
Previously I was given this function which works very well if you want to
match exactly:
f <- function(x) !any(x %in% terms_exclude) && any(x %in% terms_include)
db[apply(db[, -1], 1, f), ]
   ind test1 test2 test3
2 ind2     2    27  28.0
4 ind4     3     2   1.2
I would like to know if there is a way to write a similar function that
looks for matches that start with the query string: as in
grepl("^pattern",x)
I started writing a function but am not sure how to get it to return the
dataframe or matrix:
for (i in 1:length(terms_include)){
db_new <- apply(db,2, grepl,pattern=i)
}
Running this loop gives me:
db_new <- structure(c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), .Dim = c(4L,
4L), .Dimnames = list(NULL, c("ind", "test1", "test2", "test3"
)))
So the above searches for the pattern anywhere in each value instead of
just at the beginning of the string.
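A note on the loop above: `pattern=i` passes the loop index (1, 2, 3) rather than the search term, each iteration overwrites `db_new`, and nothing anchors the pattern to the start of the string. The following is a minimal sketch of one way to get a logical matrix of prefix matches. It assumes the terms contain no regular-expression metacharacters and that each column is converted with as.character() first; the helper name `prefix_matrix` is made up for illustration.

```r
db <- data.frame(
  ind   = c("ind1", "ind2", "ind3", "ind4"),
  test1 = c(1, 2, 1.3, 3),
  test2 = c(56L, 27L, 58L, 2L),
  test3 = c(1.1, 28, 9, 1.2),
  stringsAsFactors = FALSE
)
terms_include <- c("1", "2", "3")

# Hypothetical helper: TRUE where a cell starts with any of the terms.
prefix_matrix <- function(df, terms) {
  # Anchor the alternation with "^" so only the start of a cell can match.
  # Assumption: the terms hold no regex metacharacters that need escaping.
  pat <- paste0("^(", paste(terms, collapse = "|"), ")")
  # One logical vector per column; as.character() converts each value
  # on its own, so no padding is added to the numeric columns.
  vapply(df, function(col) grepl(pat, as.character(col)), logical(nrow(df)))
}

# Drop the identifier column before matching.
db_new <- prefix_matrix(db[, -1], terms_include)
db_new
```

The result has one row per row of `db` and one column per tested column, so `rowSums(db_new) > 0` could then be used to index the data frame.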
How would I look for the terms to include, using partial matching, but
drop a row of the data frame if it also contains one of the terms to
exclude?
I hope that this makes sense.
Many thanks,
Natalie
-----
Natalie Van Zuydam
PhD Student
University of Dundee
nvanzuydam at dundee.ac.uk
--
View this message in context: http://r.789695.n4.nabble.com/Subsetting-a-data-frame-tp4160127p4160127.html
Sent from the R help mailing list archive at Nabble.com.
Subsetting a data frame
2 messages · natalie.vanzuydam, jim holtman
Does this do what you want?
db <- structure(list(ind = c("ind1", "ind2", "ind3", "ind4"), test1 = c(1,
2, 1.3, 3), test2 = c(56L, 27L, 58L, 2L), test3 = c(1.1, 28,
9, 1.2)), .Names = c("ind", "test1", "test2", "test3"), class =
"data.frame", row.names = c(NA,
-4L))
terms_include <- c("1","2","3")
terms_exclude <- c("1.1","1.2","1.3")
f.match <- function(obj, inc, exc){
    pat <- paste("^(", paste(inc, collapse = "|"), ")", sep = '')
    patex <- paste(exc, collapse = "|")
    isMatch <- apply(obj, 1, function(x) any(grepl(pat, x)))
    notMatch <- !apply(obj, 1, function(x) any(grepl(patex, x)))
    obj[isMatch & notMatch,]
}
db
   ind test1 test2 test3
1 ind1   1.0    56   1.1
2 ind2   2.0    27  28.0
3 ind3   1.3    58   9.0
4 ind4   3.0     2   1.2
f.match(db, terms_include, terms_exclude)
   ind test1 test2 test3
2 ind2     2    27    28
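Two caveats may be worth keeping in mind with the function above. apply() on a data frame goes through as.matrix(), which formats each numeric column to a common width, so a value such as 2 in test2 is seen as " 2" and will not match "^2". The exclude pattern is also unanchored, and "." in a regular expression matches any character. Below is a hedged sketch of a variant that converts each column with as.character() and compares literal prefixes with startsWith(), which needs R 3.3.0 or later. The name `f.prefix` and its `cols` argument are hypothetical, and the sketch assumes non-empty term vectors and no missing values.

```r
db <- data.frame(
  ind   = c("ind1", "ind2", "ind3", "ind4"),
  test1 = c(1, 2, 1.3, 3),
  test2 = c(56L, 27L, 58L, 2L),
  test3 = c(1.1, 28, 9, 1.2),
  stringsAsFactors = FALSE
)
terms_include <- c("1", "2", "3")
terms_exclude <- c("1.1", "1.2", "1.3")

# Hypothetical variant of f.match(): literal prefix matching, column by column.
f.prefix <- function(obj, inc, exc, cols = names(obj)[-1]) {
  # TRUE for each row where any cell in `cols` starts with any of `terms`.
  row_has_prefix <- function(terms) {
    per_col <- lapply(obj[cols], function(col) {
      col <- as.character(col)  # no padding, unlike as.matrix() on mixed types
      # startsWith() compares literally, so "." is not a wildcard here.
      Reduce(`|`, lapply(terms, function(term) startsWith(col, term)))
    })
    Reduce(`|`, per_col)
  }
  obj[row_has_prefix(inc) & !row_has_prefix(exc), , drop = FALSE]
}

res <- f.prefix(db, terms_include, terms_exclude)
res
```

On the sample data this keeps only the ind2 row, the same row that f.match() returns. By default the first column is treated as an identifier and skipped; pass `cols` to search a different set of columns.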
On Mon, Dec 5, 2011 at 6:32 AM, natalie.vanzuydam <nvanzuydam at gmail.com> wrote:
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.