Hi,
You could try this for multiple intersect:
?dt[Reduce(function(...) intersect(...), list(grep(par.fund,fund),grep(par.func,func),grep(par.obj,obj))),sum(amount),by=c('code','year')]
#?? code year???? V1
#1: 1001 2011 123528
#2: 1001 2012? 97362
#3: 1002 2011 103811
#4: 1002 2012? 97179
?dt[intersect(intersect(grep(par.fund, fund), grep(par.func, func)),
???????????????????? grep(par.obj, obj)),
?????????? sum(amount), by=c('code', 'year')]
?#? code year???? V1
#1: 1001 2011 123528
#2: 1001 2012? 97362
#3: 1002 2011 103811
#4: 1002 2012? 97179
A.K.
----- Original Message -----
From: "Bush, Daniel P. DPI" <Daniel.Bush at dpi.wi.gov>
To: "'r-help at r-project.org'" <r-help at r-project.org>
Cc: 'William Dunlap' <wdunlap at tibco.com>; "'smartpink111 at yahoo.com'" <smartpink111 at yahoo.com>; "'djmuser at gmail.com'" <djmuser at gmail.com>
Sent: Friday, March 15, 2013 10:06 AM
Subject: RE: Grep with wildcards across multiple columns
I think the way I set up my sample data without any explanation confused things slightly. These data might make things clearer:
# Create fake data
df <- data.frame(code? = c(rep(1001, 8), rep(1002, 8)),
? ? ? ? ? ? ? ? year? = rep(c(rep(2011, 4), rep(2012, 4)), 2),
? ? ? ? ? ? ? ? fund? = rep(c("10E", "27E", "27E", "29E"), 4),
? ? ? ? ? ? ? ? func? = rep(c("110000", "122000", "214000", "158000"), 4),
? ? ? ? ? ? ? ? obj? ? = rep(c("100", "100", "210", "220"), 4),
? ? ? ? ? ? ? ? amount = round(rnorm(16, 50000, 10000)))
These are financial data with a hierarchical account structure where a zero represents a summary account that rolls up all the accounts at subsequent digits (e.g. 100000 rolls up 110000, 122000, 158000, etc.). I was trying to do two things with the search parameters: turn zeroes into question marks, and duplicate the functionality of a SQL query using those question marks as wildcards:
# Set parameters
par.fund <- "20E"; par.func <- "100000"; par.obj <- "000"
par.fund <- glob2rx(gsub("0", "?", par.fund))
par.func <- glob2rx(gsub("0", "?", par.func))
par.obj <- glob2rx(gsub("0", "?", par.obj))
Fortunately, Bill's suggestion to use the intersect function worked just fine--since intersect accepts only two arguments, I had to nest a pair of statements:
# Solution: Use a pair of nested intersects
dt2 <- dt[intersect(intersect(grep(par.fund, fund), grep(par.func, func)),
? ? ? ? ? ? ? ? ? ? grep(par.obj, obj)),
? ? ? ? ? sum(amount), by=c('code', 'year')]
df2 <- ddply(df[intersect(intersect(grep(par.fund, df$fund),
? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? grep(par.func, df$func)),
? ? ? ? ? ? ? ? ? ? ? ? ? grep(par.obj, df$obj)), ],
? ? ? ? ? ? .(code, year), summarize, amount = sum(amount))
Thanks for your ideas!
DB
Daniel Bush | School Finance Consultant
School Financial Services | Wis. Dept. of Public Instruction
daniel.bush -at- dpi.wi.gov | 608-267-9212
-----Original Message-----
From: William Dunlap [mailto:wdunlap at tibco.com]
Sent: Thursday, March 14, 2013 5:49 PM
To: Bush, Daniel P. DPI; 'r-help at r-project.org'
Subject: RE: Grep with wildcards across multiple columns
grep(pattern, textVector) returns of the integer indices of the elements of textVector that match the pattern.? E.g.,
? > grep("T", c("One","Two","Three","Four"))
? [1] 2 3
The '&' operator naturally operates on logical vectors of the same length (If you give it numbers it silently converts 0 to FALSE and? other numbers to TRUE.)
The two don't fit together.? You could use grepl(), which returns a logical vector the length of textVector, as in
? grepl(p1,v1) & grepl(p2,v2)
to figure which entries in the table have v1 matching p1 and v2 matching p2.
Or, you could use
? intersect(grep(p1,v1), grep(p2,v2))
if you want to stick with integer indices.
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Bush, Daniel P. DPI
Sent: Thursday, March 14, 2013 2:43 PM
To: 'r-help at r-project.org'
Subject: [R] Grep with wildcards across multiple columns
I have a fairly large data set with six variables set up like the following dummy:
# Create fake data
df <- data.frame(code? = c(rep(1001, 8), rep(1002, 8)),
? ? ? ? ? ? ? ? ? year? = rep(c(rep(2011, 4), rep(2012, 4)), 2),
? ? ? ? ? ? ? ? ? fund? = rep(c("10E", "10E", "10E", "27E"), 4),
? ? ? ? ? ? ? ? ? func? = rep(c("110000", "122000", "214000", "158000"), 4),
? ? ? ? ? ? ? ? ? obj? ? = rep("100", 16),
? ? ? ? ? ? ? ? ? amount = round(rnorm(16, 50000, 10000)))
What I would like to do is sum the amount variable by code and year,
filtering rows using different wildcard searches in each of three
columns: "1?E" in fund, "1??????" in func, and "???" in obj. I'm OK turning these into regular expressions:
# Set parameters
par.fund <- "10E"; par.func <- "100000"; par.obj <- "000"
par.fund <- glob2rx(gsub("0", "?", par.fund)) par.func <-
glob2rx(gsub("0", "?", par.func)) par.obj <- glob2rx(gsub("0", "?",
par.obj))
The problem occurs when I try to apply multiple greps across columns.
I'd prefer to use data.table since it's so much faster than plyr and I
have 159 different sets of parameters to run through, but I get the same error setting it up either way:
# Doesn't work
library(data.table)
dt <- data.table(df)
eval(parse(text=paste(
? "dt2 <- dt[", "grep('", par.fund, "', fund) & ",
? "grep('", par.func, "', func) & grep('", par.obj, "', obj)",
? ", sum(amount), by=c('code', 'year')]" , sep=""))) # Warning
message:
#? In grep("^1.E$", fund) & grep("^1.....$", func) :
#? longer object length is not a multiple of shorter object length
# Also doesn't work
library(plyr)
eval(parse(text=paste(
? "df2 <- ddply(df[grep('", par.fund, "', df$fund) & ",
? "grep('", par.func, "', df$func) & grep('", par.obj, "', df$obj), ]",
? ", .(code, year), summarize, amount = sum(amount))" , sep=""))) #
Warning message:
#? In grep("^1.E$", df$fund) & grep("^1.....$", df$func) :
#? longer object length is not a multiple of shorter object length
Clearly, the problem is how I'm trying to combine greps in subsetting
rows, but I haven't been able to find a solution that works. Any
thoughts-preferably something that works with data.table?
DB
Daniel Bush
School Finance Consultant
School Financial Services
Wisconsin Department of Public Instruction PO Box 7841 | Madison, WI
53707-7841 daniel.bush -at- dpi.wi.gov | sfs.dpi.wi.gov
Ph: 608-267-9212 | Fax: 608-266-2840