Skip to content
Prev 319617 / 398502 Next

Grep with wildcards across multiple columns

I think the way I set up my sample data without any explanation confused things slightly. These data might make things clearer:

# Create fake data
df <- data.frame(code   = c(rep(1001, 8), rep(1002, 8)),
                 year   = rep(c(rep(2011, 4), rep(2012, 4)), 2),
                 fund   = rep(c("10E", "27E", "27E", "29E"), 4),
                 func   = rep(c("110000", "122000", "214000", "158000"), 4),
                 obj    = rep(c("100", "100", "210", "220"), 4),
                 amount = round(rnorm(16, 50000, 10000)))

These are financial data with a hierarchical account structure where a zero represents a summary account that rolls up all the accounts at subsequent digits (e.g. 100000 rolls up 110000, 122000, 158000, etc.). I was trying to do two things with the search parameters: turn zeroes into question marks, and duplicate the functionality of a SQL query using those question marks as wildcards:

# Set parameters
par.fund <- "20E"; par.func <- "100000"; par.obj <- "000"
par.fund <- glob2rx(gsub("0", "?", par.fund))
par.func <- glob2rx(gsub("0", "?", par.func))
par.obj <- glob2rx(gsub("0", "?", par.obj))

Fortunately, Bill's suggestion to use the intersect function worked just fine--since intersect accepts only two arguments, I had to nest a pair of statements:

# Solution: Use a pair of nested intersects
dt2 <- dt[intersect(intersect(grep(par.fund, fund), grep(par.func, func)),
                    grep(par.obj, obj)),
          sum(amount), by=c('code', 'year')]
df2 <- ddply(df[intersect(intersect(grep(par.fund, df$fund),
                                    grep(par.func, df$func)),
                          grep(par.obj, df$obj)), ],
             .(code, year), summarize, amount = sum(amount))

Thanks for your ideas!

DB

Daniel Bush | School Finance Consultant 
School Financial Services | Wis. Dept. of Public Instruction 
daniel.bush -at- dpi.wi.gov | 608-267-9212

-----Original Message-----
From: William Dunlap [mailto:wdunlap at tibco.com] 
Sent: Thursday, March 14, 2013 5:49 PM
To: Bush, Daniel P. DPI; 'r-help at r-project.org'
Subject: RE: Grep with wildcards across multiple columns

grep(pattern, textVector) returns of the integer indices of the elements of textVector that match the pattern.  E.g.,
  > grep("T", c("One","Two","Three","Four"))
  [1] 2 3

The '&' operator naturally operates on logical vectors of the same length (If you give it numbers it silently converts 0 to FALSE and  other numbers to TRUE.)

The two don't fit together.  You could use grepl(), which returns a logical vector the length of textVector, as in
   grepl(p1,v1) & grepl(p2,v2)
to figure which entries in the table have v1 matching p1 and v2 matching p2.

Or, you could use
  intersect(grep(p1,v1), grep(p2,v2))
if you want to stick with integer indices.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com