How to sum values across multiple variables using a wildcard?
On Mon, 2006-02-20 at 18:41 -0600, mtb954 at gmail.com wrote:
I have a dataframe called "data" with 5 records (in rows) each of which has been scored on each of many variables (in columns). Five of the variables are named var1, var2, var3, var4, var5 using headers. The other variables are named using other conventions. I can create a new variable called var6 with the value 15 for each record with this code:
var6=var1+var2+var3+var4+var5
but this is tedious for my real dataset with dozens of variables. I would rather use a wildcard to add up all the variables that begin with "Var" like this pseudocode:
Var6=sum(var*)
Any suggestions for implementing this in R? Thanks! Mark
Here is one approach using grep(). Given a data frame called MyDF with the following structure:
str(MyDF)
`data.frame': 10 obs. of 20 variables: $ other4 : num -0.869 0.376 -2.022 0.619 -0.129 ... $ var8 : num -0.380 1.428 -1.075 -0.796 -0.588 ... $ var4 : num -0.0850 -0.7335 -0.5019 -1.1633 -0.0197 ... $ other9 : num 0.0210 -0.6455 0.0289 1.2405 -1.3359 ... $ var10 : num 0.647 -0.798 0.180 1.135 -0.258 ... $ other2 : num 0.1332 -0.2227 0.0423 0.6881 2.0304 ... $ other10: num 0.811 2.166 0.569 0.302 0.669 ... $ var1 : num -0.774 -1.812 -1.230 -0.969 0.245 ... $ var2 : num -0.0538 0.3712 0.8222 -0.8025 -0.6914 ... $ other6 : num 0.871 0.291 2.079 1.098 1.025 ... $ other1 : num -0.5130 0.1358 0.8744 0.0997 1.7458 ... $ var9 : num 0.664 -0.456 0.415 2.090 -0.283 ... $ other3 : num -0.425 -0.283 0.706 -1.879 -0.828 ... $ other7 : num 0.100 0.177 0.570 -0.631 -1.009 ... $ var3 : num 1.446 -0.862 0.184 1.077 0.146 ... $ var5 : num 0.402 -0.498 -0.906 0.641 1.690 ... $ var6 : num 0.892 -0.242 0.561 0.530 -0.291 ... $ other5 : num -1.210 0.815 -1.284 -0.152 0.329 ... $ other8 : num -0.265 -1.278 1.152 0.232 -1.189 ... $ var7 : num -0.616 -0.994 -0.263 1.626 -1.372 ... Note that the column names are either var* or other*. Using grep() we get the indices of the column names that contain "other" plus one or more following characters, where the "other" begins the word:
grep("\\bother.", names(MyDF))
[1] 1 4 6 7 10 11 13 14 18 19 See ?regexp for more information. Note that I use "\\b" to being the search at the starting word boundary and then "." to require that there be following characters. Thus, this would not match " other1" or " other". You can then use the following to subset the data frame MyDF and sum the rows for the requested columns:
rowSums(MyDF[, grep("\\bother.", names(MyDF))])
1 2 3 4 5 6 7
-1.344893 1.531417 2.715234 1.616971 1.307379 4.655568 4.638446
8 9 10
-2.640485 -2.226270 -2.158248
You could use grep("other", names(MyDF)), but this would also get
"other" if it appears anywhere in the name. For example:
grep("other", "Thisotherone")
[1] 1 It just depends upon your naming schema and how strict you need to be in the search. HTH, Marc Schwartz