I still need to do some repetitive statistical analysis on some outcomes from a dataset. Take the following as an example;

id sex hiv age famsize bmi resprate
1 M Pos 23 2 16 15
2 F Neg 24 5 18 14
3 F Pos 56 14 23 24
4 F Pos 67 3 33 31
5 M Neg 34 2 21 23

I want to know if there are statistically detectable differences in all of the continuous variables in my data set when subdivided by sex or hiv status (i.e. are age, family size, bmi and resprate different in my male and female patients or in hiv pos/neg patients).

Of course I can use wilcoxon or t-tests e.g:

wilcox.test(age~sex)
wilcox.test(famsize~sex)
wilcox.test(bmi~sex)
wilcox.test(resprate~sex)
wilcox.test(age~hiv)
wilcox.test(famsize~hiv)
wilcox.test(bmi~hiv)
wilcox.test(resprate~hiv)

but there must be some easy way of looping/automating this code (i.e. get all the continuous variables analysed one by one by sex, then analysed one by one by hiv status). Obviously my actual dataset is considerably bigger than what is shown here - I have many variables to assess, making the longhand instruction to do every test pretty unsatisfactory.

I think I can use "for" or some other looping command for this purpose but I can't work out how. I think I don't properly understand how loops work yet as I'm still quite new to R.

Please could someone help - ideally with an explanation and some quick sample code?

Derek
--
View this message in context: http://r.789695.n4.nabble.com/Using-functions-loops-for-repetitive-commands-tp3498006p3498006.html
Sent from the R help mailing list archive at Nabble.com.
Using functions/loops for repetitive commands
10 messages · Gerrit Eichner, Shekhar, dereksloan +1 more
Hello, Derek, see below.
On Thu, 5 May 2011, dereksloan wrote:
> I still need to do some repetitive statistical analysis on some outcomes
> from a dataset. Take the following as an example;
>
> id sex hiv age famsize bmi resprate
> 1 M Pos 23 2 16 15
> 2 F Neg 24 5 18 14
> 3 F Pos 56 14 23 24
> 4 F Pos 67 3 33 31
> 5 M Neg 34 2 21 23
>
> I want to know if there are statistically detectable differences in all of
> the continuous variables in my data set when subdivided by sex or hiv
> status (i.e. are age, family size, bmi and resprate different in my male
> and female patients or in hiv pos/neg patients).
>
> Of course I can use wilcoxon or t-tests e.g:
>
> wilcox.test(age~sex)
> wilcox.test(famsize~sex)
> wilcox.test(bmi~sex)
> wilcox.test(resprate~sex)
> wilcox.test(age~hiv)
> wilcox.test(famsize~hiv)
> wilcox.test(bmi~hiv)
> wilcox.test(resprate~hiv)
> .... [snip]
Define, e. g.,
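my.wilcox.tests <- function( var.names, groupvar.name, data) {
  # run one Wilcoxon test per variable name; the results come back as a list
  lapply( var.names,
          function( v) {
            # build a formula such as age ~ sex from the two names
            form <- as.formula( paste( v, "~", groupvar.name))
            wilcox.test( form, data = data)
          } )
}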
and call something like
my.wilcox.tests( <character vector with relevant variable names>,
                 <character string with relevant grouping variable>,
                 data = <your data set as data frame>)
Caveat: untested!
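
As a minimal sketch of how such a call might look, the block below types in the five example rows from the question and runs the tests by sex and by hiv status. The names `patients`, `cont.vars`, `by.sex` and `by.hiv` are made up for illustration, and two things are assumptions added here rather than part of the suggestion above: `exact = FALSE` (to avoid warnings about ties in such a small sample) and `setNames()` (so each result is labelled with its variable).

```r
# Example data typed in from the question (five patients only).
patients <- data.frame(
  id       = 1:5,
  sex      = c("M", "F", "F", "F", "M"),
  hiv      = c("Pos", "Neg", "Pos", "Pos", "Neg"),
  age      = c(23, 24, 56, 67, 34),
  famsize  = c(2, 5, 14, 3, 2),
  bmi      = c(16, 18, 23, 33, 21),
  resprate = c(15, 14, 24, 31, 23)
)

# Same idea as my.wilcox.tests() above, repeated so this block runs on its own.
my.wilcox.tests <- function(var.names, groupvar.name, data) {
  tests <- lapply(var.names, function(v) {
    form <- as.formula(paste(v, "~", groupvar.name))
    # exact = FALSE is an assumption made here: it requests the normal
    # approximation, which avoids warnings when the data contain ties
    wilcox.test(form, data = data, exact = FALSE)
  })
  setNames(tests, var.names)  # label each result with its variable name
}

cont.vars <- c("age", "famsize", "bmi", "resprate")
by.sex <- my.wilcox.tests(cont.vars, "sex", patients)
by.hiv <- my.wilcox.tests(cont.vars, "hiv", patients)

# Collect the p-values into one named vector per grouping variable.
p.sex <- sapply(by.sex, function(res) res$p.value)
p.hiv <- sapply(by.hiv, function(res) res$p.value)
print(p.sex)
print(p.hiv)
```

Each element of `by.sex` is an ordinary test result, so `by.sex$age` prints the full output for age by sex.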
Hth -- Gerrit
---------------------------------------------------------------------
Dr. Gerrit Eichner Mathematical Institute, Room 212
gerrit.eichner at math.uni-giessen.de Justus-Liebig-University Giessen
Tel: +49-(0)641-99-32104 Arndtstr. 2, 35392 Giessen, Germany
Fax: +49-(0)641-99-32109 http://www.uni-giessen.de/cms/eichner
Hi Derek,
You can accomplish your loop jobs in any of the following ways:
(a) use a for loop
(b) use a while loop
(c) use lapply, tapply, or sapply (I feel lapply is the most elegant
way)
---------------For Loop-----------------------------
"for" loops are pretty simple to use and work much as in other
scripting languages you may know (I am thinking of Matlab).
(Example 1) Let's say you know that you have to run 10 iterations; then
you can run it as
for(i in 1:10) print(i)
# prints the numbers from 1 to 10
(Example 2) You don't know in advance how many iterations you need to
run. All you have is a vector, and you want to do some operation on
each of its elements. You can do something like this:
myVector<-c(20,45,23,45,89)
for(i in seq_along(myVector)) print(myVector[i])
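
Applied to the question in this thread, the same for-loop pattern might look like the sketch below. The data frame `patients` holds a few of the example columns from the question; `cont.vars` and `p.values` are names made up here, and `exact = FALSE` is an assumption added to keep the tiny sample from producing warnings.

```r
# A few columns of the example data from the question.
patients <- data.frame(
  sex = c("M", "F", "F", "F", "M"),
  age = c(23, 24, 56, 67, 34),
  bmi = c(16, 18, 23, 33, 21)
)

cont.vars <- c("age", "bmi")

# Pre-allocate a named vector to hold one p-value per variable.
p.values <- numeric(length(cont.vars))
names(p.values) <- cont.vars

for (v in cont.vars) {
  x <- patients[[v]]   # the column whose name is stored in v
  g <- patients$sex    # the grouping variable
  res <- wilcox.test(x ~ g, exact = FALSE)
  p.values[v] <- res$p.value
}

print(p.values)
```

Looping over the variable names rather than over positions means each p-value can be stored under the name of the variable it belongs to.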
-------------Using lapply-------------------------
In "lapply" you mainly need to provide two things:
(1) First parameter: a vector or list, for example a sequence of numbers
(2) Second parameter: a function, which can be a user-defined function
or a built-in one.
lapply calls the function once for every element of the first
parameter and returns the results as a list.
For example:
x<-c(10,20,20)
lapply(seq_along(x), function(i) {
  # your logic
})
If you look at the first parameter, I have passed seq_along(x). The
result of seq_along(x) is 1, 2, 3.
Now lapply takes each of these numbers and calls the function with it.
That means lapply calls the function three times for this data,
something like this:
function(1) { # your logic }
function(2) { # your logic }
function(3) { # your logic }
That means the logic inside the function is executed for each
and every value specified in the first parameter of the lapply
function.
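
To make the description above concrete, here is a small sketch with some actual logic in place of the placeholder. The names `doubled` and `squares` are made up for the illustration.

```r
x <- c(10, 20, 20)

# Iterate over the positions 1, 2, 3 and use each one to index x.
doubled <- lapply(seq_along(x), function(i) x[i] * 2)

# lapply always returns a list; unlist() flattens it to a plain vector.
unlist(doubled)   # 20 40 40

# The function can also receive the values themselves instead of positions.
# sapply works like lapply but simplifies the result to a vector when it can.
squares <- sapply(x, function(value) value^2)
squares           # 100 400 400
```

Whether to iterate over positions or over values is mostly a matter of taste; positions are handy when the function needs to look up the same element in several vectors.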
I hope it helps you in some way.
For your problem, I am guessing that you store the data in a data
frame or matrix and want to automate the analysis, right?
You can try "lapply"; I think that would be efficient. Let me
try it as well.
Regards,
Som Shekhar
Your code may be untested but it works - also helping me slowly to start
understanding how to write functions. Thank you.
However I still have difficulty. I also have some categorical variables to
analyse by sex & hiv status - i.e. my dataset expands to (for example);
id sex hiv age famsize bmi resprate smoker alcohol
1 M Pos 23 2 16 15 Y Y
2 F Neg 24 5 18 14 Y Y
3 F Pos 56 14 23 24 Y N
4 F Pos 67 3 33 31 N N
5 M Neg 34 2 21 23 N N
Using the template for the code you sent me I thought I could analyse the
categorical variables by sex & hiv status using a chi-squared test;
Long-hand this would be;
chisq.test(smoker,sex)
chisq.test(alcohol,sex)
chisq.test(smoker,hiv)
chisq.test(alcohol,hiv)
Again I wanted to use a function to automate it with a loop and thought I
could write;
categ<-c(smoker,alcohol)
group.name<-c(sex,hiv)
bl.chisq<-function(categ,group.name,<dataframe name>){
lapply(categ,
function(y){
form2<-as.formula(paste(y,group.name))
chisq.test(form2,<dataframe name>)
})
}
bl.chisq(categ,group.name,<data frame name>)
but I get an error message:
Error in parse(text = x) : unexpected symbol in "smoker sex"
What is wrong with the code? Is it because wilcox.test takes a formula
(with a ~ symbol for modelling) whilst chisq.test simply requires me to
list the raw data? If so, how can I change my code to automate the
chisq.test in the same way I did for the wilcox.test?
Many thanks for any help!
Derek
On May 5, 2011, at 10:01 AM, dereksloan wrote:
> form2<-as.formula(paste(y,group.name))

I haven't tested it, but I suspect you failed to note that Eichner put "~" between the two names in his paste() call inside as.formula().
David Winsemius, MD West Hartford, CT
Thanks David, I did notice that and I got his code to work using wilcox.test for the continuous variables. The problem is that when I tried to alter the code to do chisq.test on my categorical variables there is something wrong with the syntax and I don't know what.
Derek
On May 5, 2011, at 1:08 PM, dereksloan wrote:
> Thanks David, I did notice that and I got his code to work using
> wilcox.test for the continuous variables. The problem is that when I
> tried to alter the code to do chisq.test on my categorical variables
> there is something wrong with the syntax and I don't know what.
Right....

> ?chisq.test
# No mention of a formula argument seen
> ?chisq.test.formula
No documentation for 'chisq.test.formula' in specified packages and libraries:
you could try '??chisq.test.formula'

`chisq.test` doesn't have a formula method, so sending it a formula will fail. Why aren't you sending it the arguments instead of turning them into strings?
David Winsemius, MD West Hartford, CT
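
One way to read the suggestion above is to keep the variable names as quoted strings and use them only to pick columns out of the data frame, so that chisq.test receives the two columns directly. The following is a sketch under that reading, not a tested recipe from the thread: `patients`, `groups` and `all.tests` are names made up here, the data are the five example rows from the question, and `suppressWarnings()` is an assumption added because five observations are far too few for the chi-squared approximation.

```r
# Categorical columns of the example data from the question.
patients <- data.frame(
  sex     = c("M", "F", "F", "F", "M"),
  hiv     = c("Pos", "Neg", "Pos", "Pos", "Neg"),
  smoker  = c("Y", "Y", "Y", "N", "N"),
  alcohol = c("Y", "Y", "N", "N", "N")
)

bl.chisq <- function(categ, group.name, data) {
  tests <- lapply(categ, function(y) {
    # data[[y]] extracts the column whose name is stored in the string y
    suppressWarnings(chisq.test(data[[y]], data[[group.name]]))
  })
  setNames(tests, categ)
}

# Names must be quoted: c(smoker, alcohol) would look for objects called
# smoker and alcohol instead of building a vector of names.
categ  <- c("smoker", "alcohol")
groups <- c("sex", "hiv")

# One call per grouping variable, collected in a named list.
all.tests <- lapply(groups, function(g) bl.chisq(categ, g, patients))
names(all.tests) <- groups

all.tests$sex$smoker$p.value
```

The original attempt also passed both grouping variables at once in `group.name`; here each grouping variable gets its own call.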
Thanks a lot, I understand what you say but I'm having problems - maybe with the syntax or the specific command.

You are right - I have a dataframe to store the data and want to automate the analysis, i.e. I want to do a chisq.test to find out whether alcohol intake (Y/N) differs between sexes, then whether smoking (Y/N) differs between sexes, then whether alcohol intake or smoking differ by hiv status.

The command within my data frame for each individual comparison is e.g. chisq.test(alcohol,sex)... then repeat it for all combinations of variables.

But using lapply I'm still unsure how to design the loop. I'll keep trying - let me know if you have more ideas.

Derek
On May 5, 2011, at 1:45 PM, dereksloan wrote:
> Thanks a lot, I understand what you say but I'm having problems - maybe
> with the syntax or the specific command. You are right - I have a
> dataframe to store the data and want to automate the analysis, i.e. I
> want to do a chisq.test to find out whether alcohol intake (Y/N) differs
> between sexes, then whether smoking (Y/N) differs between sexes, then
> whether alcohol intake or smoking differ by hiv status. The command
> within my data frame for each individual comparison is e.g.
> chisq.test(alcohol,sex)... then repeat it for all combinations of
> variables.
I don't generally answer questions that support shotgun approaches to manufacturing p-values, for fear of encouraging unprincipled data-mining ... unless it is clear that the questioner understands what they are doing from a statistical point of view. So my apologies. I probably shouldn't have even posted in this case. I misunderstood the question and thought it was just a quick syntactic fix. I now understand it to be more involved, and it really demands more care and respect than I was giving it.
> but using lapply I'm still unsure how to design the loop. I'll keep
> trying - let me know if you have more ideas.
> Derek
David Winsemius, MD West Hartford, CT
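
The concern about manufacturing many p-values is the multiple-testing problem. One common, partial remedy is to adjust the p-values for the number of tests, which base R offers through p.adjust(). The sketch below is only an illustration under assumptions: it uses the five example rows from the question, the names `patients`, `p.raw`, `p.holm` and `p.bh` are made up here, and it does not replace deciding in advance which comparisons are scientifically meaningful.

```r
# Continuous columns of the example data from the question.
patients <- data.frame(
  sex      = c("M", "F", "F", "F", "M"),
  age      = c(23, 24, 56, 67, 34),
  famsize  = c(2, 5, 14, 3, 2),
  bmi      = c(16, 18, 23, 33, 21),
  resprate = c(15, 14, 24, 31, 23)
)

cont.vars <- c("age", "famsize", "bmi", "resprate")

# Unadjusted p-values, one Wilcoxon test per variable.
# exact = FALSE is an assumption made here to avoid warnings about ties.
p.raw <- sapply(cont.vars, function(v) {
  wilcox.test(patients[[v]] ~ patients$sex, exact = FALSE)$p.value
})

# Holm controls the family-wise error rate;
# Benjamini-Hochberg ("BH") controls the false discovery rate.
p.holm <- p.adjust(p.raw, method = "holm")
p.bh   <- p.adjust(p.raw, method = "BH")

print(cbind(raw = p.raw, holm = p.holm, BH = p.bh))
```

Adjusted p-values are never smaller than the raw ones, so a result that only looks striking before adjustment deserves caution.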
Hello, Derek, first of all, be very aware of what David Winsemius said; you are about to enter the area of "unprincipled data-mining" (as he called it) with its trap -- one of many -- of multiple testing. So, *if* you know what the consequences and possible remedies are, a purely R-syntactic "solution" to your problem might be the (again not fully tested) hack below.
> If so how can I change my code to automate the chisq.test in the same
> way I did for the wilcox.test?
Try
lapply( <your_data_frame>[<selection_of_relevant_components>],
function( y)
chisq.test( y, <your_data_frame>$<group_name>)
)
or even shorter:
lapply( <your_data_frame>[<selection_of_relevant_components>],
chisq.test, <your_data_frame>$<group_name>
)
However, in the resulting output you will not be seeing the names of the
variables that went into the first argument of chisq.test(). This is a
little bit more complicated to resolve:
lapply( names( <your_data_frame>[<selection_of_relevant_components>]),
        function( y)
          eval( substitute( chisq.test( <your_data_frame>$y0,
                                        <your_data_frame>$<group_name>),
                            list( y0 = y) ) )
      )
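
A filled-in version of the eval/substitute idea might look like the sketch below, again on the five example rows from the question. The names `patients`, `categ` and `named.tests` are made up here. Two details are assumptions that differ slightly from the template above: `as.name(y)` is used so the call reads `patients$smoker` rather than `patients$"smoker"`, and `suppressWarnings()` hides the small-sample warning.

```r
# Categorical columns of the example data from the question.
patients <- data.frame(
  sex     = c("M", "F", "F", "F", "M"),
  smoker  = c("Y", "Y", "Y", "N", "N"),
  alcohol = c("Y", "Y", "N", "N", "N")
)

categ <- c("smoker", "alcohol")

named.tests <- lapply(categ, function(y) {
  # Build the call chisq.test(patients$<name>, patients$sex) with the real
  # column name in place of y0, so the test output records which variable
  # was used.
  call.with.name <- substitute(chisq.test(patients$y0, patients$sex),
                               list(y0 = as.name(y)))
  suppressWarnings(eval(call.with.name))
})
names(named.tests) <- categ

# The description of the data now mentions the actual column.
named.tests$smoker$data.name
```

Naming the list elements as well means a result can be picked out directly, for example `named.tests$alcohol`.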
Still another possibility is to use xtabs() (with its summary-method)
which has a formula argument.
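
The xtabs() route could be sketched as follows, once more on the five example rows from the question. Because xtabs() takes a formula, the formula can be pasted together from variable names just as for wilcox.test. The names `patients`, `categ` and `xt.tests` are made up here. As far as I know, the test reported by summary() on a table does not apply the continuity correction that chisq.test() uses by default for 2 x 2 tables, so the two can give different p-values.

```r
# Categorical columns of the example data from the question.
patients <- data.frame(
  sex     = c("M", "F", "F", "F", "M"),
  smoker  = c("Y", "Y", "Y", "N", "N"),
  alcohol = c("Y", "Y", "N", "N", "N")
)

categ <- c("smoker", "alcohol")

xt.tests <- lapply(categ, function(v) {
  # One-sided formula such as ~ smoker + sex defines the cross-tabulation.
  form <- as.formula(paste("~", v, "+ sex"))
  # summary() of the table includes a chi-squared test of independence.
  summary(xtabs(form, data = patients))
})
names(xt.tests) <- categ

xt.tests$smoker           # prints the number of cases and the test
xt.tests$smoker$p.value   # the p-value on its own
```

With only five cases the printed summary will note that the chi-squared approximation may be incorrect, which is expected for this toy data.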
Hoping that you know what to do with the results -- Gerrit