ave function - R-help | R Mailing Lists

arun

Wed, Aug 21, 2013 2:00 PM

Hi Robert,


source("shareB101")
##Clean is the dataset
res1 <- with(Clean, aggregate(GRADE, list(TERM, INST_NUM), FUN = function(x) cbind(shapiro.test(x)$p.value, shapiro.test(x)$statistic)))
head(res1)
#  Group.1 Group.2          x.1          x.2
#1  201001  689809 1.720329e-07 9.307362e-01
#2  201201  689809 2.029761e-11 9.139405e-01
#3  201301  689809 4.709662e-14 8.791063e-01
#4  200701  994474 3.695317e-14 7.939902e-01
#5  200710  994474 4.560275e-13 8.849943e-01
#6  201203 1105752 4.434649e-15 9.220643e-01
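
A possible tidier variant, as a sketch only: run shapiro.test() once per group and return a named numeric vector, so that aggregate() produces named columns. It uses the small Clean1 example from my earlier message (quoted below) rather than your data; the helper name sw and the column names pval and stat are made up for illustration.

```r
# Small example data, copied from the Clean1 example in the quoted message
Clean1 <- data.frame(
  GRADE = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5, 3.5, 3.75, 4, 4.5, 4.25, 4.32),
  TERM = c(9L, 9L, 9L, 8L, 8L, 8L, 9L, 9L, 9L, 8L, 8L, 8L, 10L, 10L, 10L),
  INST_NUM = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L)
)

# Hypothetical helper: one test per group, returned as a named numeric vector
sw <- function(x) {
  st <- shapiro.test(x)
  c(pval = st$p.value, stat = unname(st$statistic))
}

# The formula interface only visits TERM/INST_NUM combinations that occur
res <- aggregate(GRADE ~ TERM + INST_NUM, data = Clean1, FUN = sw)

# aggregate() stores the two values in one matrix column; flatten it
res_flat <- do.call(data.frame, res)
names(res_flat)
```

The do.call(data.frame, res) step only turns the matrix column into two ordinary columns, which makes the result easier to merge or write out.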


#Regarding the lapply() error, it was the same problem as I thought:

lapply(split(Clean,list(Clean$TERM,Clean$INST_NUM)),function(x) shapiro.test(x$GRADE))
#Error in shapiro.test(x$GRADE) : sample size must be between 3 and 5000


lst1<-split(Clean,list(Clean$TERM,Clean$INST_NUM))
lst2<- lapply(lst1[lapply(lst1,nrow)>0], function(x) shapiro.test(x$GRADE))
lst2[[1]]

#    Shapiro-Wilk normality test
#
#data:  x$GRADE
#W = 0.9307, p-value = 1.72e-07
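
The empty pieces come from split() keeping every TERM x INST_NUM combination, including ones that never occur in the data. A sketch of one way around it, assuming the Clean1 example from the quoted message: drop = TRUE discards unused combinations, and an explicit size check guards against groups that shapiro.test() would reject. The names grp, ok, tests, and pvals are illustrative.

```r
# Small example data, copied from the Clean1 example in the quoted message
Clean1 <- data.frame(
  GRADE = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5, 3.5, 3.75, 4, 4.5, 4.25, 4.32),
  TERM = c(9L, 9L, 9L, 8L, 8L, 8L, 9L, 9L, 9L, 8L, 8L, 8L, 10L, 10L, 10L),
  INST_NUM = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L)
)

# drop = TRUE removes combinations with no rows (here TERM 10 / INST_NUM 2)
grp <- split(Clean1$GRADE, list(Clean1$TERM, Clean1$INST_NUM), drop = TRUE)

# shapiro.test() needs between 3 and 5000 values, so check sizes explicitly
ok <- lengths(grp) >= 3 & lengths(grp) <= 5000

tests <- lapply(grp[ok], shapiro.test)
pvals <- sapply(tests, function(st) st$p.value)
pvals
```

Checking the size rather than only nrow > 0 also covers groups with one or two grades, which would otherwise raise the same error.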


library(plyr)
res2 <- ldply(dlply(Clean, .(TERM, INST_NUM), function(x) shapiro.test(x$GRADE)), summarize, pval = p.value, stat1 = statistic)
head(res2)
#    TERM INST_NUM         pval     stat1
#1 200610  1106842 1.420787e-11 0.9192428
#2 200610  1324438 2.345177e-12 0.9048394
#3 200610  1557630 4.618117e-10 0.8968445
#4 200701   994474 3.695317e-14 0.7939902
#5 200701  1106842 2.745429e-08 0.9292158
#6 200701  1107019 6.887642e-10 0.9213602
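
To tie this back to the original ave() question, here is a rough sketch (again on Clean1, with the made-up column names sw_p and zGrade) that stores each group's Shapiro-Wilk p-value next to its rows before scaling. The length guard is there because, depending on the R version, ave() may also call FUN on empty TERM/INST_NUM combinations, which I believe is what triggered the "sample size" error.

```r
# Small example data, copied from the Clean1 example in the quoted message
Clean1 <- data.frame(
  GRADE = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5, 3.5, 3.75, 4, 4.5, 4.25, 4.32),
  TERM = c(9L, 9L, 9L, 8L, 8L, 8L, 9L, 9L, 9L, 8L, 8L, 8L, 10L, 10L, 10L),
  INST_NUM = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L)
)

# Per-row copy of the group's p-value; NA for groups too small to test
Clean1$sw_p <- with(Clean1, ave(GRADE, TERM, INST_NUM, FUN = function(x) {
  if (length(x) >= 3) shapiro.test(x)$p.value else NA_real_
}))

# Group-wise z-scores; as.numeric() drops the matrix attributes from scale()
Clean1$zGrade <- with(Clean1, ave(GRADE, TERM, INST_NUM, FUN = function(x) {
  as.numeric(scale(x))
}))

head(Clean1)
```

With sw_p in place you can inspect or filter the groups before deciding whether to rely on zGrade.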


A.K.

From: Robert Lynch <robert.b.lynch at gmail.com>
To: arun <smartpink111 at yahoo.com>
Sent: Wednesday, August 21, 2013 4:49 PM
Subject: Re: [R] ave function

Arun--

Thanks, I had no idea about dput. I really appreciate your help. I have attached an example data set from dput. Not to worry, the ID#s have been changed, but I wanted to include them just in case they were part of the issue (though I doubt it).



On Tue, Aug 20, 2013 at 7:27 PM, arun <smartpink111 at yahoo.com> wrote:

Hi,
>
>
>I guess that with your original dataset, some elements of the list returned by split() are empty.
>
>Clean<- structure(list(GRADE = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5,
>3.5, 3.75, 4), TERM = c(9L, 9L, 9L, 8L, 8L, 8L, 9L, 9L, 9L, 8L,
>8L, 8L), INST_NUM = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
>1L, 1L)), .Names = c("GRADE", "TERM", "INST_NUM"), class = "data.frame", row.names = c(NA,
>-12L))
>
>lapply(split(Clean,list(Clean$TERM,Clean$INST_NUM)),function(x) shapiro.test(x$GRADE))
>#$`8.1`
>#
>#    Shapiro-Wilk normality test
>#
>#data:  x$GRADE
>#W = 1, p-value = 1
>#
>#$`9.1`
>#
>#    Shapiro-Wilk normality test
>#
>#data:  x$GRADE
>#W = 1, p-value = 1
>
>-----------------------------------------------------
>sapply(split(Clean,list(Clean$TERM,Clean$INST_NUM)),function(x) shapiro.test(x$GRADE)$p.value)
>#8.1 9.1 8.2 9.2
>#  1   1   1   1
>with(Clean, aggregate(GRADE,list(TERM,INST_NUM),FUN=shapiro.test)) # FUN returns a list (class "htest"), hence the warning
>#  Group.1 Group.2 x
>#1       8       1 1
>#2       9       1 1
>#3       8       2 1
>#4       9       2 1
>#Warning message:
>#In format.data.frame(x, digits = digits, na.encode = FALSE) :
>#  corrupt data frame: columns will be truncated or padded with NAs
>
>
>
>library(plyr)
>ldply(dlply(Clean,.(TERM,INST_NUM), function(x) shapiro.test(x$GRADE)), summarize, pval=p.value)
>#  TERM INST_NUM pval
>#1    8        1    1
>#2    8        2    1
>#3    9        1    1
>#4    9        2    1
>
>
>
>Now, consider this example:
>
>Clean1<- structure(list(GRADE = c(1, 2, 3, 1.5, 1.75, 2, 0.5, 2, 3.5,
>3.5, 3.75, 4, 4.5, 4.25, 4.32), TERM = c(9L, 9L, 9L, 8L, 8L,
>8L, 9L, 9L, 9L, 8L, 8L, 8L, 10L, 10L, 10L), INST_NUM = c(1L,
>1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("GRADE",
>"TERM", "INST_NUM"), class = "data.frame", row.names = c(NA,
>-15L))
>lapply(split(Clean1,list(Clean1$TERM,Clean1$INST_NUM)),function(x) shapiro.test(x$GRADE))
>#Error in shapiro.test(x$GRADE) : sample size must be between 3 and 5000
>
>split(Clean1,list(Clean1$TERM,Clean1$INST_NUM))[[6]] ###0 rows
>#[1] GRADE    TERM     INST_NUM
>#<0 rows> (or 0-length row.names)
>
>
>lst1<-split(Clean1,list(Clean1$TERM,Clean1$INST_NUM))
>lapply(lst1[lapply(lst1,nrow)>0], function(x) shapiro.test(x$GRADE))
>#$`8.1`
>#
>#    Shapiro-Wilk normality test
>#
>#data:  x$GRADE
>#W = 1, p-value = 1
>
>
>You could do this directly with:
>ldply(dlply(Clean1,.(TERM,INST_NUM), function(x) shapiro.test(x$GRADE)), summarize, pval=p.value)
>#  TERM INST_NUM      pval
>#1    8        1 1.0000000
>#2    8        2 1.0000000
>#3    9        1 1.0000000
>#4    9        2 1.0000000
>#5   10        1 0.5248807
>ldply(dlply(Clean1,.(TERM,INST_NUM), function(x) shapiro.test(x$GRADE)), summarize, pval=p.value,stat1=statistic)
>#  TERM INST_NUM      pval     stat1
>#1    8        1 1.0000000 1.0000000
>#2    8        2 1.0000000 1.0000000
>#3    9        1 1.0000000 1.0000000
>#4    9        2 1.0000000 1.0000000
>#5   10        1 0.5248807 0.9393788
>
>
>
>#or
>with(Clean1, aggregate(GRADE,list(TERM,INST_NUM),FUN=function(x) shapiro.test(x)$p.value))
>#  Group.1 Group.2         x
>#1       8       1 1.0000000
>#2       9       1 1.0000000
>#3      10       1 0.5248807
>#4       8       2 1.0000000
>#5       9       2 1.0000000
>
>#If you want both the p-value and the statistic
>with(Clean1, aggregate(GRADE,list(TERM,INST_NUM),FUN=function(x) cbind(shapiro.test(x)$p.value,shapiro.test(x)$statistic)) )
>#  Group.1 Group.2       x.1       x.2
>#1       8       1 1.0000000 1.0000000
>#2       9       1 1.0000000 1.0000000
>#3      10       1 0.5248807 0.9393788
>#4       8       2 1.0000000 1.0000000
>#5       9       2 1.0000000 1.0000000
>
>
>Hope this helps.
>
>
>A.K.
>
>
>________________________________
>From: Robert Lynch <robert.b.lynch at gmail.com>
>
>To: arun <smartpink111 at yahoo.com>
>Cc: R help <r-help at r-project.org>
>Sent: Tuesday, August 20, 2013 8:00 PM
>Subject: Re: [R] ave function
>
>
>
>
>I tried
>> lapply(split(Clean,list(Clean$TERM,Clean$INST_NUM)),function(x) shapiro.test(x$GRADE))
>
>and I got
>
>>Error in shapiro.test(x$GRADE.) : sample size must be between 3 and 5000
>
>
>I also tried
>with(Clean, aggregate(GRADE,list(TERM,INST_NUM),FUN=shapiro.test))
>
>
>and got
>   Group.1 Group.2         x
>1   201001  689809 0.9546164
>2   201201  689809 0.9521624
>3   201301  689809 0.9106206
>4   200701  994474 0.8862705
>5   200710  994474 0.9176743
>6   201203 1105752 0.9382688
>.
>.
>.
>72  201001 1759272 0.9291295
>73  201101 1759272 0.9347072
>74  201110 1897809 0.9395375
>Warning message:
>In format.data.frame(x, digits = digits, na.encode = FALSE) :
>  corrupt data frame: columns will be truncated or padded with NAs
>
>I am not sure how to interpret the output of the second.
>
>Thanks!
>
>
>
>On Tue, Aug 13, 2013 at 11:01 AM, arun <smartpink111 at yahoo.com> wrote:
>
>Hi,
>>You could try:
>>lapply(split(Clean,list(Clean$TERM,Clean$INST_NUM)),function(x) shapiro.test(x$GRADE))
>>A.K.
>>
>>
>>
>>
>>
>>----- Original Message -----
>>From: Robert Lynch <robert.b.lynch at gmail.com>
>>To: r-help at r-project.org
>>Cc:
>>Sent: Tuesday, August 13, 2013 1:46 PM
>>Subject: [R] ave function
>>
>>I've written the following function
>>CoursePrep <- function (Source, SaveName) {
>>
>>
>>  Clean$TERM <- as.factor(Clean$TERM)
>>
>>  Clean$INST_NUM <- as.factor(Clean$INST_NUM)
>>  Clean$zGrade <- with(Clean, ave(GRADE., list(TERM, INST_NUM), FUN = scale))
>>  write.csv(Clean,paste(SaveName, "csv", sep ="."), row.names = FALSE)
>>  return(Clean)
>>}
>>
>>which is all well and good, but I want to throw a shapiro.test in before I
>>normalize. That is, I don't really understand how I did what I wanted
>>(I got help) in
>>Clean$zGrade <- with(Clean, ave(GRADE., list(TERM, INST_NUM), FUN = scale))
>>For the whole of Clean, that code finds all sets of GRADE. values that have
>>the same INST_NUM and TERM, computes their mean, subtracts the mean, and
>>divides by the standard deviation. For each of those sets of grades I would
>>like to call shapiro.test() on the set, to see if it is normal *before* I
>>assume it is.
>>
>>I know the naive
>>with(Clean, shapiro.test( list(TERM, INST_NUM)))
>>doesn't work. I also tried
>>with(Clean, ave(GRADE., list(TERM, INST_NUM), FUN = function(x) shapiro.test(x)))
>>
>>which returns
>>Error in shapiro.test(x) : sample size must be between 3 and 5000
>>even though I have checked that the selected sets all have lengths between
>>3 and 5000, using the following on my full data:
>>
>>ClassSize <- with(Clean, ave(GRADE., list(TERM, INST_NUM), FUN = function(x) length(x)))
>>> summary(ClassSize)
>>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>   22.0   198.0   241.0   244.4   279.0   466.0
>>
>>here is some sample data
>>GRADE,  TERM,  INST_NUM
>>1,      9,     1
>>2,      9,     1
>>3,      9,     1
>>1.5,    8,     2
>>1.75,   8,     2
>>2,      8,     2
>>0.5,    9,     2
>>2,      9,     2
>>3.5,    9,     2
>>3.5,    8,     1
>>3.75,   8,     1
>>4,      8,     1
>>
>>and hopefully the code would test the following sets of grades:
>>(1, 2, 3) (1.5, 1.75, 2) (0.5, 2, 3.5) (3.5, 3.75, 4)
>>
>>Thanks Robert
>>
>>    [[alternative HTML version deleted]]
>>
>>______________________________________________
>>R-help at r-project.org mailing list
>>https://stat.ethz.ch/mailman/listinfo/r-help
>>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>and provide commented, minimal, self-contained, reproducible code.
>>
>>
>