Thank you again to Professor Azzalini for taking the time to expound this issue for me. It's been very instructive to look in detail at the definition of Gini mean difference and at the incorporation of weights. Noting suggestions for improving the speed of my function has also been very helpful. With my application, the weights refer to the "reliability" of a measurement, with a weight of 1 signifying high reliability and a weight close to zero indicating low reliability. The mean difference is used as a robust estimate of the standard deviation of the vector of measurements (the 0.5*sqrt(pi) multiplier is for this purpose). It seems to me that Professor Azzalini's result may be used for non-integer weights if the weights are scaled to sum to the number of measurements (ie. w <- w * length(x)/sum(w)). In this case, I think that gmd(x=c(1,2,4), w=c(1,2,1)) gives the same result at gmd(x=c(1,2,2,4), w=c(1,1,1,1)). Thank you again to all who have contributed to my understanding and R implementation of the mean difference. Regards, Andrew C. Ward CAPE Centre Department of Chemical Engineering The University of Queensland Brisbane Qld 4072 Australia andreww at cheque.uq.edu.au On Monday, April 28, 2003 6:55 PM, Adelchi Azzalini
[SMTP:azzalini at stat.unipd.it] wrote:
This is to complement my previous contribution on computation
of
Gini mean
difference - a discussion started by Andrew Ward. The index is
"defined" as
gini <- 0
for (i in 1:n)
{
for (j in 1:n) gini <- gini + freq[i]*freq[j]*abs(x
[i]-x[j])
}
gini<- gini/((sum(freq)-1)*sum(freq))
This is the so-called form "without repetition"; the variant
"with repetition"
does not have -1 in the final line.
Since computaation via the definition is totally inefficient,
alternative
approaches have been put forward, following Andrew's message.
My first version of a computationally convenient implementation
was
essentially this:
gini.md0<- function(x)
{ # x=data vector
n <-length(x)
return(4*sum((1:length(x))*sort(x)/(n*(n-1)))
-2*mean(x)*(n+1)/(n-1))
}
Since Andrew (private message) has stressed the importance in
his problem
of allowing for replicated data, here is a more general
version,
obtained by
elaborating on the previous one with a bit of algebra:
gini.md <- function(x, freq=rep(1,length(x)))
{# x=data vector, freq=vector of frequencies
if(!is.vector(x)) stop("x must be a vector")
if(length(x) != length(freq))
stop("x and freq must have same length")
if(min(freq)<0 | sum(freq)==0 | any(freq != as.integer(freq))
)
stop("freq must be counts")
x <- x[freq>0]
freq <- freq[freq>0]
j <- order(x)
x <- x[j]
n <- as.integer(freq[j])
n. <- sum(n)
u <- (cumsum(n)-n)*n+ n*(n+1)/2
return(4*sum(u*x)/(n.*(n.-1))
-2*weighted.mean(x,n)*(n.+1)/(n.-1))
}
Notice that gini.md(x,freq) gives the same of mini.md0(rep
(x,freq)), but the latter
is obviously less efficient. Either are however far more
efficient that straight
implementation of the "definition".
regards
Adelchi Azzalini
--
Adelchi Azzalini <azzalini at stat.unipd.it>
Dipart.Scienze Statistiche, Universita di Padova, Italia
http://azzalini.stat.unipd.it/
______________________________________________ R-help at stat.math.ethz.ch mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help