
Very Slow Gower Similarity Function

6 messages · Tyler Smith, Anon., Jari Oksanen +1 more

#
Hello,

I am a relatively new user of R. I have written a basic function to calculate
the Gower similarity function. I was motivated to do so partly as an exercise
in learning R, and partly because the existing option (vegdist in the vegan
package) does not accept missing values.

I think I have succeeded - my function gives me the correct values. However, now
that I'm starting to use it with real data, I realise it's very slow. It takes
more than 45 minutes on my Windows 98 machine (R 2.0.1 Patched (2005-03-29))
with a 185x32 matrix with ca 100 missing values. If anyone can suggest ways to
speed up my function I would appreciate it. I suspect having a pair of nested
for loops is the problem, but I couldn't figure out how to get rid of them.

The function is:

### Gower Similarity Matrix ###

sGow <- function(mat) {

  OBJ <- nrow(mat)        # number of objects
  MATDESC <- ncol(mat)    # number of descriptors
  # descriptor ranges
  MRANGE <- apply(mat, 2, max, na.rm = TRUE) - apply(mat, 2, min, na.rm = TRUE)
  DESCRIPT <- 1:MATDESC   # descriptor index vector
  smat <- matrix(1, nrow = OBJ, ncol = OBJ)  # 'empty' similarity matrix

  for (i in 1:OBJ) {
    for (j in i:OBJ) {

      ## calculate index vector of non-NA descriptors between objects i and j
      descvect <- intersect(setdiff(DESCRIPT, DESCRIPT[is.na(mat[i, DESCRIPT])]),
                            setdiff(DESCRIPT, DESCRIPT[is.na(mat[j, DESCRIPT])]))

      descnum <- length(descvect)  # number of valid descr for i~j comparison

      partialsim <- (1 - abs(mat[i, descvect] - mat[j, descvect]) / MRANGE[descvect])

      smat[i, j] <- smat[j, i] <- sum(partialsim) / descnum
    }
  }
  smat
}

Thank you for your time,

Tyler
#
On 18 Apr 2005, at 19:10, Tyler Smith wrote:

Speed is the reason to use C instead of R. It should be easy, almost
trivial, to modify vegdist.c so that it handles missing values. I
guess this handling means ignoring the value pair if one of the values
is missing -- which is not so gentle to the metric properties so dear
to Gower. Package vegan is designed for ecological community data, which
generally do not have missing values (except in environmental data),
but contributions are welcome.
cheers, jari oksanen
--
Jari Oksanen, Oulu, Finland
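
The pairwise handling described above (ignore a value pair when either value is missing) can also be done in plain R without looping over object pairs. The following is only a sketch under stated assumptions, not code from vegan or from the original poster: gower_sim is a hypothetical name, all descriptors are assumed quantitative, and the loop runs over descriptors (32 in the poster's data) while outer() handles all object pairs at once.

```r
# Sketch: Gower similarity for quantitative descriptors, with pairwise
# deletion of missing values. gower_sim is a hypothetical helper name.
gower_sim <- function(mat) {
  mat <- as.matrix(mat)
  n <- nrow(mat)
  # range of each descriptor, ignoring NAs
  rng <- apply(mat, 2, function(x) diff(range(x, na.rm = TRUE)))
  num <- matrix(0, n, n)  # running sum of partial similarities
  den <- matrix(0, n, n)  # number of descriptors observed in both objects
  for (k in seq_len(ncol(mat))) {
    x <- mat[, k]
    # partial similarity for all pairs; NA wherever either value is missing
    s <- 1 - abs(outer(x, x, "-")) / rng[k]
    ok <- !is.na(s)       # also skips constant descriptors (0/0 gives NaN)
    num[ok] <- num[ok] + s[ok]
    den <- den + ok
  }
  num / den               # NaN where two objects share no observed descriptor
}

# Small made-up example: 3 objects, 3 descriptors, 2 missing values
m <- rbind(c(1, 10, NA),
           c(2, NA,  5),
           c(3, 30,  7))
s <- gower_sim(m)
```

One design difference from the nested-loop version: a descriptor with zero range is skipped here rather than turning the whole comparison into NaN. Whether that is the desired behaviour is a judgment call.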
#
Jari Oksanen wrote:

The only reason you never see ecological community data with missing 
values is because the ecologists remove those species/sites from their 
Excel sheets before they give it to you to sort out their mess.  This is 
actually one of the few things they know how to do in Excel - I'm 
dreading the day when a paper appears in JAE saying that you can use 
Excel to produce P-values.

To be slightly more serious, as an exercise the OP could consider 
writing a wrapper function in R that removes the missing data and then 
calls vegdist to calculate his Gower similarity index.

Bob
#
On 18 Apr 2005, at 20:36, Anon. wrote:

Well, ecologists have plenty of missing species in their community
data, but these have zero values since they were not observed. I guess
some Bob O'Hara is going to have a paper about this in JAE.
The "A" in "JAE" stands for "Animal": for real things they still have
Journal of Ecology.
The looping happens within the C code, so wrapping is difficult for
pairwise deletion of missing values. With complete.cases this is trivial
(and then your result would be more metric as well).
--
Jari Oksanen, Oulu, Finland
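
As a rough illustration of the complete.cases route mentioned above, the sketch below drops every object (row) that has any missing descriptor before computing distances. The helper name drop_incomplete and the toy matrix are made up for this example. The vegdist() call runs only if vegan is installed, and it assumes a vegan version whose vegdist() offers method = "gower".

```r
# Listwise deletion: keep only objects (rows) with no missing descriptors.
# drop_incomplete is a hypothetical helper name.
drop_incomplete <- function(mat) {
  mat[complete.cases(mat), , drop = FALSE]
}

mat <- rbind(a = c(1, 2, 3),
             b = c(4, NA, 6),
             c = c(7, 8, 9))
clean <- drop_incomplete(mat)   # row "b" is removed

# vegdist() returns a dissimilarity, so a similarity is 1 - distance.
if (requireNamespace("vegan", quietly = TRUE)) {
  sim <- 1 - as.matrix(vegan::vegdist(clean, method = "gower"))
}
```

Note the cost of this approach: a single NA removes the whole object, so with roughly 100 missing values scattered over 185 rows a large share of the data could be discarded.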
#
    Tyler> Hello, I am a relatively new user of R. I have
    Tyler> written a basic function to calculate the Gower
    Tyler> similarity function. I was motivated to do so partly
    Tyler> as an exercise in learning R, and partly because the
    Tyler> existing option (vegdist in the vegan package) does
    Tyler> not accept missing values.

I don't know what exactly you want.

The function daisy() in the recommended package "cluster"
has always worked with missing values and, IIRC, the book by
Kaufman & Rousseeuw {which I do not have at hand here at home}
clearly credits Gower as the origin of their distance measure
definition.

Martin Maechler, maintainer of cluster package,
ETH Zurich
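
For readers who want to try daisy() on data with missing values, here is a minimal sketch. It assumes a recent version of cluster in which daisy() accepts metric = "gower" (the version current when this thread was written may have behaved differently), and the toy matrix is made up. daisy() returns dissimilarities, so a similarity is taken as 1 minus the result.

```r
# Toy data: 3 objects, 3 quantitative descriptors, 2 missing values
mat <- rbind(c(1, 10, NA),
             c(2, NA,  5),
             c(3, 30,  7))

# Guarded so the example is skipped when cluster is not installed.
if (requireNamespace("cluster", quietly = TRUE)) {
  d <- cluster::daisy(mat, metric = "gower")  # object of class "dissimilarity"
  sim_daisy <- 1 - as.matrix(d)               # convert to a similarity matrix
}
```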


    Tyler> -- Tyler Smith

    Tyler> PhD Candidate Plant Science Department McGill
    Tyler> University

    Tyler> tyler.smith at mail.mcgill.ca

#
Quoting Martin Maechler <maechler at stat.math.ethz.ch>:

The Gower coefficient I am referring to comes from his 1971 article in
Biometrics (27(4):857-871). It differs from most commonly used measures (but
not, apparently, daisy!) by allowing the incorporation of quantitative and
qualitative (binary or unordered multistate characters) variables, and also by
providing a mechanism for dropping missing values from similarity calculations.
This is also covered in Legendre and Legendre.

I was unaware of the daisy function. Looking over it now, it differs from the
Gower coefficient primarily in the method of standardization. Gower
standardized each variable by dividing it by its range ("ranging"), whereas
daisy does a more conventional standardization (-mean and /SD). As I understand
it, there isn't much to recommend standardizing over ranging (or vice versa), so
daisy may provide a useful alternative for my project. I'll have to look into
it!

Thanks,

Tyler
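
To make the distinction between "ranging" and mean/SD standardization concrete, here is a small sketch on a made-up vector. It only illustrates the two transformations as described in the message above. Which one daisy() actually applies depends on its arguments and version, so check its help page rather than taking this as a description of daisy().

```r
x <- c(2, 4, 6, 10)   # made-up values of one descriptor

# Ranging: differences are divided by the range, so values fall in [0, 1].
# Subtracting min(x) does not change pairwise differences.
ranged <- (x - min(x)) / diff(range(x))

# Conventional standardization: subtract the mean, divide by the SD.
zscore <- (x - mean(x)) / sd(x)

# Largest pairwise difference under each scaling
max_diff_ranged <- max(dist(ranged))   # always 1 after ranging
max_diff_zscore <- max(dist(zscore))   # depends on the data
```

After ranging, the largest possible difference on any descriptor is 1, which is what keeps each partial similarity (1 minus the scaled difference) between 0 and 1.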