
Clustering large data

12 messages · Peter Solymos, ONKELINX, Thierry, Farrar.David at epamail.epa.gov +5 more

#
Dear Thierry,

the 'mefa' package should do this, and I am also interested in the
testing of the package for such a large number of species. I have used
it before with 75K records, but only with ~160 species and 1052 sites.
So please let me know if it worked!

You can do the clustering like this (SAMPLES and SPECIES are the two
columns in the long format and must have the same length):

x <- mefa(stcs(data.frame(SAMPLES,SPECIES)))
cl <- hclust(dist(x$xtab))

Hope this works,

Peter

Peter Solymos, PhD
Department of Mathematical and Statistical Sciences
University of Alberta
Edmonton, Alberta, T6G 2G1
CANADA



On Tue, Oct 7, 2008 at 4:12 AM, ONKELINX, Thierry
<Thierry.ONKELINX at inbo.be> wrote:
2 days later
#
Dear all,

Thanks for your responses. The biggest problem seems to be cast() from
the reshape package, which could not handle the dataset. Peter's solution
using the mefa package worked fine. I found another solution: table(),
which works fine to crosstabulate presence-only data.

After crosstabulation I tried a few clustering methods. agnes(), diana()
and hclust() produced a solution; daisy() gave an out-of-memory error.
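For later readers, the table() route mentioned above can be sketched like this (toy data; SAMPLES and SPECIES are assumed column names, not from the thread):

```r
# Crosstabulate a long-format presence list with table(), then cluster
# the resulting samples-by-species incidence matrix. For presence-only
# data the cell counts are exactly the 0/1 incidence values.
long <- data.frame(SAMPLES = c("s1", "s1", "s2", "s3"),
                   SPECIES = c("a",  "b",  "a",  "c"))
xtab <- unclass(table(long$SAMPLES, long$SPECIES))  # plain integer matrix
cl <- hclust(dist(xtab))                            # samples clustered by composition
```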

A follow-up question: I'm looking at the group membership with cutree().
It gives me something like:
     2 3 4 5 6 7 8
[1,] 1 1 1 1 1 1 1
[2,] 1 1 1 2 2 2 2
[3,] 1 1 2 3 3 3 3
[4,] 1 1 2 3 3 3 8
[5,] 1 1 2 3 3 4 4
[6,] 2 2 3 4 4 5 5
[7,] 2 2 3 4 5 6 6
[8,] 2 3 4 5 6 7 7

But I'm looking for a binary or dendrogram like coding of the group
membership. That would be more convenient for mapping the group
membership.

[1,] 111
[2,] 110
[3,] 1011
[4,] 1010
[5,] 100
[6,] 011
[7,] 010
[8,] 00

Any suggestions on that?
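One way to derive such codes (a sketch, not from the thread; the 0/1-per-branch convention is arbitrary) is to recurse over the $merge matrix of the hclust object:

```r
# Label every leaf of an hclust tree with its 0/1 path from the root:
# "0" for the first child at each merge, "1" for the second.
path.codes <- function(hc) {
  codes <- character(nrow(hc$merge) + 1)      # one code per leaf
  walk <- function(node, prefix) {
    if (node < 0) {                           # negative entries in $merge are leaves
      codes[-node] <<- prefix
    } else {                                  # positive entries index earlier merges
      walk(hc$merge[node, 1], paste0(prefix, "0"))
      walk(hc$merge[node, 2], paste0(prefix, "1"))
    }
  }
  walk(nrow(hc$merge), "")                    # the last merge row is the root
  names(codes) <- hc$labels
  codes
}

hc <- hclust(dist(USArrests[1:8, ]))
codes <- path.codes(hc)    # one binary string per observation
```

Shorter codes correspond to leaves split off nearer the root, which matches the dendrogram-like coding sketched above.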

Thierry

----------------------------------------------------------------------------
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature
and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium 
tel. + 32 54/436 185
Thierry.Onkelinx at inbo.be 
www.inbo.be 

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to
say what the experiment died of.
~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data.
~ Roger Brinner

The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of
data.
~ John Tukey

-----Original message-----
From: r-sig-ecology-bounces at r-project.org
[mailto:r-sig-ecology-bounces at r-project.org] On behalf of Peter Solymos
Sent: Tuesday 7 October 2008 15:51
To: r-sig-ecology at r-project.org
Subject: Re: [R-sig-eco] Clustering large data
_______________________________________________
R-sig-ecology mailing list
R-sig-ecology at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

The views expressed in this message and any annex are purely those of the writer
and may not be regarded as stating an official position of INBO, as long as the
message is not confirmed by a duly signed document.
#
Exactly what error did you get?  Or did it just take a very long time
and then you gave up?  I have an experimental rewrite of the reshape
package that is more memory efficient and much faster (10 - 20x) -
however, it's still some time from being ready for production use.

Hadley
#
Hi Hadley,

R ran out of memory. I got the "can't allocate vector of xxx mb" type of
error.

I did something like this.

Dataset # a two-column (species, location) data frame read from a database:
        # 1157024 rows, 1381 species and 6354 locations
Dataset$value <- 1
library(reshape)
cast(data = Dataset, formula = species ~ location) # this gave the error
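Not from the thread, but for later readers: xtabs() can build the crosstab as a sparse Matrix, which avoids allocating the dense 6354 x 1381 table until a downstream method actually needs it (assumes the Matrix package is available; toy data below):

```r
# Sparse crosstabulation of a long-format table with xtabs(sparse = TRUE).
library(Matrix)                       # required for the sparse result
Dataset <- data.frame(species  = c("a", "a", "b"),
                      location = c("x", "y", "y"))
m <- xtabs(~ species + location, data = Dataset, sparse = TRUE)
dense <- as.matrix(m)                 # densify only if a method requires it
```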

Thierry


-----Original message-----
From: hadley wickham [mailto:h.wickham at gmail.com]
Sent: Friday 10 October 2008 14:40
To: ONKELINX, Thierry
CC: Peter Solymos; r-sig-ecology at r-project.org
Subject: Re: [R-sig-eco] Clustering large data
3 days later
#
Hi Hadley,

Here is a more elaborate report of what I did and what went wrong. The
example is not reproducible because the dataset is too large, and a smaller
dummy dataset is not an option since the code works with smaller datasets.
I'm willing to run the code again with a development version of reshape.

Cheers,

Thierry
Loading required package: plyr
sysname                      release 
                   "Windows"                         "XP" 
                     version                     nodename 
"build 2600, Service Pack 2"                 "LHPA000838" 
                     machine                        login 
                       "x86"           "thierry_onkelinx" 
                        user 
          "thierry_onkelinx"
R version 2.7.2 (2008-08-25) 
i386-pc-mingw32 

locale:
LC_COLLATE=Dutch_Belgium.1252;LC_CTYPE=Dutch_Belgium.1252;LC_MONETARY=Dutch_Belgium.1252;LC_NUMERIC=C;LC_TIME=Dutch_Belgium.1252

attached base packages:
[1] stats     graphics  grDevices datasets  tcltk     utils     methods

[8] base     

other attached packages:
[1] reshape_0.8.1  plyr_0.1       RODBC_1.2-3    svSocket_0.9-5 svIO_0.9-5
[6] R2HTML_1.59    svMisc_0.9-5   svIDE_0.9-5

loaded via a namespace (and not attached):
[1] tools_2.7.2
Location, TaxonFK AS Species FROM kmhok_periode2_selectie ORDER BY
KMhokcode, TaxonFK", as.is = TRUE)
[1] 1157024       3
[1] 6354
[1] 1381
= 0))
   user  system elapsed 
   0.11    0.00    0.17
= 0))
   user  system elapsed 
    1.7     0.0     1.7
fill = 0))
   user  system elapsed 
  46.42    0.45   47.02
Error: cannot allocate vector of size 33.5 Mb
Timing stopped at: 322.95 3.43 327.4
user  system elapsed 
   1.10    0.00    1.11 
 



-----Original message-----
From: r-sig-ecology-bounces at r-project.org
[mailto:r-sig-ecology-bounces at r-project.org] On behalf of hadley wickham
Sent: Friday 10 October 2008 14:40
To: ONKELINX, Thierry
CC: r-sig-ecology at r-project.org
Subject: Re: [R-sig-eco] Clustering large data
#
Hi all,

I have a related question concerning cluster analysis of large data 
sets.  In my case, the matrix is reasonably small for R to work with, 
but I have so many species (~2000) that it is not possible to read 
labels on the resulting dendrogram.  I imagine that using an 
ordination is a preferable method in this case, but I was wondering 
whether anyone had any recommendations for producing a very large, 
but still readable dendrogram.  (I've tried increasing the window
size and shrinking the labels with cex, but this still isn't sufficient.)

Cheers,
Phil


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   Phil Novack-Gottshall          pnovackg at westga.edu
   Assistant Professor
   Department of Geosciences
   University of West Georgia
   Carrollton, GA 30118-3100
   Phone: 678-839-4061
   Fax: 678-839-4071
   http://www.westga.edu/~pnovackg
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
Hi Phil,

I'd start by asking if you can categorize your species in some sort of  
meaningful manner.  If so you can color code the labels of your  
dendrogram.  The package 'ape' includes 'tip.color=...' as a parameter  
in plot.phylo().  Create a vector of colors or numbers and use that to  
parameterize tip.color.

For viewing the graphic I'd create a PDF. pdf() lets you set the page
size via its width and height arguments (in inches; the default is
7 x 7), so you can raise height to 14, 28, or more. The result can be
printed over several pages, or you can simply scroll through it on your
monitor.

Good luck!
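As a concrete sketch of the tall-PDF suggestion (file name and dimensions are illustrative, and USArrests stands in for a large species matrix):

```r
# Draw the dendrogram horizontally on a very tall PDF page and
# scroll/zoom in a viewer instead of printing.
hc <- hclust(dist(USArrests))
pdf("big-dendrogram.pdf", width = 8, height = 60)  # tall page for many tips
par(cex = 0.4)                                     # shrink the tip labels
plot(as.dendrogram(hc), horiz = TRUE)              # labels along the right edge
dev.off()
# With the 'ape' package one could additionally color the tips, e.g.:
# plot(ape::as.phylo(hc), tip.color = my.colors)   # 'my.colors' is hypothetical
```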
1 day later
#
Hi Thierry,

Thanks for the more detailed report.  I think the new version of
reshape will help, but I just checked and it's currently a total mess
and will need a lot of work before it's ready for anyone to try.
Unfortunately I'm unlikely to get to it until the ggplot2 book is
finished, so it might be a bit of a wait.

Hadley

On Tue, Oct 14, 2008 at 2:52 AM, ONKELINX, Thierry
<Thierry.ONKELINX at inbo.be> wrote:

  
    
9 days later
#
Thierry and Hadley,

     Sorry to be late coming into this (I forgot I subscribed to sig-eco).

     package labdsv has a function called matrify() which takes a
three-column data.frame (sample, taxa, abundance) and expands the sparse
representation into a full matrix.  I've never tried it on a data set as
large as yours, and I'm curious whether it would work.  It's pure R, but
if worst comes to worst I used to have a FORTRAN version that would
probably work. Please give matrify() a try and let me know.

Dave R.

matrify <- function (data)
{
     if (ncol(data) != 3)
         stop("data frame must have three column format")
     plt <- data[, 1]                  # sample/plot identifiers
     spc <- data[, 2]                  # taxon identifiers
     abu <- data[, 3]                  # abundances
     plt.codes <- levels(factor(plt))  # unique plots, sorted
     spc.codes <- levels(factor(spc))  # unique taxa, sorted
     taxa <- matrix(0, nrow = length(plt.codes), ncol =
              length(spc.codes))       # full matrix, zero-filled
     row <- match(plt, plt.codes)      # row index for each record
     col <- match(spc, spc.codes)      # column index for each record
     for (i in 1:length(abu)) {        # place each record in its cell
         taxa[row[i], col[i]] <- abu[i]
     }
     taxa <- data.frame(taxa)
     names(taxa) <- spc.codes
     row.names(taxa) <- plt.codes
     taxa
}
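A minimal usage check on a toy long-format data frame, assuming the labdsv package (which ships this same matrify()) is installed; the data values are illustrative:

```r
# matrify() expands three-column (plot, species, abundance) records
# into a full plots-by-species matrix, filling absences with zero.
library(labdsv)
long <- data.frame(plot    = c("p1", "p1", "p2"),
                   species = c("oak", "ash", "oak"),
                   abund   = c(3, 1, 5))
wide <- matrify(long)   # 2 plots x 2 species, zeros for absent taxa
```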
2 days later
#
Dear Dave,

Below you'll find a test report of your function. It works fine with my
dataset, although it is slower than the plain and simple table() function.
But that of course only works with presence-only data like I have.
On the other hand, it is three times faster than the mefa package.

HTH,

Thierry
+ {
+      if (ncol(data) != 3)
+          stop("data frame must have three column format")
+      plt <- data[, 1]
+      spc <- data[, 2]
+      abu <- data[, 3]
+      plt.codes <- levels(factor(plt))
+      spc.codes <- levels(factor(spc))
+      taxa <- matrix(0, nrow = length(plt.codes), ncol =
+               length(spc.codes))
+      row <- match(plt, plt.codes)
+      col <- match(spc, spc.codes)
+      for (i in 1:length(abu)) {
+          taxa[row[i], col[i]] <- abu[i]
+      }
+      taxa <- data.frame(taxa)
+      names(taxa) <- spc.codes
+      row.names(taxa) <- plt.codes
+      taxa
+ }
sysname                      release 
                   "Windows"                         "XP" 
                     version                     nodename 
"build 2600, Service Pack 2"                 "LHPA000838" 
                     machine                        login 
                       "x86"           "thierry_onkelinx" 
                        user 
          "thierry_onkelinx"
R version 2.8.0 (2008-10-20) 
i386-pc-mingw32 

locale:
LC_COLLATE=Dutch_Belgium.1252;LC_CTYPE=Dutch_Belgium.1252;LC_MONETARY=Dutch_Belgium.1252;LC_NUMERIC=C;LC_TIME=Dutch_Belgium.1252

attached base packages:
[1] stats     graphics  grDevices datasets  tcltk     utils     methods

[8] base     

other attached packages:
[1] RODBC_1.2-3    svSocket_0.9-5 svIO_0.9-5     R2HTML_1.59    svMisc_0.9-5
[6] svIDE_0.9-5

loaded via a namespace (and not attached):
[1] tools_2.8.0
Location, TaxonFK AS Species FROM kmhok_periode2_selectie ORDER BY
KMhokcode, TaxonFK", as.is = TRUE)
[1] 1157024       2
[1] 6354
[1] 1381
user  system elapsed 
   1.32    0.26    1.58
[1] 1157024       3
user  system elapsed 
  10.81    0.58   11.39
This is mefa 2.0-1
user  system elapsed 
  27.05    0.76   28.61

-----Original message-----
From: Dave Roberts [mailto:droberts at montana.edu]
Sent: Friday 24 October 2008 20:11
To: r-sig-ecology at r-project.org
CC: ONKELINX, Thierry
Subject: Re: [R-sig-eco] Clustering large data