On my computer, too, reshape did not work for large datasets. I got around the problem by writing my own program using loops. This takes a few hours for a non-expert like me (my code is slow and not portable, but it works...). Another possibility that came to mind would be to run the data transformation in, for example, SAS and re-export the data to R. I suspect that the R data transformation procedures (like reshape) are not the most efficient ones.

Frank
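One loop-free alternative to reshape, and one possible answer to the sparse-matrix question in the digest quoted below, is to build the wide table directly from the row and column indices. This is only a minimal sketch on toy data: the column names location and species are assumptions (the thread does not name them), and it relies on sparseMatrix() from the Matrix package that ships with current R installations.

```r
library(Matrix)  # recommended package distributed with R

# Toy long-format presence data; the column names are assumptions
long <- data.frame(
  location = c("A", "A", "B", "C", "C", "C"),
  species  = c("sp1", "sp2", "sp2", "sp1", "sp3", "sp3")  # last record is a duplicate
)

# Drop duplicate records so that every stored cell is exactly 1
long <- unique(long)

loc <- factor(long$location)
spp <- factor(long$species)

# One row per location, one column per species; only the presences are stored
wide <- sparseMatrix(
  i = as.integer(loc),
  j = as.integer(spp),
  x = rep(1, nrow(long)),
  dims = c(nlevels(loc), nlevels(spp)),
  dimnames = list(levels(loc), levels(spp))
)

print(wide)
```

If a clustering function needs an ordinary matrix, as.matrix(wide) converts the result to dense form.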
r-sig-ecology-request at r-project.org wrote:
Today's Topics:

1. Clustering large data (ONKELINX, Thierry)
2. Re: Clustering large data (tyler)
3. Re: Clustering large data (Peter Solymos)
4. Re: Clustering large data (Farrar.David at epamail.epa.gov)
5. Re: Clustering large data (Christian A. Parker)
6. Re: Clustering large data (Farrar.David at epamail.epa.gov)
7. Re: Clustering large data (Brian Campbell)
8. Re: Clustering large data (Christian A. Parker)
9. Mortality anslisis (Marcelo Luiz de Laia)

----------------------------------------------------------------------

Message: 1
Date: Tue, 7 Oct 2008 12:12:28 +0200
From: "ONKELINX, Thierry" <Thierry.ONKELINX at inbo.be>
Subject: [R-sig-eco] Clustering large data
To: <r-sig-ecology at r-project.org>
Message-ID: <2E9C414912813E4EB981326983E0A10405903F59 at inboexch.inbo.be>
Content-Type: text/plain; charset="us-ascii"

Dear all,

We have a problem with a large dataset that we want to cluster. The dataset is in a long format: 1154024 rows with presence data. Each row has the name of the species and the location. We have 1381 species and 6354 locations.

The main problem is that we need the data in wide format (one row for each location, one column for each species) for the clustering algorithms. But the 6354 x 1381 dataframe is too big to fit into the memory. At least when we use cast from the reshape package to convert the dataframe from a long to a wide format.

Are there any clustering tools available that can work with the data in a long format or with sparse matrices (only 13% of the matrix is non-zero)? If they work with sparse matrices: how to convert our dataset to a sparse matrix? Other suggestions are welcome.
We are working with R 2.7.2 on WinXP with 2 GB RAM. --max-mem-size is set to 2047M.

Thanks,

Thierry

----------------------------------------------------------------------------
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics, methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium
tel. + 32 54/436 185
Thierry.Onkelinx at inbo.be
www.inbo.be

To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher

The plural of anecdote is not data. ~ Roger Brinner

The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey

The views expressed in this message and any annex are purely those of the writer and may not be regarded as stating an official position of INBO, as long as the message is not confirmed by a duly signed document.

------------------------------

Message: 2
Date: Tue, 07 Oct 2008 09:35:39 -0300
From: tyler <tyler.smith at mail.mcgill.ca>
Subject: Re: [R-sig-eco] Clustering large data
To: r-sig-ecology at r-project.org
Message-ID: <87zllg7fc4.fsf at blackbart.sedgenet>
Content-Type: text/plain; charset=us-ascii

"ONKELINX, Thierry" <Thierry.ONKELINX at inbo.be> writes:

> Dear all,
>
> We have a problem with a large dataset that we want to cluster. The dataset is in a long format: 1154024 rows with presence data. Each row has the name of the species and the location. We have 1381 species and 6354 locations.
> The main problem is that we need the data in wide format (one row for each location, one column for each species) for the clustering algorithms. But the 6354 x 1381 dataframe is too big to fit into the memory. At least when we use cast from the reshape package to convert the dataframe from a long to a wide format.
>
> Are there any clustering tools available that can work with the data in a long format or with sparse matrices (only 13% of the matrix is non-zero)? If they work with sparse matrices: how to convert our dataset to a sparse matrix? Other suggestions are welcome.

6354 x 1381 should be well within your memory limit, so I assume it's the intermediate steps that are fouling you up. Maybe you can do it in pieces:

1. subset the original two-column matrix to include only the first 100 sites
2. convert this subset to wide form
3. repeat 63 times for different subsets
4. rbind the resulting matrices

Good luck,

Tyler
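The four steps above might be sketched in base R roughly as follows. This is an illustration on toy data, not Tyler's own code: the data frame long, its column names location and species, and the chunk size of 2 are all assumptions made for the example (Tyler suggests about 100 sites per piece).

```r
# Toy long-format presence data; the column names are assumptions
long <- data.frame(
  location = c("A", "A", "B", "C", "C", "D"),
  species  = c("sp1", "sp2", "sp2", "sp1", "sp3", "sp2")
)

sites      <- sort(unique(as.character(long$location)))
spp_levels <- sort(unique(as.character(long$species)))
chunk_size <- 2  # use something like 100 for the real data

# Step 1: split the site names into consecutive groups of chunk_size
chunks <- split(sites, ceiling(seq_along(sites) / chunk_size))

# Steps 2 and 3: convert each subset to wide form
pieces <- lapply(chunks, function(s) {
  sub <- long[long$location %in% s, ]
  # Fixing the factor levels gives every piece the same species columns
  tab <- table(factor(sub$location, levels = s),
               factor(sub$species,  levels = spp_levels))
  m <- unclass(tab) > 0          # logical presence/absence matrix
  storage.mode(m) <- "integer"   # store as 0/1 integers
  m
})

# Step 4: stack the pieces into one locations-by-species matrix
wide <- do.call(rbind, unname(pieces))
print(wide)
```

As a rough check on the memory estimate: 6354 x 1381 is 8,774,874 cells, which is about 35 MB as integers or about 70 MB as doubles.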