read.gal() for large data sets
On Tue, 23 Aug 2011, Juta Kawalerowicz wrote:
Dear List,
I am looking for a general strategy for the following problem. I have a
large data set with 200 000 rows and 50 variables. Along the lines of
Anselin's R workbook I have used GeoDa to create weights (a 200 MB file),
but when I try to read them into R using
library(spdep)
weights <- read.gal("weights.gal")
it does not seem to work (or maybe I should wait for more than one
hour?).
read.gal() is quite complicated inside, because the IDs used may not be the integers 1:n, so it needs to read in the data and manipulate it a good deal. I think that you also have many neighbours: if you have a 200MB file and use integers 1:200000, taking 7 characters per integer on average, you have almost 1500 neighbours per observation. This is far from sparse. I suggest generating the neighbour object in R directly if you can, as the smoothing effect of such a large average number of neighbours may be very powerful, and may not represent the spatial process adequately. Depending on what you want to do with the data, you may prefer a graph-based or kNN approach.
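For example, a minimal sketch of building the neighbour object directly with spdep, assuming the point coordinates are available as a two-column planar matrix (called coords here only for illustration) and choosing k = 6 arbitrarily:

library(spdep)

## k-nearest-neighbour scheme: every observation gets exactly k neighbours
knn_nb <- knn2nb(knearneigh(coords, k = 6))

## a graph-based alternative: Gabriel graph neighbours, made symmetric
gab_nb <- graph2nb(gabrielneigh(coords), sym = TRUE)

## row-standardised spatial weights for use in tests or models
lw <- nb2listw(knn_nb, style = "W")

Either gives a handful of neighbours per observation rather than hundreds; for geographical coordinates, knearneigh() also takes a longlat argument.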
My computer runs on an i7-2630QM CPU with 4 GB RAM. Any suggestions? In principle, could somebody advise me on strategies for spatial analysis of large data sets?
4GB is not large; most newer machines can run in 64-bit mode and handle much more without trouble, so this sounds like a standard laptop. I don't think that there is an obvious answer to your question, as approaches will vary greatly depending on what kind of analysis you want to do, and at least partly on whether the data are planar or use geographical (unprojected) coordinates, forcing the use of Great Circle distances.

If your analysis is embarrassingly parallelisable and you have plenty of memory, you can use all your cores at once; you need more memory because each core uses the data and in most systems needs its own copy of part of the data set. One copy of your data as a matrix is about 80MB in the R workspace, which isn't large as such; the "lm" object from regressing one column on the others is about 200MB, but can be made smaller.

Whether, say, Moran's I of 200000 observations tells a great deal is another matter; it depends on the problem you are analysing and how you have set out your model.

Hope this clarifies,

Roger
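To put rough numbers on the sizes mentioned above, a minimal sketch, with the matrix simulated here purely for illustration (nothing else is assumed about the actual data):

X <- matrix(rnorm(200000 * 50), ncol = 50)
print(object.size(X), units = "MB")    # roughly 80 MB for one copy of the data

df <- as.data.frame(X)
fit <- lm(V1 ~ ., data = df)           # stores a copy of the model frame inside the fit
print(object.size(fit), units = "MB")

## model = FALSE omits the stored model frame, making the fitted object smaller
fit_small <- lm(V1 ~ ., data = df, model = FALSE)
print(object.size(fit_small), units = "MB")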
Thanks, Juta
Roger Bivand
Department of Economics, NHH Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no