point.in.polygon() on massive datasets
On Thu, 13 Dec 2007, Roger Bivand wrote:
On Thu, 13 Dec 2007, Edzer Pebesma wrote:
A fix using polygon bounding box pre-selection of points is on r-spatial CVS. For Markus' problem, it should help a good deal - for larger numbers of points I'm seeing over an order of magnitude speedup (100,000 points on 100 polygons in about 10 seconds on an older box, about 100 on a much faster box). I'd take 5M point chunks, possibly by subsetting the points spatially. For Ingo's problem (many polygons), it may help directly, or perhaps taking subsets of polygons may still be needed. It still has to loop over the list of polygons; there is no easy way round this. An alternative might be using nearest neighbours to polygon centroids. Until this is tried out, the checkout is for a CVS copy of the source: http://sourceforge.net/cvs/?group_id=84357, module name sp. If you'd like a copy of the draft package as a source tarball or Windows binary, please let me know, because this needs to be compared against the actual problems before it is released.
Roger
overlay for points in polygons is a wrapper around point.in.polygon, the same that is used in gstat. What I miss in the R code (and C source code) is, e.g., a check against the bbox: testing whether the point lies within the bounding box of a polygon before running the full point-in-polygon test, even though the bbox is already computed.
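For context, the full test that the bbox check would short-circuit is of the crossing-number (ray casting) family. The sketch below is a hypothetical, standalone version of that standard algorithm, not the actual C source shipped in sp/gstat; boundary points are handled naively here.

```c
/* Crossing-number point-in-polygon test (hypothetical sketch, not the
 * sp/gstat source). Polygon vertices are (vx[i], vy[i]), i = 0..n-1.
 * Returns 1 if (px, py) is inside, 0 otherwise. */
int pip_crossing(double px, double py,
                 const double *vx, const double *vy, int n) {
    int i, j, inside = 0;
    for (i = 0, j = n - 1; i < n; j = i++) {
        /* Does edge (j, i) straddle the horizontal ray through py,
         * and does the crossing lie to the right of px? */
        if (((vy[i] > py) != (vy[j] > py)) &&
            (px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i]))
            inside = !inside;
    }
    return inside;
}
```

Since each call walks every vertex, a cheap bbox rejection before this loop is exactly where the pre-selection saves time.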
That's right - as it is there is an lapply() on the Polygon objects in
each Polygons object with hole handling inside an lapply() on the Polygons
objects in the SpatialPolygons object.
res <- lapply(slot(xx, "polygons"), bbox)
gets the bbox'es of the xx SpatialPolygons* object, and you could build a
fast searcher in C with code in spdep/src/insiders.c:
int between(double x, double low, double up) {
    if (x >= low && x <= up) return(1);
    else return(0);
}

int pipbb(double pt1, double pt2, double *bbs) {
    if ((between(pt1, bbs[0], bbs[2]) == 1) &&
        (between(pt2, bbs[1], bbs[3]) == 1)) return(1);
    else return(0);
}
in a compiled loop. You'd look to get a list 50M long with integer vectors
for candidate members, then only do p-in-p for the candidate Polygons
objects.
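The compiled candidate-building loop Roger describes might look roughly like this (a hypothetical sketch; real code would receive the bboxes extracted from the Polygons list via .Call, and would build ragged integer vectors rather than a dense matrix):

```c
/* between() and pipbb() as in spdep/src/insiders.c */
static int between(double x, double low, double up) {
    return (x >= low && x <= up);
}
static int pipbb(double pt1, double pt2, const double *bbs) {
    return between(pt1, bbs[0], bbs[2]) &&
           between(pt2, bbs[1], bbs[3]);
}

/* Hypothetical candidate filter: bbs holds npoly bounding boxes as
 * (xmin, ymin, xmax, ymax) quadruples; cand[i * npoly + k] is set to 1
 * if point i falls inside the bbox of polygon k. Only these candidate
 * pairs then need the full point-in-polygon test. */
void bbox_candidates(const double *x, const double *y, int npts,
                     const double *bbs, int npoly, int *cand) {
    for (int i = 0; i < npts; i++)
        for (int k = 0; k < npoly; k++)
            cand[i * npoly + k] = pipbb(x[i], y[i], bbs + 4 * k);
}
```

For 50M points a dense npts-by-npoly matrix would be far too big, so in practice each point's candidate list would be stored sparsely; the inner logic is the same.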
Roger
It would of course speed things up incredibly to use tree indexes, either on the points or on the polygons, or preferably both, but this is currently not in the sp code. A nice student project!
-- Edzer
Barry Rowlingson wrote:
Markus Loecher wrote:
Dear all, I have a dataset of about 50 million lat/lon coordinates, each of which falls into one of 550 polygons. I need to assign their memberships and have used point.in.polygon() for that purpose. However, the simple way of looping over the 50 million points clearly takes a looong time; 1 million points took about 3-4 days on a fast Linux server with lots of memory. Am I overlooking obvious ways of making this massive computation more efficient? Would R trees help? Should I try to compile the C code for point.in.polygon() (available from gstat) and run it outside R as a standalone executable? I am already using apply() to mitigate the inefficiency of the for loop in R. Any help would be greatly appreciated,
Have you tried the 'overlay' functions from the sp package? Overlaying points on polygons using those checks all the polygons for each point in one go, so it may do some spatial tree optimising... You might have to do your 50 million points in batches though... Barry
_______________________________________________ R-sig-Geo mailing list R-sig-Geo at stat.math.ethz.ch https://stat.ethz.ch/mailman/listinfo/r-sig-geo
Roger Bivand Economic Geography Section, Department of Economics, Norwegian School of Economics and Business Administration, Helleveien 30, N-5045 Bergen, Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43 e-mail: Roger.Bivand at nhh.no