GStat 'predict.c' parallel estimation
On 27/08/15 16:55, Edzer Pebesma wrote:
On 08/27/2015 02:45 AM, Tim Peterson wrote:
Hi all, I'm considering editing the GStat kriging file 'predict.c' (while loop at line 151-165) for parallel estimation. I require parallel estimation for both kriging and indicator simulation and will be applying it to a region of ~30 million grid cells, which will be repeated for 300 time points. I appreciate that snow can be used for parallel kriging (https://r-forge.r-project.org/scm/viewvc.php/pkg/demo/snow.R?view=markup&revision=89&root=gstat) but, because of my computation burden, I require parallel calculation using Xeon Phi cards - hence I think the changes should be in 'predict.c'. Also, in looking at the GStat C-code I suspect that the global variables in 'glvars.c' may complicate the parallel calculations. So anyway, before I embark upon this task (most likely using openMP), I'd be grateful if anyone could share tips or thoughts (and my edits will be freely shared.)
I've always thought of these global variables as being constant during program execution. R being non-thread safe, I'm not sure how well R combines with embedded C code that uses threads.
Thanks again Edzer. That's good to know about the global variables. Re the threaded C code working with R, my understanding was that the while loop undertaking the estimation is not interacting with R and so can be made parallel without (hopefully) influencing R. Does this sound reasonable?
Also, some may be questioning the parallel estimation of each simulation. Considering that I'm estimating ~30 million cells, and will only be using ~70 cores per simulation, I am happy to accept the deficiency that recently estimated cells will not be used in the estimation of nearby cells within a concurrent parallel cycle. However, the estimates do need to be accessible to latter parallel cycles.
I can see that a shared memory approach could mess up the set of
previously simulated values ("msim" in msim.c), and that filling in
locally could be done in independent threads; yet, what do you do at the
boundaries of these regions?
To clarify, the best approach I see for simulations is to randomly select, say, 10*"number of cores" grid cells and undertake their parallel estimation. These solutions are then added back into the "msim" and another 10*"number of cores" grid cells are then selected and estimated. This would overcome the boundary effects but would mean that the cores may not be running at 100% all of the time (ie all 10*"number of cores" cells would need to finish before the next batch could start)
As an alternative to this approach (and the one you sketch below) you could think about a circulant embedding algorithm that uses a parallel version of the fft; see fields:sim.rf and the routines in package RandomFields.
Thanks. I didn't know about circulant embedding. However, my problem is high non-bi-Gaussian. The extreme values within the system (which is water table elevation) are significantly more correlated than 'average' values.
Lastly, my backup option is to use openMP to parallelise the GSLib Fortran subroutines 'cokb3d.for' and 'sisim.for' (ie outside of R). This is mainly because the subroutines are fairly standalone and don't use pointers - and hence possibly less problematic to make parallel. Many thanks, Tim ---------------------- Dr. Tim Peterson The Department of Infrastructure Engineering The University of Melbourne, 3010 Australia T: +61 3 8344 9950, M: +61 0438 385 937 Dept. profile : http://www.ie.unimelb.edu.au/people/staff.php?person_ID=141135 Research Gate : https://www.researchgate.net/profile/Tim_Peterson7 Google Scholar: http://scholar.google.com.au/citations?user=kkYJLF4AAAAJ&hl=en&oi=ao
_______________________________________________ R-sig-Geo mailing list R-sig-Geo at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-geo
_______________________________________________ R-sig-Geo mailing list R-sig-Geo at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-geo