Problem with size of dataset

5 messages · Poizot Emmanuel, Tomislav Hengl, Pierre Roudier

#
Dear all,

I would like to perform a geostatistical analysis using R.
To do so, I classically use the geoR or gstat packages.
In the present case, I have a dataset of around 157,000 locations where I
have a value (depth).
I have not been able to create a valid geodata or gstat R object,
apparently because of the size of the dataset.
Does anybody have an idea of how to conduct such a study with a dataset
of this size?
Regards
#
Hi Emmanuel,

This is surprising, as I am currently working on a dataset with 215,000+
locations. I managed to create an sp object (SpatialPointsDataFrame)
seamlessly. I can also use it in gstat (while time-consuming) for
analyses such as variograms.
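For reference, a minimal sketch of how such an object can be built with sp (the column names and sizes here are illustrative stand-ins, not my actual data):

```r
# Illustrative only: build a SpatialPointsDataFrame from a large x/y/depth table
library(sp)

set.seed(1)
d <- data.frame(x = runif(215000), y = runif(215000), depth = rnorm(215000))

# promote the data.frame to a SpatialPointsDataFrame in place
coordinates(d) <- ~ x + y
class(d)   # "SpatialPointsDataFrame"
```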

Do you get any error message?

Pierre
#
Pierre Roudier wrote:
Hi Pierre,

when I try to load the dataset with the following command:

g.fond <- as.geodata(nav.reflector.corrected[, 1:3])

where "nav.reflector.corrected" is a data frame with x, y, and depth
columns respectively, it gives the following message:

as.geodata: 215 points removed due to NA in the data
as.geodata: 4 redundant locations found
Error in vector("double", length) :
  the length chosen for the vector is too large

(Error message translated from French)
#
Dear Emmanuel,

I have the same problem. I either cannot run processing on large datasets in R, or I cannot even
load such data into R. And even when I can, any geostatistics takes forever. R (gstat/geoR) is
simply not as efficient with large spatial data as, e.g., GIS software.

What you can definitely try is to subset your point data randomly.
This will allow you to fit variograms etc.
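A minimal sketch of such a random subset in base R (the data frame below is a stand-in for the real x/y/depth table):

```r
# stand-in for the ~157k-point dataset (columns assumed: x, y, depth)
set.seed(1)
pts <- data.frame(x = runif(157000), y = runif(157000), depth = rnorm(157000))

# draw a random 5% of the rows; variogram fitting on this subset is fast
sub <- pts[sample(nrow(pts), size = round(0.05 * nrow(pts))), ]
nrow(sub)   # 7850
```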

Then, if you really want to interpolate all of your 157k points, you might consider using SAGA. For
example, you can also randomly subset points from a shapefile:

# too many points; subset to 5% ("Split Shapes Layer Randomly" in SAGA;
# the rsaga.geoprocessor wrapper and the SHAPES parameter name are reconstructed here):
rsaga.geoprocessor(lib="shapes_tools", module="Split Shapes Layer Randomly",
    param=list(SHAPES="points.shp", A="part_A.shp", B="part_B.shp", PERCENT=5))
# Learn more about geostatistics in SAGA:
rsaga.get.modules("geostatistics_kriging")
$geostatistics_kriging
   code                           name
1     0          Ordinary Kriging (VF)
2     1  Ordinary Kriging (VF, Global)
3     2         Universal Kriging (VF)
4     3 Universal Kriging (VF, Global)
5     4        Semivariogram (Dialog))
6     5               Ordinary Kriging
7     6      Ordinary Kriging (Global)
8     7              Universal Kriging
9     8     Universal Kriging (Global)
10   NA                           <NA>
11   NA                           <NA>

# Read the mask map:
# Ordinary kriging in SAGA (rsaga.geoprocessor call reconstructed; output grid name assumed):
rsaga.geoprocessor(lib="geostatistics_kriging", module=5, param=list(GRID="depth_ok.sgrd",
SHAPES="var.shp", BVARIANCE=F, BLOCK=F, FIELD=1, BLOG=F, MODEL=1, TARGET=0, NPOINTS_MIN=10,
NPOINTS_MAX=60, NUGGET=rvgm.Pb$psill[1], SILL=1.65, RANGE=1238, MAXRADIUS=50000,
USER_CELL_SIZE=cell.size, USER_X_EXTENT_MIN=gridmaps@bbox[1,1]+cell.size/2,
USER_X_EXTENT_MAX=gridmaps@bbox[1,2]-cell.size/2, USER_Y_EXTENT_MIN=gridmaps@bbox[2,1]+cell.size/2,
USER_Y_EXTENT_MAX=gridmaps@bbox[2,2]-cell.size/2))
# in the same way you can run regression-kriging/universal kriging

You will soon find out that there is a big difference in efficiency between SAGA and R: SAGA
will interpolate your 157k points within a few minutes or less. On the other hand, SAGA has very
limited geostatistical functionality (for example, it cannot fit variograms), so what you
really need is a combination of SAGA and R!

Here are more examples:
http://geomorphometry.org/view_scripts.asp?id=24 

HTH,

T. Hengl
http://spatial-analyst.net
#
I cannot reproduce Emmanuel's bug with my configuration: I loaded a
matrix of 216,149 observations into a geodata object (using geoR) and
also into a gstat object.

But as Tomislav noticed, this is pretty unusable, and even a simple
variogram plot takes a long time. I do subsample this dataset.
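For example, an empirical variogram on a random subsample with gstat (the object names and sizes here are illustrative):

```r
library(sp)
library(gstat)

# stand-in for the 216,149-point dataset
set.seed(1)
d <- data.frame(x = runif(216149), y = runif(216149), z = rnorm(216149))

# a few thousand points are usually enough to estimate the variogram
sub <- d[sample(nrow(d), 5000), ]
coordinates(sub) <- ~ x + y

v <- variogram(z ~ 1, sub)   # runs in seconds, unlike on the full dataset
head(v)
```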

For the record, here are the different versions of my configuration
(on a x86_64 linux box):
R : 2.9.1
geoR: 1.6-25

Hope it helps,

Pierre

2009/7/16 Tomislav Hengl <hengl@spatial-analyst.net>: