Skip to content

Slow writing of point features to SpatialLite-DB or Geopackage

4 messages · Manuel Schneider, Roger Bivand, Loïc Dutrieux +1 more

#
Dear list

I am searching alternatives to ESRI shapefiles for the storage of GPS data, i.e. tagged point features, and came across SpatialLite or Geopackage. Unfortunately writing to both formats is very slow compared to shapefiles making practical use impossible.

library(sf)
library(rgdal)
library(RSQLite)

n<- 1000
d <-data.frame(a=1:n, X=rnorm(n,1,1), Y=rnorm(n,1,1))
mp1 <- st_as_sf(d, coords=c("X","Y"))

t1 <- system.time(st_write(mp1, dsn = 'C:/Temp/data1.shp', driver = 'ESRI Shapefile'))
t2 <- system.time(st_write(mp1, dsn = 'C:/Temp/test.sqlite', layer = 'data1', driver = 'SQLite'))
t3 <- system.time(st_write(mp1, "C:/Temp/data1.gpkg"))

rbind(t1,t2,t3)[,1:3]

   user.self sys.self elapsed
t1      0.03     0.03    0.09
t2      0.53     5.04   29.33
t3      0.48     4.29   32.19

As n increases, processing time explodes for SpatialLite and Geopackage, and I usually have a couple of 10000 points to store. Any experiences of others would be highly appreciated.
Many thanks
Manuel


------
R version 3.4.1 (2017-06-30)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252   
[3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                       
[5] LC_TIME=German_Switzerland.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sf_0.5-3    RSQLite_2.0 rgdal_1.2-8 sp_1.2-5   

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.12    lattice_0.20-35 digest_0.6.12   grid_3.4.1      DBI_0.7        
 [6] magrittr_1.5    units_0.4-5     rlang_0.1.2     blob_1.1.0      tools_3.4.1    
[11] udunits2_0.13   bit64_0.9-7     bit_1.1-12      compiler_3.4.1  memoise_1.1.0  
[16] tibble_1.3.4
#
On Thu, 24 Aug 2017, manuel.schneider at agroscope.admin.ch wrote:

            
Fedora 26 64-bit:

n 1000
user.self sys.self elapsed
t1     0.007    0.001   0.010
t2     0.067    0.035   0.103
t3     0.029    0.042   0.073

n 25000
user.self sys.self elapsed
t1     0.120    0.032   0.153
t2     0.412    0.829   1.247
t3     0.645    0.834   1.487

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Fedora 26 (Workstation Edition)
other attached packages:
[1] sf_0.5-3

loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5   tools_3.4.1    DBI_0.7
   units_0.4-5
[6] Rcpp_0.12.12   udunits2_0.13  grid_3.4.1

There is no need to load rgdal or RSQLite, neither are needed or used. For 
portability use tempdir():

t1 <- system.time(st_write(mp1, dsn = paste0(td, 'data1.shp')))
t2 <- system.time(st_write(mp1, dsn = paste0(td, 'test.sqlite'), layer = 
'data1', driver = 'SQLite'))
t3 <- system.time(st_write(mp1, paste0(td, 'data1.gpkg')))

Maybe an order of magnitude difference because the databases need 
initialising, but nothing like your scale; does 32/64 bit make a 
difference?

I'm assuming that you installed sf as a Windows binary from CRAN?

Consider using a github issue when others have tried tis out on other 
platforms.

Roger

  
    
#
On 24/08/17 10:23, Roger Bivand wrote:
I also get large differences on ubuntu 16.04 64-bits with ssd; 
particularly when writing a second layer to an existing geopackage

library(sf)

n <- 1000
d <- data.frame(a=1:n, X=rnorm(n,1,1), Y=rnorm(n,1,1))
mp1 <- st_as_sf(d, coords=c("X","Y"))

td <- tempdir()
file.remove(list.files(td, full.names = TRUE))

t1 <- system.time(st_write(mp1, dsn = file.path(td, 'data1.shp'), driver 
= 'ESRI Shapefile'))
t2 <- system.time(st_write(mp1, dsn = file.path(td, 'data2.sqlite'), 
layer = 'layer1', driver = 'SQLite'))
t3 <- system.time(st_write(mp1, dsn = file.path(td, 'data2.sqlite'), 
layer = 'layer2', driver = 'SQLite'))
t4 <- system.time(st_write(mp1, dsn = file.path(td, 'data3.gpkg'), layer 
= 'layer1'))
t5 <- system.time(st_write(mp1, dsn = file.path(td, 'data3.gpkg'), layer 
= 'layer2'))

rbind(t1,t2,t3,t4,t5)[,1:3]

    user.self sys.self elapsed
t1     0.012    0.000   0.010
t2     0.180    0.456   8.993
t3     0.220    0.460  10.637
t4     0.016    0.064   0.082
t5     0.200    0.472   9.199

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

other attached packages:
[1] sf_0.5-3     raster_2.5-8 sp_1.2-4

loaded via a namespace (and not attached):
  [1] compiler_3.4.0  magrittr_1.5    DBI_0.6-1       tools_3.4.0 
units_0.4-5     yaml_2.1.14     Rcpp_0.12.10    udunits2_0.13 
grid_3.4.0      lattice_0.20-35

Cheers,
Lo?c
#
According to http://gdal.org/drv_sqlite.html you can unset
OGR_SQLITE_SYNCHRONOUS and it should improve performance.  So, use

Sys.setenv(OGR_SQLITE_SYNCHRONOUS="OFF")

in your script/session before you wish to call writes.?  The SQLite and
GPKG writes are still 11x and 13x respectively slower than shapefile writes
but not terribly bad.  Below is what resulted in these numbers.  Output
from my .Rout file follows:
Linking to GEOS 3.5.1, GDAL 2.2.0, proj.4 4.9.2
logical(0)
+   if (file.exists(dsn)) unlink(dsn)
+   st_write(dat,dsn=dsn, ...)
+ }
+   stwrite(mp1, dsn=file.path(td,'data1.shp'), driver='ESRI Shapefile'),
+   stwrite(mp1, dsn=file.path(td,'data2.sqlite'), layer="layer1",
driver='SQLite'),
+   stwrite(mp1, dsn=file.path(td,'data2.sqlite'), layer="layer2",
driver='SQLite'),
+   stwrite(mp1, dsn=file.path(td,'data3.gpkg'), layer="layer1"),
+   stwrite(mp1, dsn=file.path(td,'data3.gpkg'), layer="layer2"),
+   replications=5
+ )


test
1              stwrite(mp1, dsn = file.path(td, "data1.shp"), driver =
"ESRI Shapefile")
2 stwrite(mp1, dsn = file.path(td, "data2.sqlite"), layer = "layer1",
driver = "SQLite")
3 stwrite(mp1, dsn = file.path(td, "data2.sqlite"), layer = "layer2",
driver = "SQLite")
4                      stwrite(mp1, dsn = file.path(td, "data3.gpkg"),
layer = "layer1")
5                      stwrite(mp1, dsn = file.path(td, "data3.gpkg"),
layer = "layer2")
  replications elapsed relative user.self sys.self user.child sys.child
1            5   0.034    1.000     0.028    0.008          0         0
2            5   0.374   11.000     0.060    0.312          0         0
3            5   0.359   10.559     0.084    0.272          0         0
4            5   0.454   13.353     0.192    0.260          0         0
5            5   0.451   13.265     0.172    0.280          0         0
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.3 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] rbenchmark_1.0.0 sf_0.5-4

loaded via a namespace (and not attached):
[1] compiler_3.4.1 magrittr_1.5   DBI_0.7        units_0.4-5
Rcpp_0.12.12
[6] udunits2_0.13  grid_3.4.1
user  system elapsed
  1.436   1.340   2.791




On Thu, Aug 24, 2017 at 5:44 PM, Lo?c Dutrieux <loic.dutrieux at conabio.gob.mx