
skater - spdep runtime - geographic territories

3 messages · Salo V, Elias T. Krainski

#
Hi Everyone,

I am trying to run the skater function for graph partitions, part of the
spdep package. My goal is to create contiguous territories for the entire
USA at the ZIP Code level.

The function takes a very long time to run even for ~15% of my total areas.
I am looking to run this for the 30,000 ZIP Codes in the USA.

The skater function documentation gives an example of parallel processing,
but it doesn't seem to speed things up. I have a Windows laptop with
2 physical cores and 4 logical cores. In the code below, I have already
tried setting nc = 1, nc = 2 and nc = 4, all with very similar run times.

Has anyone been able to run the skater function for a large number of areas
in a reasonable amount of time? I would really appreciate any guidance on
this; perhaps I am missing some steps.



Here is the example from the documentation, which I am also running:

library(parallel)

nc <- detectCores(logical=FALSE)
# set nc to 1L here
if (nc > 1L) nc <- 1L
coresOpt <- get.coresOption()
invisible(set.coresOption(nc))
if (!get.mcOption()) {
  # no-op, "snow" parallel calculation not available
  cl <- makeCluster(get.coresOption())
  set.ClusterOption(cl)
}

### calculating costs
system.time(plcosts <- nbcosts(bh.nb, dpad))
# lcosts: the serial costs computed earlier in the documentation example
all.equal(lcosts, plcosts, check.attributes=FALSE)

### making listw
pnb.w <- nb2listw(bh.nb, plcosts, style="B")

### find a minimum spanning tree
pmst.bh <- mstree(pnb.w, 5)

### three groups with no restriction
system.time(pres1 <- skater(pmst.bh[,1:2], dpad, 2))

if (!get.mcOption()) {
  set.ClusterOption(NULL)
  stopCluster(cl)
}
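For reference, the full pipeline for a contiguity-based partition of areal units can be sketched as below. This is a hypothetical outline only, not code from the thread: `zips` (an sf polygon layer of ZIP Code areas) and `zdata` (a data frame of the attributes to cluster on) are placeholder names for objects that would have to be supplied.

```r
library(spdep)
library(sf)

# zips:  hypothetical sf polygon layer of ZIP Code areas (not provided here)
# zdata: hypothetical data frame of attributes used to measure dissimilarity
zip.nb <- poly2nb(zips, queen = TRUE)            # contiguity neighbours
zcosts <- nbcosts(zip.nb, zdata)                 # dissimilarity along each edge
zip.w  <- nb2listw(zip.nb, zcosts, style = "B")  # binary-weighted listw
zmst   <- mstree(zip.w)                          # minimum spanning tree
zres   <- skater(zmst[, 1:2], zdata, ncuts = 9)  # 9 cuts -> 10 territories
table(zres$groups)                               # territory sizes
```

Note that each cut of the spanning tree adds one territory, so `ncuts = k` yields k + 1 contiguous groups.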


Much appreciated!
#
Hi Salo,

I implemented it several years ago and it is not optimal in some ways.
I will update it in the near future to add a heuristic that avoids the
exhaustive search it currently performs. For now, you can get a
significant runtime reduction by using an alternative function to
compute the SSW, because the default way of computing it uses a lot of
memory and performs badly on big datasets.

Please consider the attached code, which illustrates this. When using
the ssdfun() I saw a reduction factor of around 4 for n = 2k, and an
additional reduction factor of 1.6 from using two (physical) cores.
These are the results I got on my laptop:

        n  t1  t2  t3  t4
15    225   1   1   1   1
20    400   1   1   1   1
25    625   4   3   3   2
30    900  10   5   6   4
35   1225  21   8  13   5
40   1600  39  12  23   8
45   2025  86  24  50  15
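A minimal sketch of how such a comparison could be timed, assuming the `mst.bh` and `dpad` objects from the skater() documentation example and the `ssdfun()` given later in the thread are in scope (`t.default` and `t.ssdfun` are illustrative names, not from the attachment):

```r
# Time skater() with the default SSW computation vs. the ssdfun() variant;
# system.time() reports elapsed time in seconds.
t.default <- system.time(skater(mst.bh[, 1:2], dpad, 2))
t.ssdfun  <- system.time(skater(mst.bh[, 1:2], dpad, 2, method = ssdfun))
t.default["elapsed"] / t.ssdfun["elapsed"]   # speed-up factor
```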

best regards,

Elias
On 6/11/19 5:21 PM, Salo V wrote:
[Attachment: elapsed-time-ssdfun.R, text/x-r-source, 2521 bytes:
<https://stat.ethz.ch/pipermail/r-sig-geo/attachments/20190611/bc768363/attachment.bin>]
#
Hi Salo,

In the file attached to my previous email, the ssdfun() has to be
replaced by the following in order to give the same results as the
default option in skater():

ssdfun <- function(d, i)
    sum(sqrt(colSums((t(d[i, , drop=FALSE]) -
                      colMeans(d[i, , drop=FALSE]))^2)))

So, the recommendation is to use skater(..., method=ssdfun).
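As a self-contained illustration (using a small made-up matrix, not data from the thread), ssdfun() returns the sum of Euclidean distances from each selected row of `d` to the column means of those rows, i.e. the within-group dispersion around the group centre:

```r
ssdfun <- function(d, i)
    sum(sqrt(colSums((t(d[i, , drop = FALSE]) -
                      colMeans(d[i, , drop = FALSE]))^2)))

d <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)
# column means of the three rows are (3, 4); the rows' distances to
# that centre are 5, 0 and 5, so the sum is 10
ssdfun(d, 1:3)  # 10
```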

Best regards,

Elias