Skip to content

Is a simple feature -friendly version of spdep being developed?

8 messages · Tiernan Martin, Roger Bivand, Michael Sumner +1 more

#
Is anyone thinking about creating an adaptation of the `spdep` package that
expects sf-class inputs and works well in a pipeline?

I understand that there is skepticism about the wisdom of adopting the
?tidyverse? principles throughout the R package ecosystem, and I share the
concern that an over-reliance on any single paradigm could reduce the
resilience and diversity of the system as a whole.

That said, I believe that the enthusiastic adoption of the `sf` package and
the package's connections with widely-used tidyverse packages like `dplyr`
and `ggplot2` may result in increased demand for sf-friendly spatial
analysis tools. As an amateur who recently started using R as my primary
GIS tool, it seems like the tidyverse's preference for dataframes, S3
objects, list columns, and pipeline workflows would be well-suited to the
field of spatial analysis. Are there some fundamental reasons why the
`spdep` tools cannot (or should not) be adapted to the tidyverse "dialect"?

Let me put the question in the context of an actual analysis: in February
2017, the pop culture infovis website The Pudding (https://pudding.cool/)
published an analysis of regional preferences for Oscar-nominated films in
the US (https://pudding.cool/2017/02/oscars_so_mapped/). A few days ago,
the author posted a tutorial explaining the method of ?regional smoothing?
used to create the article?s choropleths (
https://pudding.cool/process/regional_smoothing/).

The method relies on several `spdep` functions (
https://github.com/polygraph-cool/smoothing_tutorial/blob/master/smoothing_tutorial.R).
In the code below, I provide reprex with a smaller dataset included in the
`sf` package:

library(sf)
library(spdep)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))  # North
Carolina counties
nc_shp <- as(nc,'Spatial')

coords <- coordinates(nc_shp)
IDs<-row.names(as(nc_shp, "data.frame"))

knn5 <- knn2nb(knearneigh(coords, k = 5), row.names = IDs)  # find the
nearest neighbors for each county
knn5 <- include.self(knn5)

localGvalues <- localG(x = as.numeric(nc_shp at data$NWBIR74), listw =
nb2listw(knn5, style = "B"), zero.policy = TRUE) # calculate the G scores
localGvalues <- round(localGvalues,3)

nc_shp at data$LOCAL_G <- as.numeric(localGvalues)

p1 <- spplot(nc_shp, c('NWBIR74'))
p2 <- spplot(nc_shp, c('LOCAL_G'))
plot(p1, split=c(1,1,2,2), more=TRUE)
plot(p2, split=c(1,2,2,2), more=TRUE)

Here?s what I imagine that would look like in a tidyverse pipeline (please
note that this code is for illustrative purposes and will not run):

library(tidyverse)
library(purrr)
library(sf)
library(sfdep) # this package doesn't exist (yet)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))

nc_g <-
  nc %>%
  mutate(KNN = map(.x = geometry, ~ sfdep::st_knn(.x, k = 5, include.self =
TRUE)),  # find the nearest neighbors for each county
         NB_LIST = map(.x = KNN, ~ sfdep::st_nb_list(.x, style = 'B')),  #
make a list of the neighbors using the binary method
         LOCAL_G = sfdep::st_localG(x = NWBIR74, listw = NB_LIST,
zero.policy = TRUE),  # calculate the G scores
         LOCAL_G = round(LOCAL_G,3))

We can see that the (hypothetical) tidyverse version reduces the amount of
intermediate objects and wraps the creation of the G scores into a single
code chunk with clear steps.

I'd be grateful to hear from the users and developers of the `spdep` and
`sf` packages about this topic!

Tiernan Martin
1 day later
#
On Fri, 12 May 2017, Tiernan Martin wrote:

            
I assume that "this is not a pipeline" is a joke for insiders (in the 
pipe)?

No, there is your issue on the sfr github repository that is relevant for 
contiguous neighbours, but not beyond that:

https://github.com/edzer/sfr/issues/234

An sf is a data.frame, and as such should "just work", like "Spatial" 
objects have, in formula/data settings. The problem is (of course) the 
weights matrix.

Should it be a list column (each row has a list nesting two lists, first 
indices, second non-zero weights), or remain separate as it has been for 
20 years, or become a column-oriented representation (Matrix package) - a 
nested list like a list column or a listw obeject is row-oriented. I had 
started thinking about using a sparse column-oriented representation, but 
have not implemented functions accepting them instead of listw objects.

I am very cautious about creating classes for data that were data.frame, 
then sf, and then have the weights built-in. In the simple case it would 
work, but all you have to do is re-order the rows and the link between the 
neighbour ids and row order breaks down; the same applies to subsetting.

The problems to solve first are related the workflows, and easiest to look 
at in the univariate case (Moran, Geary, join-count, Mantel, G, ...) for 
global and local tests. I think that all or almost all of the NSE verbs 
will cause chaos (mutate, select, group) once weights have been created. 
If there is a way to bind the neighour IDs to the input sf object rows, a 
list column might be possible, but we have to permit multiple such columns 
(Queen, Rook, ...), and ensure that subsetting and row-reordering keep the 
graph relations intact (and their modifications, like row-standardisation)

If you've seen any examples of how tidy relates to graphs (adjacency lists 
are like nb objects), we could learn from that.

How should we go about this technically? spdep is on R-Forge and is happy 
there. Its present functionality has to be maintained as it stands, it has 
too many reverse dependencies to break. Should it be re-written from 
scratch (with more sparse matrices internally for computation)?

Your example creates weights on the fly for a local G* map, but has n=100, 
not say n=90000 (LA census blocks). Using the sf_ geos based methods does 
not use STRtrees, which cut the time needed to find contigious neighbours 
from days to seconds. We ought to pre-compute weights, but this messes up 
the flow of data, because a second lot of data (the weights) have to enter 
the pipe, and be matched with the row-ids.

We'd need proof of concept with realistically sized data sets (not yet NY 
taxis, but maybe later ...). spdep started as spweights, sptests and 
spdep, and the first two got folded into the third when stable. If weights 
are the first thing to go for, sfweights is where to go first (and port 
the STRtree envelope intersections for contiguities). It could define the 
new classes, and tests and modelling would use them.

Thanks for starting this discussion.

Roger

  
    
#
You?re right about one thing: those of us introduced to the pipe in our
formative years find it hard to escape (everything is better here inside
the pipe - right?) ;)

Thank you for your very informative response. Before reading it, I had a
vague inkling that matrices were part of the challenge of adapting these
tools to work with simple features, but I also thought perhaps all this
issue needed was a little nudge from a user like me. Your response made it
clear that there is a series of technical decisions that need to be made,
and in order to do that I imagine some exploratory testing is in order. Of
course, it is possible that the data.frame + list-col foundation is just
not well-suited to the needs of spatial weighting/testing tools, but I
suspect that there is some valuable insight to be gained from exploring
this possibility further. And as I stated in my first post, going forward I
expect more users will be interested in seeing this sort of integration
happen.
has
weights
Sounds to me like this package would need to be built from the ground up,
possibly following a similar path to the `sp` development process as you
mentioned. Maybe that presents a challenge with regards to resources (i.e.,
funding, time, etc.). Perhaps project this is a candidate for future
proposals to the R Consortium or other funding sources. I hope that this
post is useful in documenting user interest in seeing the `sf` protocol
integrated into other spatial tools and workflows, and I look forward to
hearing others' thoughts on the matter.

Thanks again,

Tiernan
On Sat, May 13, 2017 at 1:39 PM Roger Bivand <Roger.Bivand at nhh.no> wrote:

            

  
  
#
On Sun, 14 May 2017, Tiernan Martin wrote:

            
Pipes can be great if they don't involve extra inputs at a later stage in 
the pipe, but as:

https://github.com/thomasp85/tidygraph

suggests, it isn't very easy, and may be forced.
I think that there are two tasks - one to create neighbour objects from sf 
objects, then another to see whether the costs of putting the spatial 
weights into a data.frame (or even more challenging, a tibble) to test for 
spatial autocorrelation or model spatial dependence. Should the weights be 
squirreled away inside the object flowing down the pipe?

sf_object %>% add_weights(...) -> sf_object_with_weights
moran.test(sf_object_with_weights, ...)
geary.test(sf_object_with_weights, ...)

or

sf_object %>% add_weights(...) %>% moran.test(sf_object_with_weights, ...) 
-> output

all in one (output is a htest object). It gets more complex in:

lm.morantest(lm(formula, sf_object), sf_object_with_weights)

as we need the weights and the lm output object.
Making spatial weights from sf objects is just expanding spdep, and is 
feasible. There is no point looking for funding, all it needs is knowledge 
of how things have been done in spdep. Contiguity, knn, distance, 
graph-based neighbours are all feasible.

Extending this to adding weights to sf objects may be going too far. It 
seems slightly forced to - say - attach a sparse matrix to an sf geometry 
column as an attribute, or to add a neighbour nested list column when 
writing subsetting protection is very far from obvious. Like time series 
data (not time stamped observations, but real ordered time series), 
spatial data have implicit order that is tidy in a different way than tidy 
(graph dependencies between observation rows). Maybe mapped features with 
attached attribute data are "tidier" than a table, because they would 
preserve their relative positions?

Roger

  
    
#
I'd propose that the that functions be written in such a way that the
neighbors, weights, and test results can be stored in list cols within the
data.frame.

This way multiple tests (and their inputs) can be stored in a single,
easily comprehensible, rectangular data object.

Here's some more pretend code to illustrate that concept with regard to sf:

  nc %>%
    group_by(NAME) %>%
    nest %>%
    mutate(
           geometry = map(.x = data, ~.x[['geometry']]),
           neighbors = map(.x = data, .f = my_nb), # creates a list col of
the neighbors
           weights = map(.x = data, .f = add_weights), # creates a weights
object (class=wt, is this a sparse matrix in spdep?)
           test_moran = map2(.x = data, .y =  weights, .f = my_moran), #
creates list of Moran's I and sample kurtosis of x
           test_geary = map2(.x = data, .y = weights, .f = my_geary) #
creates list of Geary's C and sample kurtosis of x
               )

#> # A tibble: 100 x 7
#>           NAME              data         geometry  neighbors  weights
test_moran test_geary
#>        <fctr>            <list>           <list>     <list>   <list>
<list>     <list>
#>  1        Ashe <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt>
<list [2]> <list [2]>
#>  2   Alleghany <tibble [1 x 14]> <simple_feature> <list [5]> <S3: wt>
<list [2]> <list [2]>
#>  3       Surry <tibble [1 x 14]> <simple_feature> <list [4]> <S3: wt>
<list [2]> <list [2]>
#>  4   Currituck <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt>
<list [2]> <list [2]>
#>  5 Northampton <tibble [1 x 14]> <simple_feature> <list [3]> <S3: wt>
<list [2]> <list [2]>
#>  6    Hertford <tibble [1 x 14]> <simple_feature> <list [6]> <S3: wt>
<list [2]> <list [2]>
#>  7      Camden <tibble [1 x 14]> <simple_feature> <list [5]> <S3: wt>
<list [2]> <list [2]>
#>  8       Gates <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt>
<list [2]> <list [2]>
#>  9      Warren <tibble [1 x 14]> <simple_feature> <list [5]> <S3: wt>
<list [2]> <list [2]>
#> 10      Stokes <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt>
<list [2]> <list [2]>
#> # ... with 90 more rows

 Examples of this nested data.frame / list col approach can be found here:
http://r4ds.had.co.nz/many-models.html#nested-data
This seems like the crux of the problem. Are you saying that the orthogonal
"tidyness" of the tidyverse is fundamentally different from the underlying
relationships of spatial data, and therefore there's a limit to the
usefulness of tidyverse approach to spatial analysis?
On Mon, May 15, 2017 at 5:17 AM Roger Bivand <Roger.Bivand at nhh.no> wrote:

            

  
  
#
On Tue, 16 May 2017, Tiernan Martin wrote:

            
Thanks, interesting but really scary. You get numeric output for local 
Moran's I and local Geary, but what will happen next?

All of these tests assume a correctly specified mean model. Without this 
assumption being met (absence of trends, absence of global spatial 
autocorrelation in the local test case, absence of missing right-hand-side 
variables and absence of incorrect functional forms for these, ...), a map 
pretending to invite "inference" of "hotspots" etc. - collections of 
observations with similar values of the local statistic, will most often 
be highly misleading. This isn't only about whether it is possible to do 
technically, but also about not making it simpler than reason would 
suggest. With a correct mean model, there may not be any (local) spatial 
autocorrelation left. See ?localmoran.sad and ?localmoran.exact, and their 
references, and McMillen (2003) (the McSpatial author). I won't even 
mention the need to adjust p.values for multiple comparisons ...
Right, this is central and open. Spatial objects can be tidy in the 
tidyverse sense - sf shows this can be done. The open question is whether 
this extends to the analysis of the data. From looking around, it seems 
that time series and graphs suffer from the same kinds of questions, and 
it will be interesting to see what stars turns up for spatio-temporal 
data:

https://github.com/edzer/stars

I see that stars wisely does not include trip/trajectory data structures, 
but analysis might involve reading the array-based representation of 
spatio-temporal environmental drivers for an ensemble of buffers around 
trajectories (say animal movement), then modelling.

Maybe: "everything should be as tidy as it can be but not tidier"?

Roger

  
    
#
Hello, I'm following this conversation with great interest. There's a lot
here for me
to work with, it's an active pursuit and I don't have much to offer yet but
I have a couple of inline responses below.  I've learnt a lot about spdep
from this discussion, something I've meant to explore for a long time
but my work has never had the same modelling focus).
On Tue, 16 May 2017 at 18:25 Roger Bivand <Roger.Bivand at nhh.no> wrote:

            
I think the structures are still wide open, unless you think the goals of sf
encompass everything. The simple features standard is one corner of the
available ways to structure spatial data. It pointedly ignores relations
between objects, and heavily relies on 2D-specific optimizations. All
the topology is done "on the fly" and is treated as expendable. I personally
want structures that treat simple features and similar formats as
expendable,
but that maintain the topology as a first-class citizen. I don't say this
because I think simple features is not important, it's just not my
main focus. I want time, depth, temperature, salinity, anything and
everything
storable on features, parts, edges, vertices and nodes. Lots of tools
do this or aspects of it, and none of them can be shoe-horned into simple
features, and
they don't have a unifying framework. (It's called "mesh structures", or
the simplicial
complex, or "a database" in other contexts, but none of those are really
enough).

 (I'm dismayed by GIS ideas when they are spoken about as if it's the end
of the story. Sf is really great, but it's now more than the standard and
it's kind
of unique in the available tools for the standard itself - but most of them
are pretty specialized and different to each other, probably unavoidably).

ggplot2 and tidygraph still have a long way to go, but in there I believe
is a language not just of graphics and of converting to and from igraph,
but of flexible ways of specifying how spatial data are constructed,
analysed
and interacted with.
Ggplot builds geometries, but it has no output other than a plot, and the
object that can
modify that plot. Select, group-by, arrange, filter, nest, and normalization
into multiple tables are all part of the powerful ways in which raw data
can be given structure. "Spatial" is just one small part of that structure,
and
unfortunately the 'geographic spatial', the 2D optimizations baked in
by decades old GIS practice seems to have the most sway in discussions.

This community could find a way to bake-in the geometries from ggplot
constructions as sf objects, i.e. convert gg-plot into sf, not into a plot.
I think that's a
digestable project that would really provide some valuable insights.
Interesting that you mention this, I agree it's a crux area in modern
spatial,
tracking has been neglected for years but it's starting to be looked at.

 I don't see this as terribly difficult, in fact we do this routinely
for all kinds of tracking data both as-measured and modelled. The
difficulty is having
access to the data as well as the motivation in the same place. We have
in-house tools
that sit on large collections of remotely sensed time series data:

https://github.com/AustralianAntarcticDataCentre/raadsync

When you have both the motivation (lots of animal tracking and voyage data)
and
the data on hand it's relatively easy (we have a ready audience at in
Southern Ocean
ecosystems research here).  First step is to write a read-data tool as a
function
of time. Then the problem of matching tracking data to the right space-time
window is easily broken down into the intervals between the time
series of environmental data. At that level, you can employ interpolation
between the layers,
aggregation to compromise resolution with coverage, and even
bring user-defined calculations to the smallish windows to calculate
derived products (rugosity, slope, distance to thresholds etc.)

In our modelled track outputs with tripEstimation/SGAT and bsam[1] we can
use the more probabilistic
estimates of location as more nuanced "query buckets" against the
same environmental data, it's the same issue with matching time with
all the different options you might want. What is hard for us is to share
the tools,
because first you need to arrange access to quite a lot of data, before
you get to see the magic in action.

I do think that stars provides a great opportunity though, and my thoughts
along
those lines are here:

https://github.com/mdsumner/stars.use.cases/blob/master/Use-cases.rmd#animal-tracking-mcmc-estimation


[1] https://CRAN.R-project.org/package=bsam
I don't think we've even started with the housekeeping here. Some early
thoughts are here:

http://rpubs.com/cyclemumner/sc-rationale

I'm interested to help find ways to nest the spdep idioms in "tidy" ways,
but I'm also not sure
how valuable that is yet. When you convert the structures to primitives the
edge and vertex relations
are inherent already, and it's a mater of standard database
join/select/filter idioms
to link things up:

http://rpubs.com/cyclemumner/neighbours01

http://rpubs.com/cyclemumner/neighbours02

(those examples use https://github.com/mdsumner/scsf for the PRIMITIVE
model
but it's really too raw to use yet). This is all fine and dandy, but then
you still have a list-bag
- i.e. database  - of tables with no strong idioms for treating them as a
single object -
that's the real loftier goal  of tidygraph, it's certainly a clear goal of
the tidyverse to be a
general master of data).

Is this a serious option worth pursuing for spdep? I don't know, but I'm
pursuing it for my own
reasons as detailed above. I don't think modernizing spdep would be that
difficult.  At the least the limitations and
obstacles faced when using it  should be described.

I'm happy to help and I might have a go at it at some point, and I'd
encourage
anyone to get in and try it out.With rlang and the imminent release of the
greatly
improved dplyr it's a good time to start.

Cheers, Mike.

  
    
#
Interesting discussion, thanks for bringing this up here. Some ideas:

I agree that there's more than one answer to the question what is tidy
in the context of analysing spatial data. Compared to the S4 classes in
sp, with the simplicity of S3 and data.frame's, there's also a certain
risk/loss, two examples came up the last few days in [1] and [2].

I don't think that the risks of adding neighbourhood lists or sparse
weights matrices to sf/data.frame objects are really large - right now
I'd guess (untried!) it is also possible to create nb lists, then subset
data and this list or sparse matrix, and end up with a meaningless mess
relatively unnoticed (some luck needed).

It might be a good idea to give these lists a certain class (right now
`st_intersects` or `st_rook` return an unclassed list), and modify the
`filter.sf` method such that it will warn when these objects are
subsetted - in which case they would loose their meaning.

I'm not exactly sure how to pack a sparse weights matrix in a
list-column, but that can't be hard; somehow I feel that sparse matrices
from Matrix would not fit in, since they're not lists - convert on the fly?

[1] https://github.com/edzer/sfr/issues/343
[2] https://github.com/edzer/sfr/issues/340
On 16/05/17 14:27, Michael Sumner wrote: