Is a simple feature-friendly version of spdep being developed?

Is anyone thinking about creating an adaptation of the `spdep` package that expects sf-class inputs and works well in a pipeline?

I understand that there is skepticism about the wisdom of adopting the "tidyverse" principles throughout the R package ecosystem, and I share the concern that an over-reliance on any single paradigm could reduce the resilience and diversity of the system as a whole. That said, I believe that the enthusiastic adoption of the `sf` package and the package's connections with widely-used tidyverse packages like `dplyr` and `ggplot2` may result in increased demand for sf-friendly spatial analysis tools. As an amateur who recently started using R as my primary GIS tool, it seems to me that the tidyverse's preference for data frames, S3 objects, list columns, and pipeline workflows would be well-suited to the field of spatial analysis. Are there some fundamental reasons why the `spdep` tools cannot (or should not) be adapted to the tidyverse "dialect"?

Let me put the question in the context of an actual analysis: in February 2017, the pop culture infovis website The Pudding (https://pudding.cool/) published an analysis of regional preferences for Oscar-nominated films in the US (https://pudding.cool/2017/02/oscars_so_mapped/). A few days ago, the author posted a tutorial explaining the method of "regional smoothing" used to create the article's choropleths (https://pudding.cool/process/regional_smoothing/). The method relies on several `spdep` functions (https://github.com/polygraph-cool/smoothing_tutorial/blob/master/smoothing_tutorial.R).

In the code below, I provide a reprex with a smaller dataset included in the `sf` package:

library(sf)
library(spdep)

nc <- st_read(system.file("shape/nc.shp", package = "sf")) # North Carolina counties
nc_shp <- as(nc, 'Spatial')

coords <- coordinates(nc_shp)
IDs <- row.names(as(nc_shp, "data.frame"))

knn5 <- knn2nb(knearneigh(coords, k = 5), row.names = IDs) # find the nearest neighbors for each county
knn5 <- include.self(knn5)

localGvalues <- localG(x = as.numeric(nc_shp@data$NWBIR74), listw = nb2listw(knn5, style = "B"), zero.policy = TRUE) # calculate the G scores
localGvalues <- round(localGvalues, 3)

nc_shp@data$LOCAL_G <- as.numeric(localGvalues)

p1 <- spplot(nc_shp, c('NWBIR74'))
p2 <- spplot(nc_shp, c('LOCAL_G'))
plot(p1, split = c(1, 1, 2, 2), more = TRUE)
plot(p2, split = c(1, 2, 2, 2), more = TRUE)

Here's what I imagine that would look like in a tidyverse pipeline (please note that this code is for illustrative purposes and will not run):

library(tidyverse)
library(purrr)
library(sf)
library(sfdep) # this package doesn't exist (yet)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))

nc_g <-
  nc %>%
  mutate(KNN = map(.x = geometry, ~ sfdep::st_knn(.x, k = 5, include.self = TRUE)),    # find the nearest neighbors for each county
         NB_LIST = map(.x = KNN, ~ sfdep::st_nb_list(.x, style = 'B')),                # make a list of the neighbors using the binary method
         LOCAL_G = sfdep::st_localG(x = NWBIR74, listw = NB_LIST, zero.policy = TRUE), # calculate the G scores
         LOCAL_G = round(LOCAL_G, 3))

We can see that the (hypothetical) tidyverse version reduces the number of intermediate objects and wraps the creation of the G scores into a single code chunk with clear steps.

I'd be grateful to hear from the users and developers of the `spdep` and `sf` packages about this topic!

Tiernan Martin
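(A minimal editorial sketch, not part of the original post: the same local G calculation can already be piped with spdep, sf and dplyr, although the neighbours and weights still have to be built outside the mutate() call, which is exactly the friction discussed in the replies below.)

library(sf)
library(spdep)
library(dplyr)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))

# neighbours and weights are pre-computed from the sf geometry
coords <- st_coordinates(st_centroid(st_geometry(nc)))
knn5 <- include.self(knn2nb(knearneigh(coords, k = 5)))
lw <- nb2listw(knn5, style = "B")

nc_g <- nc %>%
  mutate(LOCAL_G = round(as.numeric(localG(NWBIR74, listw = lw)), 3))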
On Fri, 12 May 2017, Tiernan Martin wrote:
Is anyone thinking about creating an adaptation of the `spdep` package that expects sf-class inputs and works well in a pipeline?
I assume that "this is not a pipeline" is a joke for insiders (in the pipe)?

No, there is your issue on the sfr github repository that is relevant for contiguous neighbours, but not beyond that: https://github.com/edzer/sfr/issues/234

An sf is a data.frame, and as such should "just work", like "Spatial" objects have, in formula/data settings. The problem is (of course) the weights matrix. Should it be a list column (each row has a list nesting two lists, first indices, second non-zero weights), or remain separate as it has been for 20 years, or become a column-oriented representation (Matrix package)? A nested list like a list column or a listw object is row-oriented. I had started thinking about using a sparse column-oriented representation, but have not implemented functions accepting them instead of listw objects.

I am very cautious about creating classes for data that were data.frame, then sf, and then have the weights built in. In the simple case it would work, but all you have to do is re-order the rows and the link between the neighbour ids and row order breaks down; the same applies to subsetting.

The problems to solve first are related to the workflows, and are easiest to look at in the univariate case (Moran, Geary, join-count, Mantel, G, ...) for global and local tests. I think that all or almost all of the NSE verbs will cause chaos (mutate, select, group) once weights have been created. If there is a way to bind the neighbour IDs to the input sf object rows, a list column might be possible, but we have to permit multiple such columns (Queen, Rook, ...), and ensure that subsetting and row-reordering keep the graph relations intact (and their modifications, like row-standardisation). If you've seen any examples of how tidy relates to graphs (adjacency lists are like nb objects), we could learn from that.

How should we go about this technically? spdep is on R-Forge and is happy there. Its present functionality has to be maintained as it stands; it has too many reverse dependencies to break. Should it be re-written from scratch (with more sparse matrices internally for computation)? Your example creates weights on the fly for a local G* map, but has n=100, not say n=90000 (LA census blocks). Using the sf GEOS-based methods does not use STRtrees, which cut the time needed to find contiguous neighbours from days to seconds. We ought to pre-compute weights, but this messes up the flow of data, because a second lot of data (the weights) has to enter the pipe and be matched with the row-ids. We'd need proof of concept with realistically sized data sets (not yet NY taxis, but maybe later ...).

spdep started as spweights, sptests and spdep, and the first two got folded into the third when stable. If weights are the first thing to go for, sfweights is where to go first (and port the STRtree envelope intersections for contiguities). It could define the new classes, and tests and modelling would use them.

Thanks for starting this discussion.

Roger
Roger Bivand Department of Economics, Norwegian School of Economics, Helleveien 30, N-5045 Bergen, Norway. voice: +47 55 95 93 55; e-mail: Roger.Bivand at nhh.no Editor-in-Chief of The R Journal, https://journal.r-project.org/index.html http://orcid.org/0000-0003-2392-6140 https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
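(An editorial sketch of the two representations contrasted above - spatial weights as a row-oriented list column versus a column-oriented sparse matrix - and of how row subsetting breaks stored neighbour indices; the column name nb_wt and the use of poly2nb/listw2mat here are illustrative choices, not a proposed API.)

library(sf)
library(spdep)
library(Matrix)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))
nb <- poly2nb(as(nc, "Spatial"))   # queen contiguity neighbours
lw <- nb2listw(nb, style = "B")

# 1) row-oriented: one list element per feature, holding indices and non-zero weights
nc$nb_wt <- mapply(function(idx, w) list(indices = idx, weights = w),
                   lw$neighbours, lw$weights, SIMPLIFY = FALSE)

# 2) column-oriented: an n x n sparse matrix (Matrix package)
W <- Matrix(listw2mat(lw), sparse = TRUE)

# the fragility described above: after subsetting, each row still stores
# neighbour indices that point into the original row order
nc_sub <- nc[1:50, ]
nc_sub$nb_wt[[1]]$indices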
You're right about one thing: those of us introduced to the pipe in our formative years find it hard to escape (everything is better here inside the pipe - right?) ;)

Thank you for your very informative response. Before reading it, I had a vague inkling that matrices were part of the challenge of adapting these tools to work with simple features, but I also thought perhaps all this issue needed was a little nudge from a user like me. Your response made it clear that there is a series of technical decisions that need to be made, and in order to do that I imagine some exploratory testing is in order. Of course, it is possible that the data.frame + list-col foundation is just not well-suited to the needs of spatial weighting/testing tools, but I suspect that there is some valuable insight to be gained from exploring this possibility further. And as I stated in my first post, going forward I expect more users will be interested in seeing this sort of integration happen.
How should we go about this technically? spdep is on R-Forge and is happy there. Its present functionality has to be maintained as it stands, it has too many reverse dependencies to break. Should it be re-written from scratch (with more sparse matrices internally for computation)? ... We'd need proof of concept with realistically sized data sets (not yet NY taxis, but maybe later ...). spdep started as spweights, sptests and spdep, and the first two got folded into the third when stable. If weights are the first thing to go for, sfweights is where to go first (and port the STRtree envelope intersections for contiguities). It could define the new classes, and tests and modelling would use them.
Sounds to me like this package would need to be built from the ground up, possibly following a similar path to the `sp` development process as you mentioned. Maybe that presents a challenge with regards to resources (i.e., funding, time, etc.). Perhaps this project is a candidate for future proposals to the R Consortium or other funding sources. I hope that this post is useful in documenting user interest in seeing the `sf` protocol integrated into other spatial tools and workflows, and I look forward to hearing others' thoughts on the matter.

Thanks again,

Tiernan
On Sun, 14 May 2017, Tiernan Martin wrote:
You're right about one thing: those of us introduced to the pipe in our formative years find it hard to escape (everything is better here inside the pipe - right?) ;)
Pipes can be great if they don't involve extra inputs at a later stage in the pipe, but, as https://github.com/thomasp85/tidygraph suggests, it isn't very easy, and may be forced.
Thank you for your very informative response. Before reading it, I had a vague inkling that matrices were part of the challenge of adapting these tools to work with simple features, but I also thought perhaps all this issue needed was a little nudge from a user like me. Your response made it clear that there is a series of technical decisions that need to be made, and in order to do that I imagine some exploratory testing is in order. Of course, it is possible that the data.frame + list-col foundation is just not well-suited to the needs of spatial weighting/testing tools, but I suspect that there is some valuable insight to be gained from exploring this possibility further. And as I stated in my first post, going forward I expect more users will be interested in seeing this sort of integration happen.
I think that there are two tasks - one to create neighbour objects from sf objects, then another to see whether the costs of putting the spatial weights into a data.frame (or, even more challenging, a tibble) are worth it when testing for spatial autocorrelation or modelling spatial dependence. Should the weights be squirreled away inside the object flowing down the pipe?

sf_object %>% add_weights(...) -> sf_object_with_weights
moran.test(sf_object_with_weights, ...)
geary.test(sf_object_with_weights, ...)

or

sf_object %>% add_weights(...) %>% moran.test(...) -> output

all in one (output is an htest object). It gets more complex in:

lm.morantest(lm(formula, sf_object), sf_object_with_weights)

as we need the weights and the lm output object.
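(One way the first of those two flows could look, sketched as an assumption: add_weights() and the "listw" attribute name are invented for illustration, with spdep's poly2nb/nb2listw/moran.test doing the actual work.)

library(sf)
library(spdep)
library(magrittr)

# hypothetical helper: stash a listw object as an attribute of the sf object
add_weights <- function(x, ...) {
  nb <- poly2nb(as(x, "Spatial"), ...)
  attr(x, "listw") <- nb2listw(nb, style = "B")
  x
}

# hypothetical wrapper: retrieve the stored weights and run the test
moran_test_sf <- function(x, var) {
  moran.test(x[[var]], listw = attr(x, "listw"))   # returns an htest object
}

nc <- st_read(system.file("shape/nc.shp", package = "sf"))
nc %>% add_weights() -> sf_object_with_weights
moran_test_sf(sf_object_with_weights, "NWBIR74")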
Sounds to me like this package would need to be built from the ground up, possibly following a similar path to the `sp` development process as you mentioned. Maybe that presents a challenge with regards to resources (i.e., funding, time, etc.). Perhaps this project is a candidate for future proposals to the R Consortium or other funding sources. I hope that this post is useful in documenting user interest in seeing the `sf` protocol integrated into other spatial tools and workflows, and I look forward to hearing others' thoughts on the matter.
Making spatial weights from sf objects is just expanding spdep, and is feasible. There is no point looking for funding; all it needs is knowledge of how things have been done in spdep. Contiguity, knn, distance and graph-based neighbours are all feasible.

Extending this to adding weights to sf objects may be going too far. It seems slightly forced to - say - attach a sparse matrix to an sf geometry column as an attribute, or to add a neighbour nested list column, when writing subsetting protection is very far from obvious. Like time series data (not time-stamped observations, but real ordered time series), spatial data have an implicit order that is tidy in a different way than tidy data (graph dependencies between observation rows). Maybe mapped features with attached attribute data are "tidier" than a table, because they would preserve their relative positions?

Roger
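(For the "making spatial weights from sf objects" part, a sketch of contiguity neighbours computed straight from sf geometries with st_relate and DE-9IM patterns, in the spirit of the sfr issue linked earlier; st_queen/st_rook are ad hoc helper names, and this assumes an sf version that supports the pattern argument.)

library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))

# queen: any shared boundary point; rook: a shared boundary line
st_queen <- function(a, b = a) st_relate(a, b, pattern = "F***T****")
st_rook  <- function(a, b = a) st_relate(a, b, pattern = "F***1****")

nb_q <- st_queen(nc)   # sparse list: element i holds the indices of i's neighbours
nb_q[[1]]              # counties sharing at least a boundary point with county 1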
Should the weights be squirreled away inside the object flowing down the pipe?
I'd propose that the functions be written in such a way that the neighbors, weights, and test results can be stored in list cols within the data.frame. This way multiple tests (and their inputs) can be stored in a single, easily comprehensible, rectangular data object.

Here's some more pretend code to illustrate that concept with regard to sf:

nc %>%
  group_by(NAME) %>%
  nest %>%
  mutate(
    geometry   = map(.x = data, ~ .x[['geometry']]),
    neighbors  = map(.x = data, .f = my_nb),                    # creates a list col of the neighbors
    weights    = map(.x = data, .f = add_weights),              # creates a weights object (class = wt; is this a sparse matrix in spdep?)
    test_moran = map2(.x = data, .y = weights, .f = my_moran),  # creates list of Moran's I and sample kurtosis of x
    test_geary = map2(.x = data, .y = weights, .f = my_geary)   # creates list of Geary's C and sample kurtosis of x
  )
#> # A tibble: 100 x 7
#>           NAME              data         geometry  neighbors  weights test_moran test_geary
#>         <fctr>            <list>           <list>     <list>   <list>     <list>     <list>
#>  1        Ashe <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt> <list [2]> <list [2]>
#>  2   Alleghany <tibble [1 x 14]> <simple_feature> <list [5]> <S3: wt> <list [2]> <list [2]>
#>  3       Surry <tibble [1 x 14]> <simple_feature> <list [4]> <S3: wt> <list [2]> <list [2]>
#>  4   Currituck <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt> <list [2]> <list [2]>
#>  5 Northampton <tibble [1 x 14]> <simple_feature> <list [3]> <S3: wt> <list [2]> <list [2]>
#>  6    Hertford <tibble [1 x 14]> <simple_feature> <list [6]> <S3: wt> <list [2]> <list [2]>
#>  7      Camden <tibble [1 x 14]> <simple_feature> <list [5]> <S3: wt> <list [2]> <list [2]>
#>  8       Gates <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt> <list [2]> <list [2]>
#>  9      Warren <tibble [1 x 14]> <simple_feature> <list [5]> <S3: wt> <list [2]> <list [2]>
#> 10      Stokes <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt> <list [2]> <list [2]>
#> # ... with 90 more rows
Examples of this nested data.frame / list col approach can be found here:
http://r4ds.had.co.nz/many-models.html#nested-data
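(A runnable editorial variant of the idea above, with my_nb and add_weights stood in for by spdep's poly2nb/nb2listw and without the group_by/nest step: the un-nested sf object keeps one row per county, and the neighbour indices and binary weights live in list columns.)

library(sf)
library(spdep)
library(dplyr)
library(purrr)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))
lw <- nb2listw(poly2nb(as(nc, "Spatial")), style = "B")

nc_listcol <- nc %>%
  mutate(neighbors = lapply(lw$neighbours, as.integer),   # list col of neighbour indices
         weights   = lapply(lw$weights, as.numeric))      # list col of matching binary weights

# the list columns can then feed row-wise computations, e.g. a spatial lag
x <- nc_listcol$NWBIR74
nc_listcol <- nc_listcol %>%
  mutate(lag_NWBIR74 = map2_dbl(neighbors, weights, ~ sum(.y * x[.x])))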
Extending this to adding weights to sf objects may be going too far. It seems slightly forced to - say - attach a sparse matrix to an sf geometry column as an attribute, or to add a neighbour nested list column, when writing subsetting protection is very far from obvious. Like time series data (not time-stamped observations, but real ordered time series), spatial data have an implicit order that is tidy in a different way than tidy data (graph dependencies between observation rows). Maybe mapped features with attached attribute data are "tidier" than a table, because they would preserve their relative positions?
This seems like the crux of the problem. Are you saying that the orthogonal "tidyness" of the tidyverse is fundamentally different from the underlying relationships of spatial data, and therefore there's a limit to the usefulness of the tidyverse approach to spatial analysis?
On Tue, 16 May 2017, Tiernan Martin wrote:
Thanks, interesting but really scary. You get numeric output for local Moran's I and local Geary, but what will happen next? All of these tests assume a correctly specified mean model. Without this assumption being met (absence of trends, absence of global spatial autocorrelation in the local test case, absence of missing right-hand-side variables and absence of incorrect functional forms for these, ...), a map pretending to invite "inference" of "hotspots" etc. (collections of observations with similar values of the local statistic) will most often be highly misleading.

This isn't only about whether it is possible to do technically, but also about not making it simpler than reason would suggest. With a correct mean model, there may not be any (local) spatial autocorrelation left. See ?localmoran.sad and ?localmoran.exact, and their references, and McMillen (2003) (the McSpatial author). I won't even mention the need to adjust p-values for multiple comparisons ...
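(A small illustration of that last point, as a sketch; the p-value column is taken as the last column of localmoran()'s output because its name has varied across spdep versions.)

library(sf)
library(spdep)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))
nb <- poly2nb(as(nc, "Spatial"))
lw <- nb2listw(nb, style = "W")

lm_i  <- localmoran(nc$NWBIR74, lw)            # columns: Ii, E.Ii, Var.Ii, Z.Ii, p-value
p_raw <- lm_i[, ncol(lm_i)]
p_adj <- p.adjustSP(p_raw, nb, method = "bonferroni")   # adjust by each region's neighbour count

sum(p_raw < 0.05)   # "significant" local values before adjustment
sum(p_adj < 0.05)   # typically fewer remain after adjustment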
This seems like the crux of the problem. Are you saying that the orthogonal "tidyness" of the tidyverse is fundamentally different from the underlying relationships of spatial data, and therefore there's a limit to the usefulness of the tidyverse approach to spatial analysis?
Right, this is central and open. Spatial objects can be tidy in the tidyverse sense - sf shows this can be done. The open question is whether this extends to the analysis of the data. From looking around, it seems that time series and graphs suffer from the same kinds of questions, and it will be interesting to see what stars turns up for spatio-temporal data: https://github.com/edzer/stars

I see that stars wisely does not include trip/trajectory data structures, but analysis might involve reading the array-based representation of spatio-temporal environmental drivers for an ensemble of buffers around trajectories (say animal movement), then modelling.

Maybe: "everything should be as tidy as it can be but not tidier"?

Roger
On Mon, May 15, 2017 at 5:17 AM Roger Bivand <Roger.Bivand at nhh.no> wrote:
On Sun, 14 May 2017, Tiernan Martin wrote:
You?re right about one thing: those of us introduced to the pipe in our formative years find it hard to escape (everything is better here inside the pipe - right?) ;)
Pipes can be great if they don't involve extra inputs at a later stage in the pipe, but as: https://github.com/thomasp85/tidygraph suggests, it isn't very easy, and may be forced.
Thank you for your very informative response. Before reading it, I had a vague inkling that matrices were part of the challenge of adapting these tools to work with simple features, but I also thought perhaps all this issue needed was a little nudge from a user like me. Your response made
it
clear that there is a series of technical decisions that need to be made, and in order to do that I imagine some exploratory testing is in order.
Of
course, it is possible that the data.frame + list-col foundation is just not well-suited to the needs of spatial weighting/testing tools, but I suspect that there is some valuable insight to be gained from exploring this possibility further. And as I stated in my first post, going
forward I
expect more users will be interested in seeing this sort of integration happen.
I think that there are two tasks - one to create neighbour objects from sf objects, then another to see whether the costs of putting the spatial weights into a data.frame (or even more challenging, a tibble) to test for spatial autocorrelation or model spatial dependence. Should the weights be squirreled away inside the object flowing down the pipe? sf_object %>% add_weights(...) -> sf_object_with_weights moran.test(sf_object_with_weights, ...) geary.test(sf_object_with_weights, ...) or sf_object %>% add_weights(...) %>% moran.test(sf_object_with_weights, ...) -> output all in one (output is a htest object). It gets more complex in: lm.morantest(lm(formula, sf_object), sf_object_with_weights) as we need the weights and the lm output object.
How should we go about this technically? spdep is on R-Forge and is
happy
there. Its present functionality has to be maintained as it stands, it
has
too many reverse dependencies to break. Should it be re-written from scratch (with more sparse matrices internally for computation)? ... We'd need proof of concept with realistically sized data sets (not yet
NY
taxis, but maybe later ...). spdep started as spweights, sptests and spdep, and the first two got folded into the third when stable. If
weights
are the first thing to go for, sfweights is where to go first (and port the STRtree envelope intersections for contiguities). It could define
the
new classes, and tests and modelling would use them.
Sounds to me like this package would need to be built from the ground up, possibly following a similar path to the `sp` development process as you mentioned. Maybe that presents a challenge with regards to resources
(i.e.,
funding, time, etc.). Perhaps project this is a candidate for future proposals to the R Consortium or other funding sources. I hope that this post is useful in documenting user interest in seeing the `sf` protocol integrated into other spatial tools and workflows, and I look forward to hearing others' thoughts on the matter.
Making spatial weights from sf objects is just expanding spdep, and is feasible. There is no point looking for funding, all it needs is knowledge of how things have been done in spdep. Contiguity, knn, distance, graph-based neighbours are all feasible. Extending this to adding weights to sf objects may be going too far. It seems slightly forced to - say - attach a sparse matrix to an sf geometry column as an attribute, or to add a neighbour nested list column when writing subsetting protection is very far from obvious. Like time series data (not time stamped observations, but real ordered time series), spatial data have implicit order that is tidy in a different way than tidy (graph dependencies between observation rows). Maybe mapped features with attached attribute data are "tidier" than a table, because they would preserve their relative positions? Roger
Thanks again, Tiernan
Roger Bivand Department of Economics, Norwegian School of Economics, Helleveien 30, N-5045 Bergen, Norway. voice: +47 55 95 93 55; e-mail: Roger.Bivand at nhh.no Editor-in-Chief of The R Journal, https://journal.r-project.org/index.html http://orcid.org/0000-0003-2392-6140 https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
Hello, I'm following this conversation with great interest. There's a lot here for me to work with; it's an active pursuit and I don't have much to offer yet, but I have a couple of inline responses below. I've learnt a lot about spdep from this discussion - something I've meant to explore for a long time, but my work has never had the same modelling focus.
On Tue, 16 May 2017 at 18:25 Roger Bivand <Roger.Bivand at nhh.no> wrote:
On Tue, 16 May 2017, Tiernan Martin wrote:
Should the weights be squirreled away inside the object flowing down the pipe?
I'd propose that the functions be written in such a way that the neighbors, weights, and test results can be stored in list cols within the data.frame. This way multiple tests (and their inputs) can be stored in a single, easily comprehensible, rectangular data object. Here's some more pretend code to illustrate that concept with regard to sf:
nc %>%
  group_by(NAME) %>%
  nest %>%
  mutate(
    geometry   = map(.x = data, ~ .x[['geometry']]),
    neighbors  = map(.x = data, .f = my_nb),         # creates a list col of the neighbors
    weights    = map(.x = data, .f = add_weights),   # creates a weights object (class = wt; is this a sparse matrix in spdep?)
    test_moran = map2(.x = data, .y = weights, .f = my_moran),  # creates list of Moran's I and sample kurtosis of x
    test_geary = map2(.x = data, .y = weights, .f = my_geary)   # creates list of Geary's C and sample kurtosis of x
  )
#> # A tibble: 100 x 7
#>           NAME              data         geometry  neighbors  weights test_moran test_geary
#>         <fctr>            <list>           <list>     <list>   <list>     <list>     <list>
#>  1        Ashe <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt> <list [2]> <list [2]>
#>  2   Alleghany <tibble [1 x 14]> <simple_feature> <list [5]> <S3: wt> <list [2]> <list [2]>
#>  3       Surry <tibble [1 x 14]> <simple_feature> <list [4]> <S3: wt> <list [2]> <list [2]>
#>  4   Currituck <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt> <list [2]> <list [2]>
#>  5 Northampton <tibble [1 x 14]> <simple_feature> <list [3]> <S3: wt> <list [2]> <list [2]>
#>  6    Hertford <tibble [1 x 14]> <simple_feature> <list [6]> <S3: wt> <list [2]> <list [2]>
#>  7      Camden <tibble [1 x 14]> <simple_feature> <list [5]> <S3: wt> <list [2]> <list [2]>
#>  8       Gates <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt> <list [2]> <list [2]>
#>  9      Warren <tibble [1 x 14]> <simple_feature> <list [5]> <S3: wt> <list [2]> <list [2]>
#> 10      Stokes <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt> <list [2]> <list [2]>
#> # ... with 90 more rows
Examples of this nested data.frame / list col approach can be found here:
http://r4ds.had.co.nz/many-models.html#nested-data
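Something close to this rectangular bookkeeping can already be approximated with the existing spdep interface, by keeping the nb/listw inputs and the htest results in list columns of a tibble. A rough sketch only - the my_nb/add_weights/my_moran functions above are hypothetical, while everything below exists today:

library(sf)
library(spdep)
library(tibble)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nb <- poly2nb(as(nc, "Spatial"))   # queen contiguity neighbours
lw <- nb2listw(nb, style = "B")    # binary spatial weights

# one row per analysis, with the inputs and results held in list columns
results <- tibble(
  variable  = "NWBIR74",
  neighbors = list(nb),
  weights   = list(lw),
  moran     = list(moran.test(nc$NWBIR74, listw = lw)),
  geary     = list(geary.test(nc$NWBIR74, listw = lw))
)
results$moran[[1]]   # the usual htest printout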
Thanks, interesting but really scary. You get numeric output for local Moran's I and local Geary, but what will happen next? All of these tests assume a correctly specified mean model. Without this assumption being met (absence of trends, absence of global spatial autocorrelation in the local test case, absence of missing right-hand-side variables and absence of incorrect functional forms for these, ...), a map pretending to invite "inference" of "hotspots" etc. - collections of observations with similar values of the local statistic - will most often be highly misleading. This isn't only about whether it is possible to do technically, but also about not making it simpler than reason would suggest. With a correct mean model, there may not be any (local) spatial autocorrelation left. See ?localmoran.sad and ?localmoran.exact, and their references, and McMillen (2003) (the McSpatial author). I won't even mention the need to adjust p.values for multiple comparisons ...
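A small sketch of the "mean model first" point, again with the North Carolina data; the covariate is arbitrary and purely illustrative:

library(sf)
library(spdep)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
lw <- nb2listw(poly2nb(as(nc, "Spatial")), style = "W")

# test residual autocorrelation after fitting a mean model,
# rather than mapping a local statistic of the raw variable
fit <- lm(NWBIR74 ~ BIR74, data = nc)   # toy mean model
lm.morantest(fit, listw = lw)
# see ?localmoran.sad and ?localmoran.exact for local tests that work
# with a fitted model rather than the raw variable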
Extending this to adding weights to sf objects may be going too far. It seems slightly forced to - say - attach a sparse matrix to an sf geometry column as an attribute, or to add a neighbour nested list column when writing subsetting protection is very far from obvious. Like time series data (not time stamped observations, but real ordered time series), spatial data have implicit order that is tidy in a different way than tidy (graph dependencies between observation rows). Maybe mapped features with attached attribute data are "tidier" than a table, because they would preserve their relative positions?
This seems like the crux of the problem. Are you saying that the orthogonal "tidyness" of the tidyverse is fundamentally different from the underlying relationships of spatial data, and therefore there's a limit to the usefulness of the tidyverse approach to spatial analysis?
Right, this is central and open. Spatial objects can be tidy in the tidyverse sense - sf shows this can be done. The open question is whether this extends to the analysis of the data.
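A tiny illustration of that point - stock sf and dplyr only, assuming a reasonably current sf with its dplyr verb methods:

library(sf)
library(dplyr)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

nc %>%
  mutate(rate74 = SID74 / BIR74) %>%   # ordinary dplyr verbs on an sf object
  select(rate74) %>%                   # the geometry column stays attached
  plot()                               # still an sf, so this draws a map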
I think the structures are still wide open, unless you think the goals of sf encompass everything. The simple features standard is one corner of the available ways to structure spatial data. It pointedly ignores relations between objects, and heavily relies on 2D-specific optimizations. All the topology is done "on the fly" and is treated as expendable. I personally want structures that treat simple features and similar formats as expendable, but that maintain the topology as a first-class citizen. I don't say this because I think simple features is not important, it's just not my main focus. I want time, depth, temperature, salinity, anything and everything storable on features, parts, edges, vertices and nodes. Lots of tools do this or aspects of it, and none of them can be shoe-horned into simple features, and they don't have a unifying framework. (It's called "mesh structures", or the simplicial complex, or "a database" in other contexts, but none of those are really enough.)
(I'm dismayed by GIS ideas when they are spoken about as if it's the end of the story. sf is really great, but it's now more than the standard and it's kind of unique in the available tools for the standard itself - but most of them are pretty specialized and different to each other, probably unavoidably.)
ggplot2 and tidygraph still have a long way to go, but in there I believe is a language not just of graphics and of converting to and from igraph, but of flexible ways of specifying how spatial data are constructed, analysed and interacted with. ggplot builds geometries, but it has no output other than a plot, and the object that can modify that plot. Select, group-by, arrange, filter, nest, and normalization into multiple tables are all part of the powerful ways in which raw data can be given structure. "Spatial" is just one small part of that structure, and unfortunately the 'geographic spatial', the 2D optimizations baked in by decades-old GIS practice, seems to have the most sway in discussions. This community could find a way to bake in the geometries from ggplot constructions as sf objects, i.e. convert ggplot into sf, not into a plot. I think that's a digestible project that would really provide some valuable insights.
From looking around, it seems that time series and graphs suffer from the same kinds of questions, and it will be interesting to see what stars turns up for spatio-temporal data: https://github.com/edzer/stars I see that stars wisely does not include trip/trajectory data structures, but analysis might involve reading the array-based representation of spatio-temporal environmental drivers for an ensemble of buffers around trajectories (say animal movement), then modelling.
Interesting that you mention this; I agree it's a crux area in modern spatial, and tracking has been neglected for years but it's starting to be looked at. I don't see this as terribly difficult, in fact we do this routinely for all kinds of tracking data, both as-measured and modelled. The difficulty is having access to the data as well as the motivation in the same place. We have in-house tools that sit on large collections of remotely sensed time series data: https://github.com/AustralianAntarcticDataCentre/raadsync
When you have both the motivation (lots of animal tracking and voyage data) and the data on hand it's relatively easy (we have a ready audience in Southern Ocean ecosystems research here). First step is to write a read-data tool as a function of time. Then the problem of matching tracking data to the right space-time window is easily broken down into the intervals between the time series of environmental data. At that level, you can employ interpolation between the layers, aggregation to compromise resolution with coverage, and even bring user-defined calculations to the smallish windows to calculate derived products (rugosity, slope, distance to thresholds etc.). In our modelled track outputs with tripEstimation/SGAT and bsam [1] we can use the more probabilistic estimates of location as more nuanced "query buckets" against the same environmental data; it's the same issue of matching time, with all the different options you might want.
What is hard for us is to share the tools, because first you need to arrange access to quite a lot of data before you get to see the magic in action. I do think that stars provides a great opportunity though, and my thoughts along those lines are here: https://github.com/mdsumner/stars.use.cases/blob/master/Use-cases.rmd#animal-tracking-mcmc-estimation
[1] https://CRAN.R-project.org/package=bsam
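The time-matching step is mostly bookkeeping; a toy sketch with made-up timestamps (base R only, no real tracking or environmental data involved):

layer_times <- seq(as.POSIXct("2017-01-01", tz = "UTC"), by = "1 day", length.out = 10)   # times of the environmental layers
track_times <- as.POSIXct("2017-01-03 14:00:00", tz = "UTC") + c(0, 3600, 2 * 86400)      # times of the track fixes

idx <- findInterval(track_times, layer_times)   # which interval between layers each fix falls in
data.frame(fix = track_times,
           layer_before = layer_times[idx],
           layer_after  = layer_times[idx + 1])
# from here one can interpolate between the two bracketing layers, or
# aggregate layers when the fixes are sparse relative to the layer spacing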
Maybe: "everything should be as tidy as it can be but not tidier"?
I don't think we've even started with the housekeeping here. Some early thoughts are here: http://rpubs.com/cyclemumner/sc-rationale
I'm interested to help find ways to nest the spdep idioms in "tidy" ways, but I'm also not sure how valuable that is yet. When you convert the structures to primitives the edge and vertex relations are inherent already, and it's a matter of standard database join/select/filter idioms to link things up: http://rpubs.com/cyclemumner/neighbours01 http://rpubs.com/cyclemumner/neighbours02 (those examples use https://github.com/mdsumner/scsf for the PRIMITIVE model but it's really too raw to use yet). This is all fine and dandy, but then you still have a list-bag - i.e. a database - of tables with no strong idioms for treating them as a single object; that's the real loftier goal of tidygraph, and it's certainly a clear goal of the tidyverse to be a general master of data.
Is this a serious option worth pursuing for spdep? I don't know, but I'm pursuing it for my own reasons as detailed above. I don't think modernizing spdep would be that difficult. At the least the limitations and obstacles faced when using it should be described. I'm happy to help and I might have a go at it at some point, and I'd encourage anyone to get in and try it out. With rlang and the imminent release of the greatly improved dplyr it's a good time to start. Cheers, Mike.
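As a concrete (and simplistic) version of that join idiom: once the polygons are decomposed to edges with a direction-independent key, a self-join recovers rook-style neighbours. A sketch only - it assumes neighbouring polygons share identical vertices along their common boundary, which holds for cleanly built layers like nc but not in general:

library(sf)
library(dplyr)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

xy <- as.data.frame(st_coordinates(nc))   # vertices with ring (L1, L2) and feature (L3) indices

edges <- xy %>%
  group_by(L3, L2, L1) %>%
  mutate(X2 = lead(X), Y2 = lead(Y)) %>%   # pair each vertex with the next one in its ring
  ungroup() %>%
  filter(!is.na(X2)) %>%
  transmute(feature = L3,
            key = ifelse(X < X2 | (X == X2 & Y < Y2),
                         paste(X, Y, X2, Y2),
                         paste(X2, Y2, X, Y)))   # same key for an edge in either direction

nb_pairs <- edges %>%
  inner_join(edges, by = "key") %>%        # features sharing an edge
  filter(feature.x != feature.y) %>%
  distinct(feature.x, feature.y) %>%
  arrange(feature.x, feature.y)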
Dr. Michael Sumner Software and Database Engineer Australian Antarctic Division 203 Channel Highway Kingston Tasmania 7050 Australia
Interesting discussion, thanks for bringing this up here. Some ideas:
I agree that there's more than one answer to the question what is tidy in the context of analysing spatial data. Compared to the S4 classes in sp, with the simplicity of S3 and data.frame's there's also a certain risk/loss; two examples came up the last few days in [1] and [2].
I don't think that the risks of adding neighbourhood lists or sparse weights matrices to sf/data.frame objects are really large - right now I'd guess (untried!) it is also possible to create nb lists, then subset the data and this list or sparse matrix, and end up with a meaningless mess relatively unnoticed (some luck needed). It might be a good idea to give these lists a certain class (right now `st_intersects` or `st_rook` return an unclassed list), and modify the `filter.sf` method such that it will warn when these objects are subsetted - in which case they would lose their meaning. I'm not exactly sure how to pack a sparse weights matrix in a list-column, but that can't be hard; somehow I feel that sparse matrices from Matrix would not fit in, since they're not lists - convert on the fly?
[1] https://github.com/edzer/sfr/issues/343
[2] https://github.com/edzer/sfr/issues/340
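A rough sketch of the classed-list idea (purely illustrative: `st_rook` does not exist, and the class name and warning text here are made up), plus one way to get a Matrix sparse representation of binary weights via spdep:

library(sf)
library(spdep)
library(Matrix)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# st_intersects() gives, per feature, the indices of intersecting features
# (each feature intersects itself); give the list a class so subsetting can warn
nb_sf <- st_intersects(nc, nc)
class(nb_sf) <- c("sf_nb", class(nb_sf))

`[.sf_nb` <- function(x, i, ...) {
  warning("subsetting an sf_nb list: the stored indices still refer to the original row order")
  NextMethod()
}

nb_sf[1:10]   # warns, then returns the sublist as usual

# a Matrix-package sparse representation of binary weights, via a dense
# intermediate - fine for n = 100, not for n = 90000
lw <- nb2listw(poly2nb(as(nc, "Spatial")), style = "B")
W  <- Matrix(listw2mat(lw), sparse = TRUE)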
Edzer Pebesma Institute for Geoinformatics (ifgi), University of Münster Heisenbergstraße 2, 48149 Münster, Germany; +49 251 83 33081 Journal of Statistical Software: http://www.jstatsoft.org/ Computers & Geosciences: http://elsevier.com/locate/cageo/