Skip to content

merging data with SpatialPolygonsDataFrame

8 messages · stefan lhachimi, Dylan Beaudette, Roger Bivand

#
Hello,
I am having a spatial object by reading a shp-file by the
readShapePoly command. I now want to add data into the
SpatialPolygonsDataFrame from an external source (its a dataframe
which has a common key SHN_N). I am using the merge command, which I
thought works fine, unitl I realized that "map.kreise" now became just
a data.frame.
My greatest fear in working with spatial objects is that, if I don't
have them in one object, I end up plotting a value for the wrong
region. So is there a way to use merge or what is the most appropriate
way to combine data?

map.kreise<-readShapePoly("vg250krs",IDvar="KRS_ID",verbos=TRUE) #
reads the shape file
map.kreise<-merge(map.kreise,INC,sort = FALSE,by.map.kreise="SHN_N",
by.INC="SHN_N",all.map.kreise=T,all.INC=T)

Thanks a lot in advance,
Stefan
#
On Wed, 23 Jan 2008, stefan lhachimi wrote:

            
merge() is a "false friend", because a SpatialPolygonsDataFrame can be 
coerced to a data.frame, which is what merge() does without asking.

It may be best to say corece first, merge() the two data.frames, and 
rebuild the SpatialPolygonsDataFrame with SpatialPolygonsDataFrame().

map.kreise.df <- as(map.kreise, "data.frame")
map.kreise.df1 <- merge(map.kreise.df, INC, sort=FALSE, by.x="SHN_N",
   by.y="SHN_N", all.x=TRUE, all.y=TRUE)
map.kreise1 <- SpatialPolygonsDataFrame(as(map.kreise, "SpatialPolygons"),
   data=map.kreise.df1)

which may fail if INC and map.kreise.df have different row.names. The 
argument match.ID= (default TRUE) to SpatialPolygonsDataFrame matches the 
polygon IDs to the data= row.names - they must agree exactly though the 
data frame rows will be re-ordered to match the polygons if only the 
orders differ. Your concern here is real, which is why the argument is 
there.

If you don't need merge to manipulate the data frames, you can go straight 
for:

all.equal(row.names(INC), row.names(as(map.kreise, "data.frame"))
map.kreise1 <- spCbind(map.kreise, INC)

if the row.names of INC match the polygon IDs, and thus the row.names of
as(map.kreise, "data.frame"), see ?spCbind in maptools, and the example 
included there.

Hope this helps,

Roger

  
    
#
On Thursday 24 January 2008, Roger Bivand wrote:
Hi Roger. Thanks for contributing some answers to this. 

I was recently working with a colleague on developing some sample exercises 
for new students. Since joining new attribute data to a GIS layer's table is 
a very common operation we included some samples on how to do this within R. 
You have hinted at some possible ways to do it above, but do you have a 'best 
practices' approach to doing this using 'sp' methods and objects?

For example:

# contains an attribute col named 'veg_code'
veg <- readOGR(something.shp)

# code meanings: indexed by 'veg_code'
codes <- read.dbf(table.dbf)

# what is the best way to join up the attributes in 'veg' with the rows 
in 'codes' ?


As of now we are using merge to replace the dataframe slot of the original 
file. We first re-order the results from merge to match the original row 
ordering:


# an example file:
veg <- readOGR(dsn='ArcGISLabData/BrownsPond/', layer='vegevector')

# some example codes
veg_codes <- data.frame(code=1:4, meaning=c('code 1','code 2','code 3','code 
4'))

# join the original data table with the veg codes table
combined <- merge(x=veg at data, y=veg_codes, by.x='CODE', by.y='code')


# overwrite the original data frame with the combined version
# note that the original order needs to be restored
# since the original data was sorted on 'ID', we can use that to restore
# the correct order in the 'combined' dataframe:
v at data <- combined[order(combined$ID),]


In summary, is there a safer or preferred way to do this?

thanks,

Dylan
#
On Thu, 24 Jan 2008, Dylan Beaudette wrote:

            
Hi Dylan,

This is a different question, but I won't break it out of this thread yet.

You are doing look-up on codes to give labels to veg$veg_code, right?

veg$veg_code are integer indices to codes$V1 (say V1, I don't know what it 
is). If length(unique(veg$veg_code)) == length(codes$V1), and 
sort(unique(veg$veg_code)) is 1:length(codes$V1), you should think of the 
factor as your friend:

veg$veg_code_factor <- factor(veg$veg_code, labels=as.character(codes$V1))

If not, you need another layer using perhaps order() or match() on the 
matching substring of codes$V1 to find out which value of veg$veg_code 
should have which label in as.character(codes$V1). Alternatively use the 
levels= argument to factor().

Something like:

set.seed(1)
veg_code <- rpois(100, 4)
table(veg_code)
V1 <- paste("code", 0:10)
V1
levs <- 0:10
veg_code_factor <- factor(veg_code, levels=levs, labels=V1)
table(veg_code_factor, veg_code)

No merging or messing with veg itself is needed, apart from adding a 
single extra factor column. The factor abstraction is a great strength of 
the S language.

Have I misunderstood you?

Roger

  
    
#
On Thursday 24 January 2008, Roger Bivand wrote:
I think so. 

I was (trying to) describe the process of joining, either 1:1 or many:1, the 
att table associated with an sp object and some other data frame.

merge() seems to work fine, but the order of the rows are different from the 
original data frame attached to the sp object.

My question was on the best way to 'update' the data frame attached to an sp 
object, based on the results from a merge() with some other data.

does that help?

cheers,

Dylan

  
    
#
On Thu, 24 Jan 2008, Dylan Beaudette wrote:

            
My experience is that merge() is often a false friend, because of object 
coercion and sorting, as you suggest. The rules for SpatialLinesDataFrame 
and SpatialPolygonsDataFrame are simple, and SpatialPointsDataFrame (and 
by extension SpatialPixelsDataFrame) can be made as simple. The first two
constructor functions take a match.ID= argument, default TRUE, which 
matches the geometry ID - the ID slot in the component Lines or Polygons 
objects - to the data.frame row.name. For SpatialPointsDataFrame(), the 
same argument exists, but is only used if the coordinate matrix has row 
names to be matched to the data.frame row names. If it has been used, the 
data slot row names are the geometry IDs.
trial and error is needed to reach a satisfactory result, that is a single 
data frame with row names matching the geometry IDs (not necessarily in 
the right order, but the same set of IDs).
For one to one, there isn't really an alternative to extracting the data 
slot, do the merge, check the geometry IDs against the output data frame 
row names, and re-construct the object. That is, essentially where this 
thread began.

If there is less data than geometries, merge() should fill out with NAs. 
If there is more data than geometries, the output data frame will need 
subsetting. Both match(), order(), and %in% are very handy here.

For one geometry list to many - see reshape() first, for example to 
flatten a "tall" set of space/time observations so that the time values 
become new attribute columns. This could be met stations stacked by 
station and date, which need widening to stations by date*attributes

If need be, the IDs of the geometries can be changed too, see 
?"spChFIDs-methods" in maptools.

Any closer?

Roger

  
    
#
On Friday 25 January 2008, Roger Bivand wrote:
Thanks for the detailed response Roger.
I think that you have answered my questions and reaffirmed my thoughts on how 
to proceed. I am mostly interested in 1:1 and 1:many joins (geom:attr), and I 
think that the following approach should work in the general case:

# read vector data:
v <- readOGR(dsn='...', layer='...')

# safer approach to 1:1 or 1:many (geom:atts) joins
# add a sorting id for later use
v at data$sorting_id <- 1:nrow(v at data)

# make a copy of the original table
orig.table <- v at data

# make the table with 'joined' data
new.table <- merge(x=orig.table, y=veg_codes, by.x='CODE', by.y='code')

# re-order this table based on the original row order
new.table.ordered <- new.table[order(new.table$sorting_id), ]

# restore the origina row names
row.names(new.table.ordered) <- row.names(orig.table)

# replace the data table
v at data <- new.table.ordered


This should ensure that the merged rows are in the original order, and have 
the original row.names.

Does that sound correct?

Thanks,

Dylan

  
    
#
On Fri, 25 Jan 2008, Dylan Beaudette wrote:

            
v$sorting_id <- 1:nrow(as(v, "data.frame"))

is cleaner, there is absolutely no need to access the slot directly with 
the @ operator. Almost all R functions do lazy evaluation, so do not 
create new objects unless needed.

If you are using SpatialLines* or SpatialPolygons*, this is better, 
because it uses the geometry IDs (for v SpatialPolygons*):

v$sorting_id <- sapply(slot(v, "polygons"), function(x) slot(x, "ID"))
orig.table <- as(v, "data.frame")
maybe sort=FALSE, you can also try by.x=0, which are the geometry IDs
First check the identity of the two, recalling that character strings may 
be cast to factors when constructing data frames.
Rather not, make a v1 and leave the input v in peace. Reading history 
files a couple of days later where objects get overwritten is ten times 
harder than creating new objects and if need be deleting old ones. I find 
that copying and pasting out-of-order history files can lead to lots of 
confusion if object names are overwritten.
Use of all.equal() and identical() on the IDs, row names, etc., is still 
very comforting. Think locales, and you can see what might creep in when 
swapping data with others - just checking takes little time compared to 
recovering from an overconfident assumption that the data and the 
geometries are correctly attached.

It can be done through databases too, the aRT R/Terralib interface offers 
stronger control, and similar effects can be had with OGR/PostGIS.

Best wishes,

Roger