problem with duplicated function
On 25/05/15 09:34, Curtis Burkhalter wrote:
Hello everyone,

I have two very large data frames (~1 million rows x 5 columns), two of whose columns are lat/long coordinates. The data frames are named 'data07' and 'data08'. data08 has a few more sampling points than data07, so I want to subset data08 so that it has the same number of data points as data07, using the unique lat/long coordinates. Here are the associated data structures:

str(data07)
'data.frame': 969109 obs. of 5 variables:
 $ cell    : int 710228 715545 720690 720824 695611 700490 700626 705371 705507 710363 ...
 $ prN     : int 288 276 286 304 258 257 264 272 286 316 ...
 $ Location: Factor w/ 32 levels " ","Blacks_Fork",..: 24 24 24 24 24 24 24 24 24 24 ...
 $ Xcor    : num -111 -111 -111 -111 -111 ...
 $ Ycor    : num 41.7 41.7 41.7 41.7 41.8 ...

str(data08)
'data.frame': 969810 obs. of 5 variables:
 $ cell    : int 705528 710321 710456 715677 720762 720896 699953 700635 700771 705664 ...
 $ prN     : int 293 281 299 278 276 266 282 255 287 280 ...
 $ Location: Factor w/ 31 levels "Blacks_Fork",..: 23 23 23 23 23 23 23 23 23 23 ...
 $ Xcor    : num -111 -111 -111 -111 -111 ...
 $ Ycor    : num 41.8 41.7 41.7 41.7 41.7 ...

I've tried using the following code to solve my problem:

tt <- rbind(data07, data08)
tt.dup <- duplicated(tt[,4:5]) # marks all duplicate rows in data08 from last 2 cols
                               # that correspond to the lat/long
I get tt.dup to be:
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[13] FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE
tt.dup <- tt.dup[-seq_len(nrow(data07))] # remove all data07 entries (first n)
This just throws away the first 10 entries of tt.dup, leaving
[1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
test=data08[tt.dup, ] # index only TRUE/duplicated elements from data08
^ This leaves the c(2,4,5,6,8,10) entries of data08.
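One caveat with the rbind() + duplicated() approach, offered as an aside rather than a diagnosis: duplicated() flags any row that repeats an earlier one, so a coordinate pair that occurs twice within data08 is marked TRUE even if it never appears in data07. A minimal sketch with made-up coordinates follows; the data frames 'a' and 'b' and the paste()-based key are illustrative assumptions, not part of the original post.

```r
# Made-up toy data: 'a' plays the role of data07, 'b' the role of data08.
a <- data.frame(Xcor = c(-111.044, -111.043), Ycor = c(41.7403, 41.7766))
b <- data.frame(Xcor = c(-111.044, -111.040, -111.040),
                Ycor = c(41.7403, 41.7000, 41.7000))

# Approach from the post: stack, flag duplicates, drop the entries for 'a'.
tt <- rbind(a, b)
dup <- duplicated(tt)[-seq_len(nrow(a))]

# Alternative: build a text key per row and test membership in 'a' directly.
key_a <- paste(a$Xcor, a$Ycor)
key_b <- paste(b$Xcor, b$Ycor)
in_a <- key_b %in% key_a

print(dup)   # third row of 'b' is flagged only because it repeats the second
print(in_a)  # only the first row of 'b' has a match in 'a'
```

If data08 has no repeated coordinates of its own, the two approaches should agree.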
When I run the code 'tt.dup' is FALSE for all entries, which I know isn't true.
Only 4 of the entries of tt.dup are FALSE; 6 are TRUE. I don't understand why you think that they are all FALSE. Perhaps your subsets do not accurately reflect the actual nature of your data.

cheers,

Rolf Turner
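One possible explanation, stated as an assumption and not something the post confirms: str() and print() show rounded values, so coordinates that look identical on screen may differ in digits that are not displayed. In that case duplicated() would correctly report no matches on the full data. A minimal sketch with made-up values follows; the 4-decimal rounding is an assumption chosen to match the printed precision.

```r
# Made-up coordinates that look alike when rounded but differ as stored doubles.
x07 <- data.frame(Xcor = -111.04401, Ycor = 41.740302)
x08 <- data.frame(Xcor = -111.04399, Ycor = 41.740298)

tt <- rbind(x07, x08)
exact <- duplicated(tt)               # compares the stored values

# Round to the precision shown in the printed subset before comparing.
rounded <- duplicated(round(tt, 4))

print(exact)    # no duplicates found on the raw values
print(rounded)  # the second row matches the first once rounded
```

Whether rounding is appropriate depends on how the coordinates were generated; it is only worth trying if the unrounded values really do differ.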
Here's a small subset of the data so that you can see exactly where there
are duplicates
data07[1:10,]
          cell prN Location     Xcor    Ycor
710229 *710228 288     Sage -111.044 41.7403*
715546 *715545 276     Sage -111.044 41.7245*
720691 *720690 286     Sage -111.044 41.7131*
720825 *720824 304     Sage -111.044 41.7109*
695612  695611 258     Sage -111.043 41.7766
700491  700490 257     Sage -111.043 41.7653
700627  700626 264     Sage -111.043 41.7630
705372  705371 272     Sage -111.043 41.7517
705508  705507 286     Sage -111.043 41.7495
710364  710363 316     Sage -111.043 41.7381
data08[1:10,]
          cell prN Location     Xcor    Ycor
705529  705528 293     Sage -111.044 41.7517
710322 *710321 281     Sage -111.044 41.7403*
710457  710456 299     Sage -111.044 41.7381
715678 *715677 278     Sage -111.044 41.7245*
720763 *720762 276     Sage -111.044 41.7131*
720897 *720896 266     Sage -111.044 41.7109*
699954  699953 282     Sage -111.043 41.7767
700636  700635 255     Sage -111.043 41.7653
700772  700771 287     Sage -111.043 41.7631
705665  705664 280     Sage -111.043 41.7495
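For reference, the ten rows shown above from each data frame can be typed in and the three steps re-run. This is a sketch that uses only the printed (rounded) Xcor and Ycor values, so it says nothing about the full data.

```r
# Xcor/Ycor as printed in the post for data07[1:10, ] and data08[1:10, ].
data07 <- data.frame(
  Xcor = c(rep(-111.044, 4), rep(-111.043, 6)),
  Ycor = c(41.7403, 41.7245, 41.7131, 41.7109, 41.7766,
           41.7653, 41.7630, 41.7517, 41.7495, 41.7381))
data08 <- data.frame(
  Xcor = c(rep(-111.044, 6), rep(-111.043, 4)),
  Ycor = c(41.7517, 41.7403, 41.7381, 41.7245, 41.7131,
           41.7109, 41.7767, 41.7653, 41.7631, 41.7495))

tt <- rbind(data07, data08)
tt.dup <- duplicated(tt)                  # both columns are coordinates here
tt.dup <- tt.dup[-seq_len(nrow(data07))]  # keep only the data08 entries
print(which(tt.dup))                      # rows of data08 also found in data07
```

On these printed values the code selects rows 2, 4, 5, 6, 8 and 10 of data08, so the logic itself behaves as intended on the subset.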
If anyone has any suggestions as to where I might be going wrong I'd
greatly appreciate it.
Thank you
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
Home phone: +64-9-480-4619