
help matching rows of a data frame

7 messages · Terry Therneau, Eric Berger, Jeff Newmiller +4 more

#
This question likely has a 1 line answer, I'm just not seeing it.  (2, 3, or 10 lines is 
fine too.)

For a vector I can do group <- match(x, unique(x)) to get a vector that labels each 
element of x.
What is an equivalent if x is a data frame?
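
For reference, a minimal sketch of the vector case described above. The sample vector is made up for illustration; it is not from the original post.

```r
# Hypothetical sample vector (an assumption, chosen for illustration)
x <- c("a", "b", "c", "a", "c", "d")

# Each element is labelled with the position of its value in unique(x)
group <- match(x, unique(x))
group
# [1] 1 2 3 1 3 4
```

Equal elements share a label, and labels are numbered in order of first appearance.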

The result does not have to be fast: the data set will have < 100 elements.  Since this is 
inside the survival package, and that package is on  the 'recommended' list, I can't 
depend on any package outside the recommended list.

Terry T.
#
Hi Terry,
I take your question to mean how to label distinct rows of a data frame. If
that is not your question please clarify.
I found the row.match() function in the package prodlim that can be used to
solve this.
However since your request requires no additional dependencies I borrowed
the relevant code from the row.match function.
Here is some obfuscated code to provide your answer in one line, per your
request (less obfuscated code is just below it).

Assuming your data frame is called 'df':

df[,ncol(df)+1] <- match( do.call("paste", c(df[, , drop = FALSE], sep =
"\\r")), do.call("paste", c(unique(df)[, , drop = FALSE], sep = "\\r")) )

The last column of df now contains the 'label', i.e. the index, within
unique(df), of the row that is the same as the given row.

Somewhat less obfuscated

getLabels <- function(df) {
    # Paste each row into a single key, then look each key up
    # among the keys of the unique rows
    match(do.call("paste", c(df[, , drop = FALSE], sep = "\\r")),
          do.call("paste", c(unique(df)[, , drop = FALSE], sep = "\\r")))
}

myDataFrame$label <- getLabels(myDataFrame)
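
A small usage sketch of getLabels(), assuming a made-up six-row data frame (the data are an assumption, not from the original post). It also shows how the index into the unique rows differs from the row number of the first occurrence, which can be obtained by matching the keys against themselves.

```r
# getLabels() as posted above, repeated here so the example is self-contained
getLabels <- function(df) {
  match(do.call("paste", c(df[, , drop = FALSE], sep = "\\r")),
        do.call("paste", c(unique(df)[, , drop = FALSE], sep = "\\r")))
}

# Hypothetical example data (an assumption for illustration)
x <- data.frame(X0 = c("A", "B", "C", "C", "D", "A"),
                X1 = c(1, 2, 1, 1, 3, 1))

# Index of each row among the unique rows
labels <- getLabels(x)
labels
# [1] 1 2 3 3 4 1

# Row number of the first occurrence of each row
key <- do.call("paste", c(x, sep = "\\r"))
first_row <- match(key, key)
first_row
# [1] 1 2 3 3 5 1
```

Either vector works as a grouping variable; they differ only in how the groups are numbered.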


HTH,

Eric


#
"Label" is not a clear term for data frames, but most data frames have rownames. If dta is a data frame, not a tibble:

rownames( dta )[ !duplicated( dta ) ]

Or you could use the row indexes directly:

which( !duplicated( dta ) )
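
A short sketch of what these two expressions return, assuming a made-up data frame named dta with default row names (the data are an assumption for illustration). Both identify the first occurrence of each distinct row rather than labelling every row.

```r
# Hypothetical example data (an assumption for illustration)
dta <- data.frame(X0 = c("A", "B", "C", "C", "D", "A"),
                  X1 = c(1, 2, 1, 1, 3, 1))

# Row indexes of the first occurrence of each distinct row
first_idx <- which(!duplicated(dta))
first_idx
# [1] 1 2 3 5

# The corresponding row names
first_names <- rownames(dta)[!duplicated(dta)]
first_names
# [1] "1" "2" "3" "5"
```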
#
Hi!
2017-09-18 07:13 -0500, Therneau, Terry M., Ph.D. wrote:
Actually, you get a vector of indices matching 'unique(x)', not a
labelled vector.
[1] 1 2 3 1 3 4
So you will generate an index where duplicated rows have the row index
of the first occurrence, right? This could work:
    for (j in (i+1):nrow(x)) {
        if (sum(as.numeric(x[i,]==x[j,]))==ncol(x)) {
            group[j]<-group[i] }
    }
}
[1] "1" "2" "3" "3" "5" "1"
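
A self-contained sketch of the nested-loop idea. The example data, the initialisation of group from the row names, and the outer loop over i are assumptions, since only part of the loop survives in the message above. It also assumes the data frame has at least one row and no missing values.

```r
# Hypothetical example data (an assumption for illustration)
x <- data.frame(X0 = c("A", "B", "C", "C", "D", "A"),
                X1 = c(1, 2, 1, 1, 3, 1))

# Start with every row in its own group, labelled by its row name
group <- rownames(x)

# Compare each row with every later row; a duplicate inherits
# the label of the earlier row
for (i in seq_len(nrow(x) - 1)) {
  for (j in (i + 1):nrow(x)) {
    if (all(x[i, ] == x[j, ])) {
      group[j] <- group[i]
    }
  }
}
group
# [1] "1" "2" "3" "3" "5" "1"
```

This does on the order of nrow(x)^2 row comparisons, which should be acceptable for the fewer than 100 rows mentioned in the question.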

HTH,
Kimmo
#
You could use merge() with an ID column pasted onto the table of names, as
in
tbl <- data.frame(FirstName=c("Abe","Abe","Bob","Chuck","Chuck"),
    Surname=c("Xavier","Yates","Yates","Yates","Zapf"), Id=paste0("P",101:105))
tbl
  FirstName Surname   Id
1       Abe  Xavier P101
2       Abe   Yates P102
3       Bob   Yates P103
4     Chuck   Yates P104
5     Chuck    Zapf P105
merge(data.frame(FirstName=c("Abe","Chuck","Dave"),
    Surname=rep("Yates",3)), tbl, all.x=TRUE)
  FirstName Surname   Id
1       Abe   Yates P102
2     Chuck   Yates P104
3      Dave   Yates <NA>
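
A sketch of how the same merge() idea might be used to label every row of a data frame with a group number. The data and the helper columns orig_row and group are hypothetical names introduced for illustration. Because merge() may reorder rows, the original row order is recorded first and restored afterwards.

```r
# Hypothetical example data (an assumption for illustration)
x <- data.frame(X0 = c("A", "B", "C", "C", "D", "A"),
                X1 = c(1, 2, 1, 1, 3, 1))

# Table of distinct rows, each with a group number
u <- unique(x)
u$group <- seq_len(nrow(u))

# Record the original order, merge on the data columns, then restore the order
x$orig_row <- seq_len(nrow(x))
m <- merge(x, u, by = c("X0", "X1"))
m <- m[order(m$orig_row), ]
m$group
# [1] 1 2 3 3 4 1
```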


Bill Dunlap
TIBCO Software
wdunlap tibco.com

#
In the past I've used apply with paste to generate "group" identifiers:


x<-data.frame("X0"=c("A","B","C","C","D","A"), "X1"=c(1,2,1,1,3,1))

apply(x, 1, paste, collapse=".")
[1] "A.1" "B.2" "C.1" "C.1" "D.3" "A.1"
David Winsemius
Alameda, CA, USA

'Any technology distinguishable from magic is insufficiently advanced.'   -Gehm's Corollary to Clarke's Third Law
#
Yes. My understanding is that you want the identifier to have the same
number of rows as the data frame. A slight variant of David's solution
would then be:

do.call(paste0,x)
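
One caveat with pasting columns together without a separator is that different rows can yield the same string. A minimal sketch, using a made-up two-row data frame (an assumption for illustration), of the collision and of one possible way around it: a separator that is unlikely to occur in the data, followed by match() to turn the keys into integer labels.

```r
# Hypothetical data where paste0 collides (an assumption for illustration)
d <- data.frame(a = c("a", "ab"), b = c("bc", "c"),
                stringsAsFactors = FALSE)

# Without a separator the two distinct rows produce the same key
collide <- do.call(paste0, d)
collide
# [1] "abc" "abc"

# With a separator the keys stay distinct
key <- do.call(paste, c(d, sep = "\r"))
labels <- match(key, unique(key))
labels
# [1] 1 2
```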


-- Bert


