Transforming relational data

Gabor Grothendieck · 2011-02-22T16:44:49Z

On Mon, Feb 14, 2011 at 12:22 PM, mathijsdevaan wrote: > > Hi, > > I have a large dataset with info on individuals (B) that have been involved > in projects (A) during multiple years (C). The dataset contains three > columns: A, B, C. Example: > > ? A ?B ?C > 1 1 ?a ?1999 > 2 1 ?b ?1999 > 3 1 ?c ?1999 > 4 1 ?d ?1999 > 5 2 ?c ?2001 > 6 2 ?d ?2001 > 7 3 ?a ?2004 > 8 3 ?c ?2004 > 9 3 ?d ?2004 > > I am interested in how well all the individuals in a project know each > o

Gabor Grothendieck

Tue, Feb 22, 2011 8:44 AM

On Mon, Feb 14, 2011 at 12:22 PM, mathijsdevaan <mathijsdevaan at gmail.com> wrote:

First define the data frame, DF.  Note we have used column names of
proj, pers and year.  Then append a size column producing DF2.

Using sqldf merge DF2 with itself within project giving one row per
pair in M. At the same time we calculate the reciprocal of size for
each row or 0 if the two components of the pair are the same person.

Next we merge M with itself giving pairs of pairs zeroing out rows
that should not contribute to the sum and aggregating the reciprocal
sizes by project.

The automatic class assignment heuristic does not work well in this
case so we use method = "raw" to bypass it.

DF <-
structure(list(proj = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L),
    pers = structure(c(1L, 2L, 3L, 4L, 3L, 4L, 1L, 3L, 4L), .Label = c("a",
    "b", "c", "d"), class = "factor"), year = c(1999L, 1999L,
    1999L, 1999L, 2001L, 2001L, 2004L, 2004L, 2004L)), .Names = c("proj",
"pers", "year"), row.names = c("1", "2", "3", "4", "5", "6",
"7", "8", "9"), class = "data.frame")

library(sqldf)

DF2 <- transform(DF, size = ave(proj, proj, FUN = length))

M <- sqldf("select proj, year, x.pers || ' ' || y.pers pair,
 (x.pers != y.pers) / (size + 0.0) recip from DF2 x, DF2 y
 using(proj, year, size)", method = "raw")

sqldf("select x.proj,
	  sum((x.year > y.year and x.recip > 0) * y.recip) familiarity
    from M x, M y using (pair) group by x.proj", method = "raw")

The result of the last statement is:

  proj familiarity
1    1         0.0
2    2         0.5
3    3         2.5

More info on sqldf is available at http://sqldf.googlecode.com


The last statement could be replaced by these two in case you want the
intermediate PP:

PP <- sqldf("select *, (x.year > y.year and x.recip > 0) * y.recip familiarity
from M x, M y using (pair)", method = "raw")

sqldf("select proj, sum(familiarity) as familiarity from PP group by proj")

Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

Transforming relational data

Thread (14 messages)