Making tapply code more efficient

On something the size of your data it took about 30 seconds to
determine the number of unique teachers per student.
x <- cbind(sample(326397, 800967, TRUE), sample(20, 800967, TRUE))
# split the data so you have the list of teachers for each student
system.time(t.s <- split(x[,2], x[,1]))
   user  system elapsed
   0.92    0.01    0.94
t.s[1:7]  # sample data
$`1`
[1] 16

$`2`
[1] 3

$`3`
[1] 1

$`4`
[1] 17

$`6`
[1]  9  9 19

$`7`
[1] 20

$`9`
[1]  3 16 16 10  8 17
# count number of unique teachers per student
system.time(t.a <- sapply(t.s, function(x) length(unique(x))))
   user  system elapsed
  20.17    0.10   20.26

t.a[1:10]
 1  2  3  4  6  7  9 10 11 12
 1  1  1  1  2  1  5  1  1  1
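
The per-student call to unique() inside sapply() is where most of the time goes in the run above. One possible alternative, offered only as a sketch, is to drop duplicated (student, teacher) pairs once for the whole data set and then count the remaining rows per student. The numeric pair key below assumes teacher ids are positive integers smaller than the multiplier (100 here); that encoding is an assumption made for illustration, not something taken from the real data.

```r
# Simulated data of the same shape as above (smaller, so it runs instantly)
set.seed(42)
x <- cbind(sample(1000, 5000, TRUE), sample(20, 5000, TRUE))

# Encode each (student, teacher) pair as one number.
# Assumption: the multiplier (100) exceeds the largest teacher id (20 here).
pair.key <- x[, 1] * 100 + x[, 2]

# Keep the first row of every distinct pair, then count rows per student;
# each remaining row is one distinct teacher for that student
first <- !duplicated(pair.key)
t.b <- table(x[first, 1])

# Cross-check against the split/sapply result
t.a <- sapply(split(x[, 2], x[, 1]), function(v) length(unique(v)))
stopifnot(all(names(t.b) == names(t.a)), all(t.b == t.a))
```

Because duplicated() and table() each make a single pass over the data, this avoids one function call per student.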
Previously, I posed the question pasted down below to the list and
received some very helpful responses. While the code suggestions
provided in response indeed work, they seem to only work with *very*
small data sets and so I wanted to follow up and see if anyone had ideas
for better efficiency. I was quite embarrassed by this, as our SAS
programmers cranked out programs that did this in the blink of an eye
(with a few variables), but R was spinning for days on my Ubuntu machine
and ultimately I saw a message that R was "killed".

The data I am working with has 800967 total rows and 31 total columns.
The ID variable I use as the index variable in tapply() has 326397
unique cases.

length(unique(qq$student_unique_id))
[1] 326397

To give a sense of what my data look like and the actual problem,
consider the following:

qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)),
teacher_unique_id = factor(c(10,10,20,20,25)))

This is a student achievement database where students occupy multiple
rows in the data and the variable teacher_unique_id denotes the class
the student was in. What I am doing is looking to see if the teacher is
the same for each instance of the unique student ID. So, if I implement
the following:

same <- function(x) length( unique(x) ) == 1
results <- data.frame(
        freq = tapply(qq$student_unique_id, qq$student_unique_id, length),
        tch = tapply(qq$teacher_unique_id, qq$student_unique_id, same)
)

I get the following results. I can see that student 1 appears in the
data twice and the teacher is always the same. However, student 2
appears three times and the teacher is not always the same.

results
  freq   tch
1    2  TRUE
2    3 FALSE

Now, applying this same procedure to a large data set with the
characteristics described above seems to be problematic.
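
For comparison, here is a sketch of how the same two columns might be computed without calling a function once per student. It works on the integer codes of the two factors and uses duplicated() and tabulate(). The names stu, tch, pair.key, and results2 are introduced here for illustration only.

```r
qq <- data.frame(student_unique_id = factor(c(1, 1, 2, 2, 2)),
                 teacher_unique_id = factor(c(10, 10, 20, 20, 25)))

stu <- as.integer(qq$student_unique_id)   # integer codes of the factor levels
tch <- as.integer(qq$teacher_unique_id)
n.stu <- nlevels(qq$student_unique_id)

# One numeric code per (student, teacher) pair; as.numeric() avoids
# integer overflow when there are many students and teachers
pair.key <- (as.numeric(stu) - 1) * nlevels(qq$teacher_unique_id) + tch
first <- !duplicated(pair.key)            # first row of each distinct pair

results2 <- data.frame(
  freq = tabulate(stu, nbins = n.stu),              # rows per student
  tch  = tabulate(stu[first], nbins = n.stu) == 1,  # exactly one distinct teacher?
  row.names = levels(qq$student_unique_id)
)
```

On the toy data this gives the same freq and tch values as the tapply() version. One difference to keep in mind: a student level that never occurs in the data gets freq 0 and tch FALSE here, whereas tapply() returns NA for it.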

Does anyone have suggestions for making this efficient enough to run on
large data such as I described?

Harold

sessionInfo()
R version 2.8.1 (2008-12-22)
x86_64-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.U
TF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=
C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATI
ON=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

##### Original question posted on 1/13/09
Suppose I have a dataframe as follows:

dat <- data.frame(id = c(1,1,2,2,2), var1 = c(10,10,20,20,25), var2 =
c('foo', 'foo', 'foo', 'foobar', 'foo'))

Now, if I were to subset by id, such as:

subset(dat, id==1)
  id var1 var2
1  1   10  foo
2  1   10  foo

I can see that the elements in var1 are exactly the same and the
elements in var2 are exactly the same. However,

subset(dat, id==2)
  id var1   var2
3  2   20    foo
4  2   20 foobar
5  2   25    foo

Shows the elements are not the same for either variable in this
instance. So, what I am looking to create is a data frame that would be
like this

id      freq    var1    var2
1       2       TRUE    TRUE
2       3       FALSE   FALSE

Where freq is the number of times the ID is repeated in the dataframe. A
TRUE appears in the cell if all elements in the column are the same for
the ID and FALSE otherwise. For my problem, it does not matter which
values differ.
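
A rough sketch of one way to build that data frame for several columns at once is shown here. The helper all_same() is hypothetical (it is not a base R function), and the sapply() step assumes there are at least two distinct ids so that it returns a matrix.

```r
dat <- data.frame(id = c(1, 1, 2, 2, 2),
                  var1 = c(10, 10, 20, 20, 25),
                  var2 = c("foo", "foo", "foo", "foobar", "foo"))

# Hypothetical helper: for each id, TRUE if all values in v are identical
all_same <- function(v, id) {
  id <- as.factor(id)
  first <- !duplicated(data.frame(id, v))   # one row per distinct (id, value) pair
  tabulate(as.integer(id)[first], nbins = nlevels(id)) == 1
}

id.f <- as.factor(dat$id)
out <- data.frame(
  id   = levels(id.f),
  freq = tabulate(as.integer(id.f), nbins = nlevels(id.f)),  # rows per id
  sapply(dat[c("var1", "var2")], all_same, id = dat$id)      # one logical column per variable
)
```

Adding more column names to the vector passed to sapply() would extend the check to further variables.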

The way I am thinking about tackling this is to loop through the ID
variable and compare the values in the various columns of the dataframe.
The problem I am encountering is that I don't think all.equal or
identical are the right functions in this case.

So, say I wanted to compare the elements of var1 for id == 1. I
would have

x <- c(10,10)

Of course, the following works

all.equal(x[1], x[2])
[1] TRUE

As would a similar call to identical. However, what if I only have a
vector of values (or if the column consists of names) that I want to
assess for equality when I am trying to automate a process over
thousands of cases? As in the example above, the vector may contain only
two values or it may contain many more. The number of values in the
vector differs by id.
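
As an illustration of a check that does not depend on the vector's length, counting the distinct values is one option. The function names below are made up for this sketch. Note that unique() compares exactly, unlike all.equal(), which allows a small numeric tolerance, so a tolerance-based variant is included as well.

```r
# TRUE when every element of x is the same value; works for any length and
# for numeric, character, or factor vectors (exact comparison)
all_identical <- function(x) length(unique(x)) == 1

# Hypothetical tolerance-based variant for floating-point values
nearly_identical <- function(x, tol = 1e-8) diff(range(x)) <= tol

all_identical(c(10, 10))                   # TRUE
all_identical(c("foo", "foobar", "foo"))   # FALSE
all_identical(c(0.1 + 0.2, 0.3))           # FALSE: the two doubles differ slightly
nearly_identical(c(0.1 + 0.2, 0.3))        # TRUE
```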

Any thoughts?

Harold

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
