on 02/11/2009 10:43 AM Johannes Habel wrote:
Hello,
I have an unbalanced panel dataset and would like to exclude all objects
that don't appear at least x times.
Therefore, I would like to include a column indicating for every line how
many periods are available, e.g.
id, year, number
1, 2000, 3
1, 2001, 3
1, 2002, 3
2, 2001, 1
3, ..., ...
This would allow me to exclude companies by setting "subset=number>=x".
However, I don't know how to create this column, i.e. how to count the years
for each object and include the numbers into the dataset.
Could anybody help me, please?
Alternatively, is there an easier way to achieve my goal?
Thank you very much.
Johannes Habel
You don't need to add the extra column. You can just create a frequency
table of the unique 'id' values, get the subset of values that meet your
count criteria and then use those values in subset().
Let's create a little larger dataset:
set.seed(1)
id <- sample(letters[1:4], 8, replace = TRUE)
[1] "b" "b" "c" "d" "a" "d" "d" "c"
years <- unlist(lapply(split(id, id),
function(i) 2000:(2000 + length(i) - 1)))
a b1 b2 c1 c2 d1 d2 d3
2000 2000 2001 2000 2001 2000 2001 2002
DF <- data.frame(id = sort(id), year = years)
id year
a a 2000
b1 b 2000
b2 b 2001
c1 c 2000
c2 c 2001
d1 d 2000
d2 d 2001
d3 d 2002
subset(DF, id %in% names(which(table(DF$id) >= 3)))
id year
d1 d 2000
d2 d 2001
d3 d 2002
Step by step:
a b c d
FALSE FALSE FALSE TRUE
names(which(table(DF$id) >= 3))
[1] "d"
Then use subset() as above, filtering only those id's that are in the
names from the table. If we change the requirement to >= 2:
subset(DF, id %in% names(which(table(DF$id) >= 2)))
id year
b1 b 2000
b2 b 2001
c1 c 2000
c2 c 2001
d1 d 2000
d2 d 2001
d3 d 2002
See ?table, ?names, ?which and ?"%in%"
If you really need to add the column, you could use aggregate() to get a
count of years for each id as a data frame, then use merge() to add the
column to DF:
aggregate(DF$year, list(id = DF$id), length)
id x
1 a 1
2 b 2
3 c 2
4 d 3
merge(DF, aggregate(DF$year, list(id = DF$id), length), by = "id")
id year x
1 a 2000 1
2 b 2000 2
3 b 2001 2
4 c 2000 2
5 c 2001 2
6 d 2000 3
7 d 2001 3
8 d 2002 3
Then use subset() as you initially considered.
See ?aggregate and ?merge
HTH,
Marc Schwartz