Hello,
It's better if you keep this on the list; the odds of getting more and
better answers are greater.
Inline.
On 13-07-2013 15:38, serenamasino at gmail.com wrote:
Hi Rui,
thanks for your reply.
No, my problem isn't one of reshaping. I just want R to know that I have a panel, not just cross sections or time series.
In other words, if I had cross-section data:
COUNTRY YEAR GDP
Albania 1999 3
Barbados 1999 5
Congo 1999 1
Denmark 1999 11
etc. .. ..
My ID here is country, but each observation is its own cluster, independent of the others, so I don't need to tell R anything: the ID is already a unique identifier.
Whereas if I have a panel
COUNTRY YEAR GDP
Albania 1999 3
Albania 2000 3.5
Albania 2001 3.7
Albania 2002 4
Albania 2003 4.5
Barbados 1999 5
Barbados 2000 5
Barbados 2001 5.1
Barbados 2002 4
Barbados 2003 3
Congo 1999 1
Congo 2000 2
Congo 2001 2
Congo 2002 3
Congo 2003 4
Denmark 1999 11
Denmark 2000 12
Denmark 2001 13
Denmark 2002 10
Denmark 2003 10
etc. .. ..
How do I tell R that Albania is the same ID in all 5 years of the panel? In other words, Albania has to be identified by the same number in the "factor" vector R uses to code it. Then Barbados is ID 2 in all its years, Congo is ID 3, and so on.
R already does that, factors are coded as integers:
as.integer(dat$COUNTRY) # Albania is 1, etc
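As a minimal sketch of the point above, the panel from the question can be rebuilt and its integer codes inspected. The data frame name `dat` follows the replies in this thread; building it with `factor()` and `rep()` is only an assumption about how the data were loaded.

```r
# Rebuild the example panel from the question (assumed layout: one row per
# country-year, COUNTRY stored as a factor).
dat <- data.frame(
  COUNTRY = factor(rep(c("Albania", "Barbados", "Congo", "Denmark"), each = 5)),
  YEAR    = rep(1999:2003, times = 4),
  GDP     = c(3, 3.5, 3.7, 4, 4.5,
              5, 5, 5.1, 4, 3,
              1, 2, 2, 3, 4,
              11, 12, 13, 10, 10)
)

# Each country keeps one integer code across all of its years.
id <- as.integer(dat$COUNTRY)

# The codes index into the factor levels, which are sorted alphabetically
# by default (not by order of appearance).
lev <- levels(dat$COUNTRY)
```

Note that the codes follow the order of `levels(dat$COUNTRY)`, so a country's number depends on how the factor was created, not on where it first appears in the data.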
In Stata, you sort 'by country year' and the program knows it is a panel of entities observed more than once over time. But I am not sure how to tell R the same thing.
In practice, it is important to define where one country ends and the next begins because:
1) if one creates lags of variables and the program doesn't know where the boundaries between countries are, the lag for the first year of Barbados in my previous example will be calculated using the last year of Albania, that is, the preceding country.
A way of doing this, equivalent to the previous line of code if the
countries are grouped consecutively, is
cumsum(c(TRUE, dat$COUNTRY[-nrow(dat)] != dat$COUNTRY[-1L]))
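For the lag concern in point 1), one base-R possibility is to compute the lag separately within each country, so the first year of a country gets NA instead of the previous country's last value. This is only a sketch: the column name `GDP_lag1` is made up for illustration, and the data are assumed to be sorted by year within country.

```r
# Example panel from the question (assumed layout).
dat <- data.frame(
  COUNTRY = factor(rep(c("Albania", "Barbados", "Congo", "Denmark"), each = 5)),
  YEAR    = rep(1999:2003, times = 4),
  GDP     = c(3, 3.5, 3.7, 4, 4.5,
              5, 5, 5.1, 4, 3,
              1, 2, 2, 3, 4,
              11, 12, 13, 10, 10)
)

# Make sure rows are ordered by year within each country before lagging.
dat <- dat[order(dat$COUNTRY, dat$YEAR), ]

# ave() applies FUN to the GDP values of each country separately and puts
# the results back in the original row positions.
# GDP_lag1 is a hypothetical column name.
dat$GDP_lag1 <- ave(dat$GDP, dat$COUNTRY,
                    FUN = function(x) c(NA, head(x, -1)))
```

With this approach the grouping variable carries the country boundaries, so no separate "panel declaration" is needed for the lag itself.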
2) I need to create country dummies that take the value 1 whenever the country ID matches. If Albania has 5 years of observations and each year's observation appears with a different ID, the country dummies will not be created. If instead Albania has the same country identifier (1) in all the years in which it is observed, the country dummy will equal 1 whenever Albania is the country observed.
I doubt you need to create dummies; R does it for you when you use a
factor in a model. Internally, factors are coded as integers, so all you
need is to coerce them to integer, as I said earlier.
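To illustrate the remark about dummies, here is a small sketch using base R's `model.matrix()` and `lm()`. The regression of GDP on country alone is purely an illustration, not a model suggested anywhere in this thread.

```r
# Example panel from the question (assumed layout).
dat <- data.frame(
  COUNTRY = factor(rep(c("Albania", "Barbados", "Congo", "Denmark"), each = 5)),
  YEAR    = rep(1999:2003, times = 4),
  GDP     = c(3, 3.5, 3.7, 4, 4.5,
              5, 5, 5.1, 4, 3,
              1, 2, 2, 3, 4,
              11, 12, 13, 10, 10)
)

# One 0/1 column per country; "- 1" drops the intercept so that every
# level gets its own dummy column.
X <- model.matrix(~ COUNTRY - 1, data = dat)

# In a model formula the factor is expanded automatically: an intercept
# plus one dummy for each country except the reference level.
fit <- lm(GDP ~ COUNTRY, data = dat)
```

Here `colnames(X)` shows the generated dummy names, and `coef(fit)` shows one coefficient per non-reference country plus the intercept.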
Rui Barradas
Hi,
as.integer(dat$COUNTRY) # would be the easiest (Rui's solution).
Other options could also be used:
library(plyr)
as.integer(mapvalues(dat$COUNTRY, levels(dat$COUNTRY), seq_along(levels(dat$COUNTRY))))
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
# or
match(dat$COUNTRY, levels(dat$COUNTRY))
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
# if `COUNTRY` is not a factor
dat$COUNTRY <- as.character(dat$COUNTRY)
as.integer(mapvalues(dat$COUNTRY, unique(dat$COUNTRY), seq_along(unique(dat$COUNTRY))))
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
# or (if it is already sorted and every country has the same number of rows)
(seq_along(dat$COUNTRY) - 1) %/% as.vector(table(dat$COUNTRY)) + 1
# [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
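The integer-division line above relies on every country having the same number of rows. For an unbalanced panel, a sketch of two base-R alternatives follows; the short vector `x` is invented for illustration and is not data from this thread.

```r
# Hypothetical unbalanced example: 3 rows, 1 row, 2 rows.
x <- c("Albania", "Albania", "Albania", "Barbados", "Congo", "Congo")

# IDs numbered by order of first appearance; does not require sorted data.
id_first <- match(x, unique(x))

# IDs numbered by consecutive runs; requires each country's rows to be
# grouped together.
r <- rle(x)
id_run <- rep(seq_along(r$lengths), r$lengths)
```

Both give the same IDs here; they differ only if a country's rows are not contiguous, in which case `rle()` would start a new ID for each separate run.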
A.K.
----- Original Message -----
From: Rui Barradas <ruipbarradas at sapo.pt>
To: serenamasino at gmail.com
Cc: 'r-help' <r-help at r-project.org>
Sent: Saturday, July 13, 2013 12:04 PM
Subject: Re: [R] How to set panel data format