Confusion with Converting Factors to Dates using as.date
4 messages · Peter Dalgaard, Marc Schwartz, Josip Dasovic
Josip Dasovic wrote:
Dear R-Helpers:

I'm having a problem getting dates into the correct format. I have a data frame, which is based on a .csv file that I imported into R via read.table. R has converted my date variables to factors; when I use the as.Date command, most of the values are converted "correctly" (and by this I guess I mean converted "as I wish them to be") but some have not been. Here's what I have:

str(pk.df)
'data.frame': 206 obs. of 134 variables:
$ uniqid : int 010 015 120 130 210 245 320 330 415 ...
$ st_date : Factor w/ 154 levels "01/01/48","01/01/51",..: 46 27 NA 12 118 NA 63 127 NA NA ...
...

I then convert them to a date class using

st_date.new<-as.Date(st_date, "%m/%d/%y")

This _seems_ to work...

str(st_date.new)
Class 'Date' num [1:206] 8150 8466 NA 33982 10149 ...

But notice the 4th observation; I would like it to be 1963, not 2063.

st_date.new[1:10]
[1] "1992-04-25" "1993-03-07" NA "2063-01-15" "1997-10-15"
[6] NA "1991-05-31" "1994-11-20" NA NA
st_date[1:10]
[1] 04/25/92 03/07/93 <NA> 01/15/63 10/15/97 <NA> 05/31/91
[8] 11/20/94 <NA> <NA>
154 Levels: 01/01/48 01/01/51 01/01/52 01/01/59 01/01/63 ... 12/31/96

I thought that the problem might be that I was converting a factor, so I first converted the variable to a character type (although I understand that this is done automatically) and then to date class, but I still had the same problem. Does anybody know how I can solve this and why I am getting this behavior?

One more tidbit: the earliest date for which the date conversion is "correct" is 1969-04-15, while the most recent date for which the century is "incorrect" is 1967-11-05.
Well, to quote ?strptime:
'%y' Year without century (00-99). If you use this on input, which
century you get is system-specific. So don't! Often values
up to 68 (or 69) are prefixed by 20 and 69 (or 70) to 99 by
19.
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
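Since the century chosen for '%y' can differ between systems, one way to make the result independent of that choice is to parse first and then move any date that falls after a known cutoff back by 100 years. The following is only a sketch under that assumption: parse_mdy_before and its cutoff argument are hypothetical names, not part of base R, and the cutoff used here (the date of this thread) stands in for "the data cannot contain dates later than this".

```r
# Hypothetical helper (not part of base R): parse m/d/yy strings, then
# shift any date that lands after `cutoff` back by one century.
parse_mdy_before <- function(x, cutoff = Sys.Date()) {
  # as.character() turns a factor into its labels rather than its codes
  d <- as.Date(as.character(x), format = "%m/%d/%y")
  lt <- as.POSIXlt(d)
  too_late <- !is.na(d) & d > cutoff       # NA entries are left alone
  lt$year[too_late] <- lt$year[too_late] - 100L
  as.Date(lt)
}

st_date <- factor(c("04/25/92", "03/07/93", NA, "01/15/63"))
parse_mdy_before(st_date, cutoff = as.Date("2008-12-10"))
# [1] "1992-04-25" "1993-03-07" NA           "1963-01-15"
```

Whichever century the platform picks for 63, the result ends up in 1963, because a date in 2063 is after the cutoff and a date in 1963 is not.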
on 12/10/2008 02:41 PM Josip Dasovic wrote:
[...] I then convert them to a date class using st_date.new<-as.Date(st_date, "%m/%d/%y") [...] But notice the 4th observation; I would like it to be 1963, not 2063. [...] Does anybody know how I can solve this and why I am getting this behavior? [...]
This is the consequence of using a two digit year rather than a four
digit year, which BTW, was one of the Y2K issues raised a decade ago...
As per ?strptime:
%y
Year without century (00-99). If you use this on input, which
century you get is system-specific. So don't! Often values up to 68 (or
69) are prefixed by 20 and 69 (or 70) to 99 by 19.
If you know that all of your dates are going to be before 2000, you can
do the following, by using a regex to convert the two digit year to a
four digit year and then use as.Date() with '%Y':
st_date <- "01/15/63"
sub("([0-9]{2})$", "19\\1", st_date)
[1] "01/15/1963"
as.Date(sub("([0-9]{2})$", "19\\1", st_date), format = "%m/%d/%Y")
[1] "1963-01-15"

The better option is to ensure that the source of your data outputs or exports dates with a four digit year, before importing into R.

See ?sub and ?regex

HTH,

Marc Schwartz
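The sub() approach above can be extended to a whole column that contains NAs and a mix of centuries. What follows is a rough sketch, not something posted in the thread: to_date_pivot and its pivot argument are hypothetical names, and the default pivot of 8 (two-digit years above 08 are read as 19xx, the rest as 20xx) is an assumption chosen only because the thread dates from 2008.

```r
# Hypothetical helper: expand a trailing two-digit year to four digits
# using an explicit pivot, then parse with '%Y' so nothing is left to
# the platform's handling of '%y'.
to_date_pivot <- function(x, pivot = 8L) {
  x  <- as.character(x)                              # factor -> labels
  yy <- as.integer(sub(".*([0-9]{2})$", "\\1", x))   # NA stays NA
  century <- ifelse(yy > pivot, "19", "20")
  x4 <- paste0(sub("[0-9]{2}$", "", x), century, sprintf("%02d", yy))
  x4[is.na(x)] <- NA_character_                      # undo paste0's "NA" strings
  as.Date(x4, format = "%m/%d/%Y")
}

to_date_pivot(factor(c("01/15/63", NA, "03/02/05")))
# [1] "1963-01-15" NA           "2005-03-02"
```

If every date is known to be before 2000, this reduces to the one-line sub() call shown above.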
Thank you very much, Peter. As is often the case, R gave me exactly what I asked it to give me, but not what I wanted it to give me. :)

Cheers,
Josip

Research Associate
Human Security Report Project
School for International Studies
Simon Fraser University
Suite 7200--515 W. Hastings St.
Vancouver, BC V6B 5K3
Canada

----- Original Message -----
From: "Peter Dalgaard" <p.dalgaard at biostat.ku.dk>
To: "Josip Dasovic" <j_dasovic at sfu.ca>
Cc: r-help at r-project.org
Sent: Wednesday, December 10, 2008 1:16:48 PM GMT -08:00 US/Canada Pacific
Subject: Re: [R] Confusion with Converting Factors to Dates using as.date