Skip to content
Prev 392203 / 398502 Next

How to parse a really silly date with lubridate

Eberhard,

Others have supplied ways to do this using various date management functions. But I want to add another option that may make sense if the dates are not all quite as predictable.

You can use your own regular expressions in R as in most languages, that try to match each entry to one or more patterns you supply. When it matches a pattern, you can capture various bits of what you find and recombine them into any valid date format you want or convert them into some date type.

You just showed two pseudo-date-time entries:

	 1 9. Jul 2022 at 11:39
	 2 10. Jul 2022 at 01:58

It looks like the first item in each is a sequence number that can be ignored. But I assume after the ninth, it takes up ever more space as it becomes 10 then 100 and so on. It would thus be of varying width. So, in English, you are looking for a pattern like:

- beginning of the line/entry
- one or more digits in 0-9 followed by what looks like a space, but maybe arbitrary whitespace.
- capture one or two more digits if followed by a period, but not the period.
- capture what looks like a three-letter character string representing a month as a "word" but not what follows.
- whitespace followed by "at" followed by more whitespace can be ignored but must be present.
- capture a two digit number perhaps including the colon and another two-digit number, or two parts without the colon.
- match the end of the entry

If it looks like the above, and you captured the parts needed, you can reconstitute the parts, albeit perhaps first checking to see if the month names are valid (which can be handled by a pattern part that looks like "(Jan|Feb|...|Dec)" and of course making sure the other parts also fit requirements such as days as integers being between 1 and 31 and similar issues about time. My guess is that you don't really want to validate that deeply, let alone do the validation within the regular expression.

But when your data ends up having multiple ways of saying the date that you can enumerate and tell apart, this can be a way to parse them and identify what it matches and so on.

In the above case, full regular expressions may be overkill nd regular string functions might suffice.  There are many ways, including something as simple as this:

dt <- "1 9. Jul 2022 at 11:39"
dt_parts <- strsplit(dt, " ")

Unfortunately, this returns a vector of parts with a decimal point still in place:
[1] "1"     "9."    "Jul"   "2022"  "at"    "11:39"

You can now ignore  the parts in the first fifth positions as the sequence number and "at" are meaningless and use these:

the_day <- dt_parts[2]
the_month <- dt_parts[3]
the_year <- dt_parts[4]
the_time <- dt_parts[6]

Of course, you now may want further processing using one of many ways to turn something like "9." (note there may be a "12." too) or even make it an integer instead of a character string. You may want to change the dates into numerals with something like this:

month_conversion <- list("Jan"="01", Feb="02", Mar="03", Apr="04", May="05", Jun="06", 
                         Jul="07", Aug="08", Sep="09", Oct="10", Nov="11", Dec="12")
month_numeric_string <- month_conversion[the_month]


My point is that if your format is consistent, you can pick out the parts you want as strings or make them decimals or whatever you need and use string functions, such as paste0() to recombine them into the format you want. 

Of course if you can tell some date-related function how to do it for you, that is obviously simpler and better for you.

And, obviously, it would be even nicer if you could get them to not give dates in this format!

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Rui Barradas
Sent: Wednesday, July 13, 2022 10:48 AM
To: Dr Eberhard Lisse <nospam at lisse.NA>; r-help at r-project.org
Subject: Re: [R] How to parse a really silly date with lubridate

Hello,

With lubridate, note in what sequence your datetime elements occur and use the appropriate function.


d <- c('9. Jul 2022 at 11:39', '10. Jul 2022 at 01:58')
lubridate::dmy_hm(d)
#> [1] "2022-07-09 11:39:00 UTC" "2022-07-10 01:58:00 UTC"


Hope this helps,

Rui Barradas

?s 14:40 de 13/07/2022, Dr Eberhard Lisse escreveu:
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.