Parsing JSON records to a dataframe
On 01/07/2011 12:05 AM, Dieter Menne wrote:
Jeroen Ooms wrote:
What is the most efficient method of parsing a dataframe-like structure
that has been JSON-encoded in record-based format rather than vector-based
format? For example, a structure like this:
[ {"name":"joe", "gender":"male", "age":41}, {"name":"anna",
"gender":"female", "age":23} ]
RJSONIO parses this as a list of lists. I would then have to apply
as.data.frame to each record and append the result to an existing
dataframe, which is terribly slow.
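To make the two layouts concrete, here is a minimal base-R sketch of what each shape looks like once parsed, along with the slow row-by-row pattern described above. The object names (`records`, `columns`, `df_slow`, `df_cols`) are illustrative only and are not from the original post; the lists are typed by hand, so no JSON package is needed.

```r
# Record-based: one named list per row, which is roughly what a JSON
# array of objects turns into after parsing.
records <- list(
  list(name = "joe",  gender = "male",   age = 41),
  list(name = "anna", gender = "female", age = 23)
)

# Vector-based: one vector per column, which is how a data.frame is stored.
columns <- list(
  name   = c("joe", "anna"),
  gender = c("male", "female"),
  age    = c(41, 23)
)

# The slow pattern: convert each record, then append it to a growing
# data.frame. Every rbind copies the rows accumulated so far.
df_slow <- data.frame()
for (rec in records) {
  df_slow <- rbind(df_slow, as.data.frame(rec, stringsAsFactors = FALSE))
}

# The column layout converts in a single call.
df_cols <- as.data.frame(columns, stringsAsFactors = FALSE)
```

Both results hold the same values; the difference is only in how much work it takes to get there as the number of records grows.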
unlist is pretty fast. The solution below assumes that you know the
structure of your data, so it is not very flexible, but it should show you
that the conversion to data.frame is not the bottleneck.
# json
library(RJSONIO)
# [ {"name":"joe", "gender":"male", "age":41},
# {"name":"anna", "gender":"female", "age":23} ]
n = 300000
d = data.frame(name=rep(c("joe","anna"),n),
gender=rep(c("male","female"),n),
age = rep(c("23","41"),n))
dj = toJSON(d)
This doesn't create the required structure
cat(dj)
{
"name": [ "joe", "anna", "joe", "anna" ],
"gender": [ "male", "female", "male", "female" ],
"age": [ "23", "41", "23", "41" ]
}
Instead, build the record-based JSON string directly:
library(rjson)
n <- 1000
name <- apply(matrix(sample(letters, n * 5, TRUE), n),
1, paste, collapse="")
gender <- sample(c("male", "female"), n, TRUE)
age <- ceiling(runif(n, 20, 60))
recs <- sprintf('{"name": "%s", "gender":"%s", "age":%d}',
name, gender, age)
j <- sprintf("[%s]", paste(recs, collapse=","))
lol <- fromJSON(j)
and then with
f <- function(lst)
function(nm) unlist(lapply(lst, "[[", nm), use.names=FALSE)
oopt <- options(stringsAsFactors=FALSE) # convenience for 'identical'
system.time({
    df0 <- as.data.frame(Map(f(lol), names(lol[[1]])))
})
#    user  system elapsed
#   0.006   0.000   0.006

versus for instance

system.time({
    df1 <- do.call(rbind, lapply(lol, data.frame))
})
#    user  system elapsed
#   1.497   0.000   1.500
identical(df0, df1)
# [1] TRUE

Martin
system.time(d1 <- fromJSON(dj))
# user system elapsed
# 4.06 0.26 4.32
system.time(
dd <- data.frame(
name = unlist(d1$name),
gender = unlist(d1$gender),
age=as.numeric(unlist(d1$age)))
)
# user system elapsed
# 1.13 0.05 1.18
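The column-by-column idea used in this thread can be wrapped in a helper that does not hard-code the field names. The following is only a sketch under stated assumptions: `records_to_df` is a hypothetical name, each record is assumed to be a named list of scalar values, and a field missing from a record is filled with NA (a choice made here, not something the posts above specify).

```r
# Hypothetical helper: convert a list of records (named lists of scalars)
# to a data.frame one column at a time.
records_to_df <- function(records) {
  # Union of field names, in order of first appearance.
  fields <- unique(unlist(lapply(records, names)))
  cols <- lapply(fields, function(nm) {
    # Pull one field from every record; a missing or NULL field becomes NA.
    vals <- lapply(records, function(rec) {
      v <- rec[[nm]]
      if (is.null(v)) NA else v
    })
    unlist(vals, use.names = FALSE)
  })
  names(cols) <- fields
  as.data.frame(cols, stringsAsFactors = FALSE)
}

# Hand-written records standing in for parsed JSON.
records <- list(
  list(name = "joe",  gender = "male",   age = 41),
  list(name = "anna", gender = "female"),  # no age field
  list(name = "bob",  age = 35)            # no gender field
)
df <- records_to_df(records)
```

The NA check adds some per-record overhead compared with the plain `lapply(lst, "[[", nm)` version, so when every record is known to have the same fields the simpler form is likely the better choice.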
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793