JDataFrame API
Hi Simon,
Aha! I re-read your message and noticed this line:
lapply(J("A")$direct(), .jevalArray)
which I had overlooked earlier. I wrote an example that is very
similar to yours and now see what you mean about how we can do
this directly.
Many thanks,
T
groovyScript <- paste(
    "def stringList = [] as java.util.List",
    "def numberList = [] as java.util.List",
    "for (def ctr in 0..99) { stringList << new String(\"TGIF $ctr\"); numberList << ctr }",
    "def strings = stringList.toArray()",
    "def numbers = numberList.toArray()",
    "def result = [strings, numbers]",
    "return (Object[]) result",
    sep = "\n")

result <- Evaluate(groovyScript = groovyScript)
temp <- lapply(result, .jevalArray)
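For reference, the Java side that a call like J("A")$direct() relies on can be as small as a static method returning an Object[] of primitive arrays, which .jevalArray() then unwraps element by element on the R side. The class below is my own sketch of that pattern (the name A and method direct() follow Simon's example; the body is an assumption, not his gist):

```java
// Sketch: columns exposed as primitive arrays inside an Object[].
// rJava's .jevalArray() can unwrap each element into an R vector.
public class A {
    public static Object[] direct() {
        int n = 100;
        String[] strings = new String[n];
        double[] numbers = new double[n];
        for (int i = 0; i < n; i++) {
            strings[i] = "TGIF " + i;  // character column
            numbers[i] = i;            // numeric column
        }
        return new Object[] { strings, numbers };
    }
}
```

On the R side, lapply(J("A")$direct(), .jevalArray) then yields a list of two vectors, ready for as.data.frame().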
On Fri, Jan 15, 2016 at 1:58 PM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
On Jan 15, 2016, at 12:35 PM, Thomas Fuller <thomas.fuller at coherentlogic.com> wrote:

Hi Simon,

Thanks for your feedback -- this is an observation that I wasn't considering when I wrote this, mainly because I am, in fact, working with rather small data sets.

BTW: There is code there; it's under the bitbucket link -- here's the direct link if you'd still like to look at it:

https://bitbucket.org/CoherentLogic/jdataframe
Ah, sorry, all links just send you back to the page, so I missed the little field that tells you how to check it out.
Re "for practical purposes it doesn't seem like the most efficient solution" and "So the JSON route is very roughly ~13x slower than using Java directly": I've not benchmarked this and will take a closer look at what you have today -- in fact I may include these details on the JDataFrame page.

The JDataFrame targets the use case where there's significant development being done in Java, data is exported into R, and the developer intends to keep the two separated as much as possible. I could work with Java directly, but then I'd potentially end up with quite a bit of Java code taking up space in R, and I don't like this because if I need to refactor something I have to do it in two places.
No, the code is the same - it makes no difference. The R code is only one call to fetch what you need by calling your Java method. The nice thing is that you in fact save some code, since you can simply access all Java objects directly without any serialization.
There's another use case for the JDataFrame as well and that's in an enterprise application (you may have alluded to this when you said "[i]f you need process separation..."). Consider a business where users are working with R and the application that produces the data is actually running in Tomcat. Shipping large amounts of data over the wire in this example would be a performance destroyer, but for small data sets it certainly would be helpful from a development perspective to expose JSON-based web services where the R script would be able to convert a result into a data frame gracefully.
Yes, sure, that makes sense. Like I said, I would probably use some native format in that case if I worried about performance. Some candidates that come to my mind are ProtoBuf and QAP (the serialization used by Rserve). If you have arrays, you can always serialize them directly, which may be most efficient, but you'd probably have to write the wrapper for that yourself (annoyingly, the default Java methods use big-endian format, which is slower on most machines).

But then, you're right that for Tomcat applications the sizes are small enough that using JSON has the benefit that you can inspect the payload by eye and/or with other tools very easily.

Cheers,
Simon
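Simon's aside about big-endian defaults can be worked around with java.nio: a ByteBuffer with an explicit byte order serializes a double[] column little-endian. A minimal sketch under that assumption (ColumnCodec is a hypothetical helper, not part of any library mentioned in this thread):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: round-trip a double[] column little-endian via ByteBuffer,
// avoiding the big-endian default of DataOutputStream.writeDouble().
public class ColumnCodec {
    public static byte[] encode(double[] col) {
        ByteBuffer buf = ByteBuffer.allocate(col.length * Double.BYTES);
        buf.order(ByteOrder.LITTLE_ENDIAN);
        for (double d : col) buf.putDouble(d);
        return buf.array();
    }

    public static double[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        buf.order(ByteOrder.LITTLE_ENDIAN);
        double[] col = new double[bytes.length / Double.BYTES];
        for (int i = 0; i < col.length; i++) col[i] = buf.getDouble();
        return col;
    }
}
```

Whether the endianness actually matters for a given workload would need measuring; the point is only that the byte order is a one-line choice with java.nio.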
On Fri, Jan 15, 2016 at 10:58 AM, Simon Urbanek <simon.urbanek at r-project.org> wrote:
Tom,

this may be good for embedding small data sets, but for practical purposes it doesn't seem like the most efficient solution. Since you didn't provide any code, I built a test case using the built-in Java JSON API to build a medium-sized dataset (1e6 rows) and read it in just to get a ballpark (see https://gist.github.com/s-u/4efb284e3c15c6a2db16).

# generate:
time java -cp .:javax.json-api-1.0.jar:javax.json-1.0.4.jar A > 1e6

real    0m2.764s
user    0m20.356s
sys     0m0.962s

# read:
system.time(temp <- RJSONIO::fromJSON("1e6"))
   user  system elapsed
  3.484   0.279   3.834
str(temp)
List of 2
 $ V1: num [1:1000000] 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 ...
 $ V2: chr [1:1000000] "X0" "X1" "X2" "X3" ...

For comparison, using Java directly (includes both generation and reading into R):
system.time(temp <- lapply(J("A")$direct(), .jevalArray))
   user  system elapsed
  0.962   0.186   0.494

So the JSON route is very roughly ~13x slower than using Java directly. Obviously, this will vary by data set type etc., since there is R overhead involved as well: for example, if you have only numeric variables, the JSON route is 30x slower on reading alone [50x total]. String variables slow down everyone equally.

Interestingly, the JSON encoding is using all 16 cores, so the 2.7s real time adds up to over 20s CPU time, so on smaller machines you may see more overhead.

If you need process separation, it may be a different story - in principle it is faster to use a more native serialization than JSON, since parsing is the slowest part for big datasets.

Cheers,
Simon
On Jan 14, 2016, at 4:52 PM, Thomas Fuller <thomas.fuller at coherentlogic.com> wrote:

Hi Folks,

If you need to send data from Java to R you may consider using the JDataFrame API -- which converts data into JSON that can then be converted into a data frame in R.

Here's the project page:

https://coherentlogic.com/middleware-development/jdataframe/

and here's a partial example which demonstrates what the API looks like:

String result = new JDataFrameBuilder()
    .addColumn("Code", new Object[] {"WV", "VA"})
    .addColumn("Description", new Object[] {"West Virginia", "Virginia"})
    .toJson();

and in the R script we would need to do this:

temp <- RJSONIO::fromJSON(json)
tempDF <- as.data.frame(temp)

which yields a data frame that looks like this:
tempDF
    Description Code
1 West Virginia   WV
2      Virginia   VA

It is my intention to deploy this project to Maven Central this week, time permitting. Questions and comments are welcomed.

Tom
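For readers curious what fromJSON() expects here: a column-major JSON object (column name mapped to an array of values) deserializes into a named list that as.data.frame() accepts directly. The hand-rolled builder below only illustrates that shape -- JsonColumns is hypothetical, and the real JDataFrameBuilder's output formatting may well differ:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the assumed wire format: {"Code":["WV","VA"], ...} -- one
// JSON array per column, keyed by column name. String columns only,
// no escaping, purely to show the shape RJSONIO::fromJSON consumes.
public class JsonColumns {
    public static String toJson(Map<String, String[]> columns) {
        StringBuilder sb = new StringBuilder("{");
        boolean firstCol = true;
        for (Map.Entry<String, String[]> e : columns.entrySet()) {
            if (!firstCol) sb.append(',');
            firstCol = false;
            sb.append('"').append(e.getKey()).append("\":[");
            String[] values = e.getValue();
            for (int i = 0; i < values.length; i++) {
                if (i > 0) sb.append(',');
                sb.append('"').append(values[i]).append('"');
            }
            sb.append(']');
        }
        return sb.append('}').toString();
    }
}
```

In R, as.data.frame(RJSONIO::fromJSON(json)) on such a payload gives one data-frame column per JSON key, which matches the tempDF output shown above.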
______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel