Greetings, all.
I've got a datafile I've been working with that has an idiosyncratic,
heterogeneous format. It's roughly like:
[...]
DISKREAD,metadata about disks
MEM,metadata about memory
ZZZZ,observation-identifier,time,date
DISKREAD,observation-identifier,data about disks
MEM,observation-identifier,data about memory
[ and repeat for each observation ]
What I've done in the past is take the monolithic file and
preprocess it into separate files, one per observation type. The observation
types are structurally self-similar, so once I have them split up,
normal read.csv methods work just fine. Then I read the ZZZZ file to
get timestamps, and whichever observation files I care about on this
run.
But ideally, I'd like to do this entire operation with R features, and
without multiple passes through the file.
The line lengths vary wildly, so a read.table doesn't help.
I was visualizing the following:
+ create a FIFO for each desired observation class, including the ZZZZ metadata
+ In one pass through the source file, populate the FIFOs with their data
+ read.csv the output sides of the FIFOs.
But I have problems right out of the gate: when I set a data.frame
element to the output of fifo(), what actually gets inserted seems to
be an integer; I am guessing it's being turned into a factor.
example:
----
desired_slices=c("ZZZZ","DISKWRITE")
temps = data.frame(slice=desired_slices,row.names=1,handle=I(""))
temps["ZZZZ",] = fifo("./ZZZZ",open="w+")
showConnections()
(you can see that the connection is open)
temps
(you can see that the content of the data.frame cell is the filehandle number)
-----
Am I just barking up the wrong tree?
- Allen S. Rout
reading heterogeneous CSV
2 messages · Allen S. Rout, Gabor Grothendieck
This will read it all in, and then you can decide what you want to do with it:

Lines <- "DISKREAD,metadata about disks
MEM,metadata about memory
ZZZZ,observation-identifier,time,date
DISKREAD,observation-identifier,data about disks
MEM,observation-identifier,data about memory"
DF <- read.table(textConnection(Lines), sep = ",", fill = TRUE)
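
If you would rather end up with one data frame per record type, each with its own column count, a possible follow-on is to read the raw lines once and group them by their first field before parsing. This is only a sketch under assumptions: it reuses the sample text from the question, the names raw, type, by_type, parse_group and slices are made up for illustration, and a real run would call readLines() on the actual file instead of a text connection.

```r
# Sample input copied from the layout described in the question.
Lines <- "DISKREAD,metadata about disks
MEM,metadata about memory
ZZZZ,observation-identifier,time,date
DISKREAD,observation-identifier,data about disks
MEM,observation-identifier,data about memory"

# Single pass over the input: pull every line into memory once.
con <- textConnection(Lines)
raw <- readLines(con)
close(con)

# The record type is everything before the first comma.
type <- sub(",.*$", "", raw)

# Group the raw lines by record type: one character vector per type.
by_type <- split(raw, type)

# Parse each group on its own, so each type gets its own set of columns.
# header = FALSE because the leading metadata line of a group may have
# fewer fields than the observation lines (read.csv fills the gap).
parse_group <- function(x) {
  read.csv(text = x, header = FALSE, stringsAsFactors = FALSE)
}
slices <- lapply(by_type, parse_group)

# Pick out only the record types wanted on this run.
wanted <- slices[c("ZZZZ", "DISKREAD")]
```

The grouping happens in memory, so no FIFOs or temporary files are needed. If you do still want one connection per record type, keeping them in a plain list rather than in a data.frame column should preserve the connection objects, since a data frame cell appears to hold only the underlying integer.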
On Tue, Aug 11, 2009 at 2:55 PM, Allen S. Rout <asr at ufl.edu> wrote:
[...]