How might I work with a data.frame where each physical row represents several logical rows?
On Jan 4, 2017, at 7:40 PM, David Wolfskill <r at catwhisker.org> wrote:
I have (time series) data extracted from a repository that is stored
such that each record is for an hour, but each record contains an
ordered set of values throughout the hour. In the following exmaple,
I'll show sets of 4, depicting 0, 15, 30, and 45 minutes after each
"start" point (respectively):
test_data <- structure(list(start = c(1482793200L, 1482793200L, 1482793200L,
1482793200L, 1482793200L, 1482793200L, 1482793200L, 1482793200L,
1482793200L, 1482793200L, 1482793200L, 1482793200L, 1482793200L,
1482793200L, 1482793200L), hostname = c("c001.example.net", "c001.example.net",
"c001.example.net", "c001.example.net", "c001.example.net", "c001.example.net",
"c001.example.net", "c001.example.net", "c001.example.net", "c001.example.net",
"c001.example.net", "c161.example.net", "c161.example.net", "c161.example.net",
"c161.example.net"), mtype = c("health", "health", "net", "net",
"net", "net", "net", "net", "net", "sys", "sys", "net", "sys",
"sys", "sys"), limit_type = c("fill", "serve", "", "", "", "",
"", "", "", "", "", "", "", "", ""), hw = c(1.16, 1.16, 1.16,
1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.21, 1.21, 1.21,
1.21), fw = c("2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1",
"2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1",
"2017Q1.1.1", "2017Q1.1.1", "2016Q4.2.13", "2016Q4.2.13", "2016Q4.2.13",
"2016Q4.2.13"), tcp_state = c("", "", "", "", "closed", "closing",
"fin_wait_2", "last_ack", "syn_rcvd", "", "", "", "", "", ""),
value_type = c("limit", "limit", "", "", "", "", "", "",
"", "", "", "", "", "", ""), nic = c("all", "all", "", "",
"", "", "", "", "", "", "", "mce0", "", "", ""), name = c("in_download_window",
"in_download_window", "tcpOutSegs", "tcpRetransSegs", "tcp_connection_count",
"tcp_connection_count", "tcp_connection_count", "tcp_connection_count",
"tcp_connection_count", "CpuSystem", "CpuUser", "HCOutOctets",
"CpuIdle", "CpuSystem", "CpuUser"), values = c("[0.0, 0.0, 0.0, 0.0]",
"[0.0, 0.0, 0.0, 0.0]", "[260410.94547698632, 258469.54433635762, 260579.2186617577, 258763.2815145043]",
"[18436.311524062934, 18248.952271420356, 18201.62259198662, 17818.39529178736]",
"[5.0, 3.0, 3.0, 3.0]", "[3.0, 3.0, 2.0, 2.0]", "[670.0, 677.0, 685.0, 729.0]",
"[1162.0, 1192.0, 1148.0, 1110.0]", "[25.0, 60.0, 71.0, 33.0]",
"[11.0, 10.0, 11.0, 10.0]", "[2.0, 2.0, 2.0, 2.0]", "[7.873191635959294E9, 7.7377184658927E9, 7.876630519328283E9, 7.714521544912713E9]",
"[70.0, 70.0, 70.0, 70.0]", "[27.0, 26.0, 27.0, 26.0]", "[4.0, 4.0, 4.0, 4.0]"
)), .Names = c("start", "hostname", "mtype", "limit_type",
"hw", "fw", "tcp_state", "value_type", "nic", "name", "values"
), class = "data.frame", row.names = c(NA, -15L))
So of the 15 rows in the above example, row 8 (which depicts the TCP
connection counts in the "last ACK" state) has the values:
* 1162.0
* 1192.0
* 1148.0
* 1110.0
It seems to me that what is wanted is for each of the existing rows to be
replaced by a set of 4 (in this case) rows, where the other columns are
the same (save for "start", which is a timestamp, and should be adjusted
for the respective times).
Perhaps something like this:
# function to read the values in 'values':
parse_values <- function(x) {scan(text= gsub( "\\[|\\]","",x), sep=",") }
# the apply function reads line-by-line
new_dat <- apply(test_data, 1, function(d) data.frame( as.list(d[!names(d) %in% "values"]), nvals <- parse_values(d['values']) ) )
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
# Could suppress the report from scan by adding quiet = TRUE
# now take this list of 4 line data.frames and "rbind" them
# If you wanted these to remain character you would use stringsAsFactors=FALSE in the data.frame call
new_df <- do.call("rbind", new_dat)
head(new_df)
start hostname mtype limit_type hw fw tcp_state value_type nic
1 1482793200 c001.example.net health fill 1.16 2017Q1.1.1 limit all
2 1482793200 c001.example.net health fill 1.16 2017Q1.1.1 limit all
3 1482793200 c001.example.net health fill 1.16 2017Q1.1.1 limit all
4 1482793200 c001.example.net health fill 1.16 2017Q1.1.1 limit all
5 1482793200 c001.example.net health serve 1.16 2017Q1.1.1 limit all
6 1482793200 c001.example.net health serve 1.16 2017Q1.1.1 limit all
name nvals....parse_values.d..values...
1 in_download_window 0
2 in_download_window 0
3 in_download_window 0
4 in_download_window 0
5 in_download_window 0
6 in_download_window 0
str(new_df)
'data.frame': 60 obs. of 11 variables:
$ start : Factor w/ 1 level "1482793200": 1 1 1 1 1 1 1 1 1 1 ...
$ hostname : Factor w/ 2 levels "c001.example.net",..: 1 1 1 1 1 1 1 1 1 1 ...
$ mtype : Factor w/ 3 levels "health","net",..: 1 1 1 1 1 1 1 1 2 2 ...
$ limit_type : Factor w/ 3 levels "fill","serve",..: 1 1 1 1 2 2 2 2 3 3 ...
$ hw : Factor w/ 2 levels "1.16","1.21": 1 1 1 1 1 1 1 1 1 1 ...
$ fw : Factor w/ 2 levels "2017Q1.1.1","2016Q4.2.13": 1 1 1 1 1 1 1 1 1 1 ...
$ tcp_state : Factor w/ 6 levels "","closed","closing",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value_type : Factor w/ 2 levels "limit","": 1 1 1 1 1 1 1 1 2 2 ...
$ nic : Factor w/ 3 levels "all","","mce0": 1 1 1 1 1 1 1 1 2 2 ...
$ name : Factor w/ 8 levels "in_download_window",..: 1 1 1 1 1 1 1 1 2 2 ...
$ nvals....parse_values.d..values...: num 0 0 0 0 0 ...
I'm fairly sure I can write code to do that, but it would end up being something like Perl implemented in R, which seems fairly grotesque: I can't help but think that there ought to be a ... more elegant approach in R (which is why I am sking for help). (I will also end up collecting all of the records for a given timestamp and hostname, and creating one very wide record with all of the data from the set of records thus found. I already have (yes, Perl) code to do this -- though if there's a reasonable way to avoid that, I'm interested.)
I thought you wanted the data in long form.
Once that's all done, I'll be examining various columns, subsetting by attributes of the systems being compared -- but I already have code to do that (that makes use of a different -- and rather more fragile -- approach for extracting the data from its repository). Thanks! Peace, david -- David H. Wolfskill r at catwhisker.org Epistemology for post-truthers: How do we select parts of reality to ignore? See http://www.catwhisker.org/~david/publickey.gpg for my public key.
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
David Winsemius Alameda, CA, USA