Skip to content

How might I work with a data.frame where each physical row represents several logical rows?

5 messages · David Winsemius, David Wolfskill, Jeff Newmiller

#
I have (time series) data extracted from a repository that is stored
such that each record is for an hour, but each record contains an
ordered set of values throughout the hour.  In the following exmaple,
I'll show sets of 4, depicting 0, 15, 30, and 45 minutes after each
"start" point (respectively):

test_data <- structure(list(start = c(1482793200L, 1482793200L, 1482793200L, 
1482793200L, 1482793200L, 1482793200L, 1482793200L, 1482793200L, 
1482793200L, 1482793200L, 1482793200L, 1482793200L, 1482793200L, 
1482793200L, 1482793200L), hostname = c("c001.example.net", "c001.example.net", 
"c001.example.net", "c001.example.net", "c001.example.net", "c001.example.net", 
"c001.example.net", "c001.example.net", "c001.example.net", "c001.example.net", 
"c001.example.net", "c161.example.net", "c161.example.net", "c161.example.net", 
"c161.example.net"), mtype = c("health", "health", "net", "net", 
"net", "net", "net", "net", "net", "sys", "sys", "net", "sys", 
"sys", "sys"), limit_type = c("fill", "serve", "", "", "", "", 
"", "", "", "", "", "", "", "", ""), hw = c(1.16, 1.16, 1.16, 
1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.16, 1.21, 1.21, 1.21, 
1.21), fw = c("2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", 
"2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", "2017Q1.1.1", 
"2017Q1.1.1", "2017Q1.1.1", "2016Q4.2.13", "2016Q4.2.13", "2016Q4.2.13", 
"2016Q4.2.13"), tcp_state = c("", "", "", "", "closed", "closing", 
"fin_wait_2", "last_ack", "syn_rcvd", "", "", "", "", "", ""), 
    value_type = c("limit", "limit", "", "", "", "", "", "", 
    "", "", "", "", "", "", ""), nic = c("all", "all", "", "", 
    "", "", "", "", "", "", "", "mce0", "", "", ""), name = c("in_download_window", 
    "in_download_window", "tcpOutSegs", "tcpRetransSegs", "tcp_connection_count", 
    "tcp_connection_count", "tcp_connection_count", "tcp_connection_count", 
    "tcp_connection_count", "CpuSystem", "CpuUser", "HCOutOctets", 
    "CpuIdle", "CpuSystem", "CpuUser"), values = c("[0.0, 0.0, 0.0, 0.0]", 
    "[0.0, 0.0, 0.0, 0.0]", "[260410.94547698632, 258469.54433635762, 260579.2186617577, 258763.2815145043]", 
    "[18436.311524062934, 18248.952271420356, 18201.62259198662, 17818.39529178736]", 
    "[5.0, 3.0, 3.0, 3.0]", "[3.0, 3.0, 2.0, 2.0]", "[670.0, 677.0, 685.0, 729.0]", 
    "[1162.0, 1192.0, 1148.0, 1110.0]", "[25.0, 60.0, 71.0, 33.0]", 
    "[11.0, 10.0, 11.0, 10.0]", "[2.0, 2.0, 2.0, 2.0]", "[7.873191635959294E9, 7.7377184658927E9, 7.876630519328283E9, 7.714521544912713E9]", 
    "[70.0, 70.0, 70.0, 70.0]", "[27.0, 26.0, 27.0, 26.0]", "[4.0, 4.0, 4.0, 4.0]"
    )), .Names = c("start", "hostname", "mtype", "limit_type", 
"hw", "fw", "tcp_state", "value_type", "nic", "name", "values"
), class = "data.frame", row.names = c(NA, -15L))


So of the 15 rows in the above example, row 8 (which depicts the TCP
connection counts in the "last ACK" state) has the values:

* 1162.0
* 1192.0
* 1148.0
* 1110.0

It seems to me that what is wanted is for each of the existing rows to be
replaced by a set of 4 (in this case) rows, where the other columns are
the same (save for "start", which is a timestamp, and should be adjusted
for the respective times).

I'm fairly sure I can write code to do that, but it would end up being
something like Perl implemented in R, which seems fairly grotesque: I
can't help but think that there ought to be a ... more elegant approach
in R (which is why I am sking for help).

(I will also end up collecting all of the records for a given timestamp
and hostname, and creating one very wide record with all of the data
from the set of records thus found.  I already have (yes, Perl) code to
do this -- though if there's a reasonable way to avoid that, I'm
interested.)

Once that's all done, I'll be examining various columns, subsetting
by attributes of the systems being compared -- but I already have
code to do that (that makes use of a different -- and rather more
fragile -- approach for extracting the data from its repository).

Thanks!

Peace,
david
#
Perhaps something like this:

# function to read the values in 'values':
 parse_values <- function(x) {scan(text= gsub( "\\[|\\]","",x), sep=",") }

# the apply function reads line-by-line
 new_dat <- apply(test_data, 1, function(d) data.frame( as.list(d[!names(d)  %in% "values"]), nvals <- parse_values(d['values']) ) )
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items

# Could suppress the report from scan by adding quiet = TRUE
# now take this list of 4 line data.frames and "rbind" them
# If you wanted these to remain character you would use stringsAsFactors=FALSE in the data.frame call
start         hostname  mtype limit_type   hw         fw tcp_state value_type nic
1 1482793200 c001.example.net health       fill 1.16 2017Q1.1.1                limit all
2 1482793200 c001.example.net health       fill 1.16 2017Q1.1.1                limit all
3 1482793200 c001.example.net health       fill 1.16 2017Q1.1.1                limit all
4 1482793200 c001.example.net health       fill 1.16 2017Q1.1.1                limit all
5 1482793200 c001.example.net health      serve 1.16 2017Q1.1.1                limit all
6 1482793200 c001.example.net health      serve 1.16 2017Q1.1.1                limit all
                name nvals....parse_values.d..values...
1 in_download_window                                  0
2 in_download_window                                  0
3 in_download_window                                  0
4 in_download_window                                  0
5 in_download_window                                  0
6 in_download_window                                  0

str(new_df)
'data.frame':	60 obs. of  11 variables:
 $ start                             : Factor w/ 1 level "1482793200": 1 1 1 1 1 1 1 1 1 1 ...
 $ hostname                          : Factor w/ 2 levels "c001.example.net",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ mtype                             : Factor w/ 3 levels "health","net",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ limit_type                        : Factor w/ 3 levels "fill","serve",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ hw                                : Factor w/ 2 levels "1.16","1.21": 1 1 1 1 1 1 1 1 1 1 ...
 $ fw                                : Factor w/ 2 levels "2017Q1.1.1","2016Q4.2.13": 1 1 1 1 1 1 1 1 1 1 ...
 $ tcp_state                         : Factor w/ 6 levels "","closed","closing",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ value_type                        : Factor w/ 2 levels "limit","": 1 1 1 1 1 1 1 1 2 2 ...
 $ nic                               : Factor w/ 3 levels "all","","mce0": 1 1 1 1 1 1 1 1 2 2 ...
 $ name                              : Factor w/ 8 levels "in_download_window",..: 1 1 1 1 1 1 1 1 2 2 ...
 $ nvals....parse_values.d..values...: num  0 0 0 0 0 ...
I thought you wanted the data in long form.
David Winsemius
Alameda, CA, USA
#
On Wed, Jan 04, 2017 at 08:33:46PM -0800, David Winsemius wrote:
Hmmm.... OK; that looks a lot better than the stuff that was coming to
my mind -- thanks! :-)
Aye.
Sorry; I'm not understanding what you mean: My background is a lot more
toward systems administration than statistical analysis.

The repository I'm using has a rather large number of individual metrics
from a given server -- each provided on a separate row.  (That's why one
of the columns is called "name" -- it provides the (base) "name" of the
metric that corresponds to the "values" on the given row.)  I'll plan to
assemble the rows for a given server & timestamp into a single row --
thuse, I would have the tcp_connection_count for the "last ACK" state
and for the "fin_wait_2" state, as well as CpuSystem, CpuUser, CpuIdle,
... for the given server & timestamp on a single row (eventually).
Thanks again!

Peace,
david
#
If you wanted it in wide form, you could just join the individual id columns to a named list made from the parsed 'values'.
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
Read 4 items
'data.frame':	15 obs. of  14 variables:
 $ start     : Factor w/ 1 level "1482793200": 1 1 1 1 1 1 1 1 1 1 ...
 $ hostname  : Factor w/ 2 levels "c001.example.net",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ mtype     : Factor w/ 3 levels "health","net",..: 1 1 2 2 2 2 2 2 2 3 ...
 $ limit_type: Factor w/ 3 levels "fill","serve",..: 1 2 3 3 3 3 3 3 3 3 ...
 $ hw        : Factor w/ 2 levels "1.16","1.21": 1 1 1 1 1 1 1 1 1 1 ...
 $ fw        : Factor w/ 2 levels "2017Q1.1.1","2016Q4.2.13": 1 1 1 1 1 1 1 1 1 1 ...
 $ tcp_state : Factor w/ 6 levels "","closed","closing",..: 1 1 1 1 2 3 4 5 6 1 ...
 $ value_type: Factor w/ 2 levels "limit","": 1 1 2 2 2 2 2 2 2 2 ...
 $ nic       : Factor w/ 3 levels "all","","mce0": 1 1 2 2 2 2 2 2 2 2 ...
 $ name      : Factor w/ 8 levels "in_download_window",..: 1 1 2 3 4 4 4 4 4 5 ...
 $ V1        : num  0 0 260411 18436 5 ...
 $ V2        : num  0 0 258470 18249 3 ...
 $ V3        : num  0 0 260579 18202 3 ...
 $ V4        : num  0 0 258763 17818 3 ...
David Winsemius
Alameda, CA, USA
#
Below is my interpretation of one way to achieve your stated goal. I don't 
know what analysis you plan to perform, but this seems unlikely to be be 
my preferred analysis format (I think I would probably analyze subsets of 
the records related to specific parts of the transactions).

library(dplyr)
library(tidyr)

fifteenminutes <- 15 * 60

# for education, show intermediate results
# strip out square brackets
View(   test_data
     %>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
     )

# split the single column into multiple columns
View(   test_data
     %>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
     %>% separate( values, paste0( "value", 0:3 ), ", *" )
     )

# pull separate value columns into one column called value, with a new
# column vcol to hold the name of the original column
View(   test_data
     %>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
     %>% separate( values, paste0( "value", 0:3 ), ", *" )
     %>% gather( vcol, value, c( value0, value1, value2, value3 ) )
     )

# create a timestamp column for the individual values
View(   test_data
     %>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
     %>% separate( values, paste0( "value", 0:3 ), ", *" )
     %>% gather( vcol, value, c( value0, value1, value2, value3 ) )
     %>% mutate( timestamp = start
                           + fifteenminutes
                             * as.numeric( sub( "value", "", vcol ) ) )
     )

# remove the old vcol column now that timestamp column is created
View(   test_data
     %>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
     %>% separate( values, paste0( "value", 0:3 ), ", *" )
     %>% gather( vcol, value, c( value0, value1, value2, value3 ) )
     %>% mutate( timestamp = start
                           + fifteenminutes
                             * as.numeric( sub( "value", "", vcol ) ) )
     %>% select( -vcol )
     )

# unite several columns that currently distinguish various rows
View(   test_data
     %>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
     %>% separate( values, paste0( "value", 0:3 ), ", *" )
     %>% gather( vcol, value, c( value0, value1, value2, value3 ) )
     %>% mutate( timestamp = start
                           + fifteenminutes
                             * as.numeric( sub( "value", "", vcol ) ) )
     %>% select( -vcol )
     %>% unite( mname, mtype, nic, tcp_state, limit_type, value_type, name )
     )

# spread values out into separate columns
test_data2 <- (   test_data
               %>% mutate( values = sub( "^\\[(.*)\\]$", "\\1", values ) )
               %>% separate( values, paste0( "value", 0:3 ), ", *" )
               %>% gather( vcol, value, c( value0, value1, value2, value3 ) 
)
               %>% mutate( timestamp = start + fifteenminutes * as.numeric( 
sub( "value", "", vcol ) ) )
               %>% select( -vcol )
               %>% unite( mname, mtype, nic, tcp_state, limit_type, 
value_type, name )
               %>% spread( mname, value )
               )

View( test_data2 )
On Wed, 4 Jan 2017, David Wolfskill wrote:

            
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k