Skip to content

queer data set

3 messages · S.O. Nyangoma, Tony Plate, (Ted Harding)

#
I have a dataset that is basically structureless. Its dimension varies 
from row to row and sep(s) are a mixture of tab and semi colon (;) and 
example is

HEADER1 HEADER2 HEADER3   HEADER3
A1       B1      C1       X11;X12;X13
A2       B2      C2       X21;X22;X23;X24;X25
A3       B3      C3       
A4       B4      C4       X41;X42;X43
A5       B5      C5       X51

etc., say. Note that a blank under HEADER3 corresponds to non 
occurance and all semi colon (;) delimited variables are under 
HEADER3. These values run into tens of thousands. I want to give some 
order to this queer matrix to something like:

HEADER1 HEADER2 HEADER3   HEADER3
A1       B1      C1       X11
A1       B1      C1       X12
A1       B1      C1       X13
A1       B1      C1       X14
A2       B2      C2       X21
A2       B2      C2       X22
A2       B2      C2       X23
A2       B2      C2       X24
A2       B2      C2       X25
A2       B2      C2       X26
A3       B3      C3       NA
A4       B4      C4       X41
A4       B4      C4       X42
A4       B4      C4       X43

Is there a brilliant R-way of doing such task?

Goodday. Stephen.








----- Original Message -----
From: Prof Brian Ripley <ripley at stats.ox.ac.uk>
Date: Monday, August 15, 2005 11:13 pm
Subject: Re: [R] How to get a list work in RData file
#
Here's one way of working with the data you gave:

 > x <- read.table(file("clipboard"), fill=T, header=T)
 > x
   HEADER1 HEADER2 HEADER3           HEADER3.1
1      A1      B1      C1         X11;X12;X13
2      A2      B2      C2 X21;X22;X23;X24;X25
3      A3      B3      C3
4      A4      B4      C4         X41;X42;X43
5      A5      B5      C5                 X51
 > apply(x, 1, function(x) strsplit(x[4], ";")[[1]])
$"1"
[1] "X11" "X12" "X13"

$"2"
[1] "X21" "X22" "X23" "X24" "X25"

$"3"
character(0)

$"4"
[1] "X41" "X42" "X43"

$"5"
[1] "X51"

 > do.call("rbind", apply(x, 1, function(x) {
+    y <- strsplit(x[4], ";")[[1]]
+    x3 <- matrix(x[1:3], ncol=3, nrow=max(1,length(y)), byrow=T)
+    return(cbind(x3, if (length(y)) y else "NA"))
+ }))
       [,1] [,2] [,3] [,4]
  [1,] "A1" "B1" "C1" "X11"
  [2,] "A1" "B1" "C1" "X12"
  [3,] "A1" "B1" "C1" "X13"
  [4,] "A2" "B2" "C2" "X21"
  [5,] "A2" "B2" "C2" "X22"
  [6,] "A2" "B2" "C2" "X23"
  [7,] "A2" "B2" "C2" "X24"
  [8,] "A2" "B2" "C2" "X25"
  [9,] "A3" "B3" "C3" "NA"
[10,] "A4" "B4" "C4" "X41"
[11,] "A4" "B4" "C4" "X42"
[12,] "A4" "B4" "C4" "X43"
[13,] "A5" "B5" "C5" "X51"
 >

This of course is a matrix; you can convert it back to a dataframe using 
as.data.frame() if you desire.  Use either "NA" (with quotes) or NA 
(without quotes) to control whether you get just the string "NA" or an 
actual character NA value in column 4.  If you're processing a huge 
amount of data, you can probably do better by rewriting the above code 
to avoid implicit coercions of data types.

hope this helps,

Tony Plate
S.O. Nyangoma wrote:
#
On 15-Aug-05 S.O. Nyangoma wrote:
HEADER1 HEADER2 HEADER3   HEADER3
A1       B1      C1       X11;X12;X13
A2       B2      C2       X21;X22;X23;X24;X25
A3       B3      C3       
A4       B4      C4       X41;X42;X43
A5       B5      C5       X51
I don't know about a brilliant R trick (though I'm sure others do).

But (on my usual hobby-horse) if you have 'awk' available (and
don't mind using it) then it will do the job:

First create an 'awk' program file as follows:

  {for(i in A) delete A[i]}
  {if($4==""){A[1]="NA"}
    else {split($4,A,";")}}
  {B = $1 "\t" $2 "\t" $3}
  {for(i in A) print B "\t" A[i]}

and call this say split.awk

Then run

  awk -f split.awk

and feed it the lines of your primary dataset as above. Here's
a cut&paste from my Linux session, where the first block of
lines after "awk -f split.awk" are the lines being input to
the program, starting with the header, followed by the output
of the program starting with the header again:

awk -f split.awk
HEADER1 HEADER2 HEADER3   HEADER3
A1       B1      C1       X11;X12;X13
A2       B2      C2       X21;X22;X23;X24;X25
A3       B3      C3       
A4       B4      C4       X41;X42;X43
A5       B5      C5       X51
HEADER1 HEADER2 HEADER3 HEADER3
A1      B1      C1      X11
A1      B1      C1      X12
A1      B1      C1      X13
A2      B2      C2      X24
A2      B2      C2      X25
A2      B2      C2      X21
A2      B2      C2      X22
A2      B2      C2      X23
A3      B3      C3      NA
A4      B4      C4      X41
A4      B4      C4      X42
A4      B4      C4      X43
A5      B5      C5      X51

In unixoid systems, with a large file of such lines, the command
would be

  cat yourdatafile | awk -f split.awk

and then you would only see the output, not the input as shown
above, and you can of course redirect it into a new file with

  cat yourdatafile | awk -f split.awk > newdatafile

Note, however, that the order of the lines output for the third
line of input (the one with X21;X22;X23;X24;X25) is not the same
as the order of the X21;X22;X23;X24;X25 though they are all there.

This is a "feature" of the way 'awk' handles arrays (which are
"associative arrays" indexed by values, not by position).

This may not matter for your application; but if it does matter
then I'm not sure how to force the correct order.

Hoping this helps,
Ted.


--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 16-Aug-05                                       Time: 00:45:49
------------------------------ XFMail ------------------------------