how to convert a data.frame to tree structure object such as dendrogram - R-help

Wed, Mar 13, 2013 1:12 PM

Here is a simpler, less clumsy version of my previous recursive R
solution that I sent you privately, which I'll also cc to the list
this time. It's now almost a one-liner.

To avoid problems with unused factor levels, I still prefer the data
frame columns to be character vectors rather than factors, so:

df <- data.frame(a=c('A','A', 'A', 'B','B','C','C','C'), b=c('Aa',
'Ab','Ab','Ba','Bd', 'C1','C2','C3'), c=c('Aa1', 'Ab1', 'Ab2', 'Ba1',
'Bd2', 'C11','C12','C13'), stringsAsFactors=FALSE)

makeTree2 <-function(x, i,n)
{
  if(i==n)df[x,i]
  else {
    spl <- split(x,df[x,i])
    lapply(spl,function(x)makeTree2(x,i+1,n))   ##Can't use Recall()
  }
}

This is now called as

## yielding (with the root implicit now)

$A
$A$Aa
[1] "Aa1"

$A$Ab
[1] "Ab1" "Ab2"


$B
$B$Ba
[1] "Ba1"

$B$Bd
[1] "Bd2"


$C
$C$C1
[1] "C11"

$C$C2
[1] "C12"

$C$C3
[1] "C13"
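
The call line itself seems to have been lost from this copy of the message. A call consistent with the function's arguments and with the output shown above would be something like the following, assuming x is the vector of row indices still to be split, i is the column to start from, and n is the last column. This is a reconstruction, not necessarily the exact line that was sent.

```r
# Same example data as above, kept as character vectors rather than factors
df <- data.frame(
  a = c("A", "A", "A", "B", "B", "C", "C", "C"),
  b = c("Aa", "Ab", "Ab", "Ba", "Bd", "C1", "C2", "C3"),
  c = c("Aa1", "Ab1", "Ab2", "Ba1", "Bd2", "C11", "C12", "C13"),
  stringsAsFactors = FALSE
)

# makeTree2 as posted: x = row indices, i = current column, n = last column
makeTree2 <- function(x, i, n) {
  if (i == n) df[x, i]                     # leaf level: return the values themselves
  else {
    spl <- split(x, df[x, i])              # group row indices by the value in column i
    lapply(spl, function(x) makeTree2(x, i + 1, n))
  }
}

# Assumed call: start with all rows at column 1 and stop at the last column
tree <- makeTree2(seq_len(nrow(df)), 1, ncol(df))
str(tree)
```

Note that makeTree2 looks up df in the enclosing environment, so the data frame has to exist under that name before the call.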

On Wed, Mar 13, 2013 at 10:25 AM, Not To Miss <not.to.miss at gmail.com> wrote:

The ideal solution, I think, is probably recursive. At the last minute I
decided to write a Python script to do this (using Python instead of Perl or
R because of Python's mutable dict data structure), although I would have
preferred to keep all my code in one R piece. I'm posting the code here just
in case you are interested. It generates a dict of dicts of dicts ...

Hopefully I won't get beaten up for posting Python code on an R mailing
list. :-)

    import sys

    tree = {}
    ## input file is a table with TAB-delimited columns
    with open(sys.argv[1]) as infile:
        for line in infile:
            if line.startswith('#'):
                continue  # skip comment lines
            items = line.strip().split('\t')
            tmp = tree  # start again at the root for each row
            for item in items:
                if item not in tmp:
                    tmp[item] = {}
                tmp = tmp[item]  # descend one level

The tree looks like this for the example:
{'A': {'Aa': {'Aa1': {}}, 'Ab': {'Ab1': {}, 'Ab2': {}}}, 'C': {'C3': {'C13':
{}}, 'C2': {'C12': {}}, 'C1': {'C11': {}}}, 'B': {'Bd': {'Bd2': {}}, 'Ba':
{'Ba1': {}}}}

On Wed, Mar 13, 2013 at 10:35 AM, David Winsemius <dwinsemius at comcast.net>
wrote:


On Mar 12, 2013, at 9:22 PM, Not To Miss wrote:

Nope, Bert, you miss me? :-D

I apologize that I didn't provide a more realistic example or describe
the problem more clearly. The real data are just too complicated to post in
an email, so I made up a simple example, which perhaps seems a little
oversimplified now, but the basic structure is the same. Here is a more
appropriate one:

data.frame(a=c('A','A', 'A', 'B','B','C','C','C'), b=c('Aa',
'Ab','Ab','Ba','Bd', 'C1','C2','C3'), c=c('Aa1', 'Ab1', 'Ab2', 'Ba1', 'Bd2',
'C11','C12','C13'))

  a  b   c
1 A Aa Aa1
2 A Ab Ab1
3 A Ab Ab2
4 B Ba Ba1
5 B Bd Bd2
6 C C1 C11
7 C C2 C12
8 C C3 C13

The data structure to convert to:
     |---Aa------Aa1
 A---|        /--Ab1
 |   |---Ab--|
 |            \--Ab2
 |   |---Ba------Ba1
 B---|
 |   |---Bd------Bd2
 |
 |    /---C1-----C11
 C---|----C2-----C12
      \---C3-----C13

It's multi-level nested, and I won't know the number of rows and columns of
the data.frame ahead of time. I plan to write a Perl script to do the
conversion, since I'm more familiar with Perl, if it's not easy to do in R.
Thanks Don and Greg for suggesting solutions.


After a bit of coding I am going to say your proposed answer is wrong (or
at least improperly specified). The first level can be recovered as you
suggest:

sapply(unique(dfrm[[1]]), function(x) dfrm[[2]][grep(x, dfrm[[2]]) ])

$A
[1] "Aa" "Ab" "Ab"

$B
[1] "Ba" "Bd"

$C
[1] "C1" "C2" "C3"


But the second level cannot be as you imagined. The third level items
beginning with "C1" all get associated together and there are no terminal
nodes for C2 or C3 at the third level.

sapply(unique(dfrm[[2]]), function(x) dfrm[[3]][grep(x, dfrm[[3]]) ])

$Aa
[1] "Aa1"

$Ab
[1] "Ab1" "Ab2"

$Ba
[1] "Ba1"

$Bd
[1] "Bd2"

$C1
[1] "C11" "C12" "C13"

$C2
character(0)

$C3
character(0)
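
For readers following along: the grouping above comes from grep() treating "C1" as a pattern, so it matches every third-level string that contains "C1", while "C2" and "C3" match nothing. The sketch below contrasts that with exact matching on the parent column. It assumes dfrm is the three-column data frame from the quoted message, built with stringsAsFactors=FALSE, and lev2 is just an illustrative name.

```r
# Assumed definition of dfrm: the example data frame from the quoted message
dfrm <- data.frame(
  a = c("A", "A", "A", "B", "B", "C", "C", "C"),
  b = c("Aa", "Ab", "Ab", "Ba", "Bd", "C1", "C2", "C3"),
  c = c("Aa1", "Ab1", "Ab2", "Ba1", "Bd2", "C11", "C12", "C13"),
  stringsAsFactors = FALSE
)

# Pattern matching: every third-level string containing "C1" is returned
by_pattern <- grep("C1", dfrm[[3]], value = TRUE)

# Exact matching on the parent column: only rows whose second-level value is "C1"
by_parent <- dfrm[[3]][dfrm[[2]] == "C1"]

# split() applies the same exact grouping to every second-level value at once
lev2 <- split(dfrm[[3]], dfrm[[2]])
str(lev2)
```

Grouping by the parent column depends only on the rows of the data frame, not on how the labels happen to be spelled.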

lev1 <- sapply(unique(dfrm[[1]]), function(x) dfrm[[2]][grep(x, dfrm[[2]])])
lapply(lev1, function(ll) lapply(ll, function(lll) dfrm[[3]][grep(lll, dfrm[[3]])]))

$A
$A[[1]]
[1] "Aa1"

$A[[2]]
[1] "Ab1" "Ab2"

$A[[3]]
[1] "Ab1" "Ab2"


$B
$B[[1]]
[1] "Ba1"

$B[[2]]
[1] "Bd2"


$C
$C[[1]]
[1] "C11" "C12" "C13"

$C[[2]]
character(0)

$C[[3]]
character(0)

--
David.



On Tue, Mar 12, 2013 at 2:18 PM, Bert Gunter <gunter.berton at gene.com>
wrote:

So Mr. "not.tomiss" missed?

:(

-- Bert

On Tue, Mar 12, 2013 at 1:08 PM, David Winsemius <dwinsemius at comcast.net>
wrote:

On Mar 12, 2013, at 9:37 AM, Not To Miss wrote:

Thanks. Is there any more elegant solution? What if I don't know how many
levels of nesting ahead of time?

It's even worse than what you now offer as a potential complication.
You did not provide an example of a data object that would illustrate the
complexity of the task, nor what you consider the correct procedure (i.e. the
order of the columns to be used for splitting), nor the correct results. The
task is woefully underspecified at the moment. It's a bit akin to asking
"how do I do classification" without saying what you want to classify.

--
David.


On Tue, Mar 12, 2013 at 8:51 AM, Greg Snow <538280 at gmail.com> wrote:

You can use the lapply or rapply functions on the resulting list to break
each piece into a list itself, then apply the lapply or rapply function to
those resulting lists, ...
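
A minimal sketch of that idea for two levels of nesting, using the three-column example posted elsewhere in this thread. The names foo3, lev1 and lev2 are only illustrative, and this is one possible reading of the suggestion rather than the exact code that was meant.

```r
# Three-column example data, kept as character vectors
foo3 <- data.frame(
  a = c("A", "A", "A", "B", "B", "C", "C", "C"),
  b = c("Aa", "Ab", "Ab", "Ba", "Bd", "C1", "C2", "C3"),
  c = c("Aa1", "Ab1", "Ab2", "Ba1", "Bd2", "C11", "C12", "C13"),
  stringsAsFactors = FALSE
)

# First level: one sub-data-frame per value of column a
lev1 <- split(foo3, foo3$a)

# Second level: within each piece, split column c by column b
lev2 <- lapply(lev1, function(d) split(d$c, d$b))
str(lev2)
```

Each extra column needs another layer of lapply(), which is why an unknown depth points towards a recursive function.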


On Mon, Mar 11, 2013 at 3:41 PM, Not To Miss <not.to.miss at gmail.com> wrote:

Thanks. That's just a simple example - what if there are more columns and
more rows? Is there any easy way to create a nested list?

Best,
Zech


On Mon, Mar 11, 2013 at 2:12 PM, MacQueen, Don <macqueen1 at llnl.gov>
wrote:

You will have to decide what R data structure is a "tree structure". But
maybe this will get you started:

foo <- data.frame(x=c('A','A','B','B'), y=c('Ab','Ac','Ba','Bd'))
split(foo$y, foo$x)

$A
[1] "Ab" "Ac"

$B
[1] "Ba" "Bd"

I suppose it is at least a little bit tree-like.


--
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062





On 3/10/13 9:19 PM, "Not To Miss" <not.to.miss at gmail.com> wrote:

I have a data.frame object like:

data.frame(x=c('A','A','B','B'), y=c('Ab','Ac','Ba','Bd'))

  x  y
1 A Ab
2 A Ac
3 B Ba
4 B Bd

how could I create a tree structure object like this:
     |---Ab
 A---|
_|   |---Ac
 |
 |   |---Ba
 B---|
     |---Bd

Thanks,
Zech


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Bert Gunter
Genentech Nonclinical Biostatistics

Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm