Here is a simpler, less clumsy version of my previous recursive R
solution that I sent you privately, which I'll also cc to the list
this time. It's now almost a one-liner.
To avoid problems with unused factor levels, I still prefer to have
character vectors, not factors, as the data frame columns, so:
df <- data.frame(a=c('A','A', 'A', 'B','B','C','C','C'), b=c('Aa',
'Ab','Ab','Ba','Bd', 'C1','C2','C3'), c=c('Aa1', 'Ab1', 'Ab2', 'Ba1',
'Bd2', 'C11','C12','C13'), stringsAsFactors=FALSE)
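As a small illustration of the unused-level issue (the vector f and the other names below are made up for this note and are not part of the solution): subsetting a factor keeps all of its levels, so split() returns an empty group for each level that no longer occurs, whereas a character vector only produces groups for values that are present.

```r
# Toy factor with two levels; the subset below only contains "A".
f <- factor(c("A", "A", "B"))
rows <- 1:2

by_factor <- split(rows, f[rows])                # keeps an empty "B" group
by_char   <- split(rows, as.character(f[rows]))  # only "A"
by_drop   <- split(rows, f[rows], drop = TRUE)   # drop = TRUE discards unused levels

str(by_factor)
str(by_char)
str(by_drop)
```

In a recursive split, each such empty group would turn into an empty branch of the tree, which is presumably what the preference for character columns avoids.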
makeTree2 <- function(x, i, n)
{
    if (i == n) df[x, i]
    else {
        spl <- split(x, df[x, i])
        lapply(spl, function(x) makeTree2(x, i + 1, n)) ## Can't use Recall()
    }
}
This is now called as
makeTree2(seq_len(nrow(df)),1,ncol(df)) ## no list structure needed for x
## yielding (with the root implicit now)
$A
$A$Aa
[1] "Aa1"

$A$Ab
[1] "Ab1" "Ab2"


$B
$B$Ba
[1] "Ba1"

$B$Bd
[1] "Bd2"


$C
$C$C1
[1] "C11"

$C$C2
[1] "C12"

$C$C3
[1] "C13"
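Note that makeTree2 looks up df in the environment where it was defined, so it only works for a data frame with that name. A minimal sketch of the same recursion with the data frame passed in explicitly might look like this (make_tree, data, rows and level are names chosen here for illustration, not part of the solution above):

```r
# Hypothetical variant of makeTree2 that takes the data frame as an argument.
make_tree <- function(data, rows = seq_len(nrow(data)), level = 1) {
  values <- data[[level]][rows]             # entries of the current column
  if (level == ncol(data)) return(values)   # last column: these are the leaves
  # Group the row indices by the current column and recurse on each group.
  lapply(split(rows, values), function(r) make_tree(data, r, level + 1))
}

df <- data.frame(a = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
                 b = c('Aa', 'Ab', 'Ab', 'Ba', 'Bd', 'C1', 'C2', 'C3'),
                 c = c('Aa1', 'Ab1', 'Ab2', 'Ba1', 'Bd2', 'C11', 'C12', 'C13'),
                 stringsAsFactors = FALSE)

tree <- make_tree(df)
str(tree)
```

The default arguments mean the caller only supplies the data frame; the number of columns is read from the data at each step rather than passed as n.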
On Wed, Mar 13, 2013 at 10:25 AM, Not To Miss <not.to.miss at gmail.com> wrote:
The ideal solution, I think, is probably recursive. At the last minute I
decided to write a Python script to do this (using Python instead of Perl or
R because of Python's mutable dict data structure), although I would have
preferred to keep all my code in one R piece. I post the code here just in
case you are interested. It generates a dict of dict of dict ...
Hopefully I won't get beaten up for posting Python code on an R mailing
list. :-)
import sys

tree = {}
## input file is a table with TAB-delimited columns
with open(sys.argv[1]) as infile:
    for line in infile:
        if line.startswith('#'):
            continue
        items = line.strip().split('\t')
        tmp = tree
        for item in items:
            if item not in tmp:
                tmp[item] = {}
            tmp = tmp[item]
The tree looks like this for the example:
{'A': {'Aa': {'Aa1': {}}, 'Ab': {'Ab1': {}, 'Ab2': {}}}, 'C': {'C3': {'C13':
{}}, 'C2': {'C12': {}}, 'C1': {'C11': {}}}, 'B': {'Bd': {'Bd2': {}}, 'Ba':
{'Ba1': {}}}}
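For comparison, the same row-by-row insertion can be sketched in R with nested named lists. This is only an illustration under stated assumptions: insert_path is a hypothetical helper, not something from the thread, and because R lists are copied on modification the function returns an updated tree instead of changing it in place.

```r
# Hypothetical helper: return a copy of `tree` with the chain of keys in
# `path` inserted, creating empty nodes where they do not exist yet.
insert_path <- function(tree, path) {
  if (length(path) == 0) return(tree)
  key <- path[1]
  child <- tree[[key]]                  # NULL if this key is not present yet
  if (is.null(child)) child <- list()
  tree[[key]] <- insert_path(child, path[-1])
  tree
}

df <- data.frame(a = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
                 b = c('Aa', 'Ab', 'Ab', 'Ba', 'Bd', 'C1', 'C2', 'C3'),
                 c = c('Aa1', 'Ab1', 'Ab2', 'Ba1', 'Bd2', 'C11', 'C12', 'C13'),
                 stringsAsFactors = FALSE)

tree <- list()
for (r in seq_len(nrow(df))) {
  # One row of the table is one root-to-leaf path.
  tree <- insert_path(tree, unlist(df[r, ], use.names = FALSE))
}
str(tree)
```

As in the dict version, leaves are empty containers (here list()), and the depth follows the number of columns without being specified in advance.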
On Wed, Mar 13, 2013 at 10:35 AM, David Winsemius <dwinsemius at comcast.net>
wrote:
On Mar 12, 2013, at 9:22 PM, Not To Miss wrote:
Nope, Bert, you miss me? :-D
I apologize that I didn't provide a more realistic example and describe the problem more clearly. The real data are just too complicated to post in emails, so I made up a simple example, which perhaps seems a little oversimplified now, but the basic structure is the same. Here is a more appropriate one:
data.frame(a=c('A','A', 'A', 'B','B','C','C','C'), b=c('Aa',
'Ab','Ab','Ba','Bd', 'C1','C2','C3'), c=c('Aa1', 'Ab1', 'Ab2', 'Ba1', 'Bd2',
'C11','C12','C13'))
  a  b   c
1 A Aa Aa1
2 A Ab Ab1
3 A Ab Ab2
4 B Ba Ba1
5 B Bd Bd2
6 C C1 C11
7 C C2 C12
8 C C3 C13
The data structure to convert to:
      |---Aa------Aa1
A-----|           /--Ab1
|     |---Ab------|
|                 \--Ab2
|
|     |---Ba------Ba1
B-----|
|     |---Bd------Bd2
|
|     /---C1------C11
C-----|---C2------C12
      \---C3------C13
It's multi-level nested, and I won't know the number of rows and columns of
the data.frame ahead of time. If it's not easy to do in R, I plan to write a
Perl script to do the conversion, since I'm more familiar with Perl. Thanks
Don and Greg for suggesting solutions.
After a bit of coding I am going to say your proposed answer is wrong (or
at least improperly specified). The first level can be recovered as you
suggest:
sapply(unique(dfrm[[1]]), function(x) dfrm[[2]][grep(x, dfrm[[2]]) ])
$A
[1] "Aa" "Ab" "Ab"

$B
[1] "Ba" "Bd"

$C
[1] "C1" "C2" "C3"

But the second level cannot be as you imagined. The third level items beginning with "C1" all get associated together and there are no terminal nodes for C2 or C3 at the third level.
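The underlying reason is that grep() treats each name as a pattern and matches it anywhere in the string, so "C1" also matches "C12" and "C13", while "C2" and "C3" match nothing in the third column. A short sketch, assuming dfrm is the example data frame with character columns (hits_c1 and by_row are names made up here), contrasts this with grouping by row via split(), which does not rely on how the labels are spelled:

```r
dfrm <- data.frame(a = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
                   b = c('Aa', 'Ab', 'Ab', 'Ba', 'Bd', 'C1', 'C2', 'C3'),
                   c = c('Aa1', 'Ab1', 'Ab2', 'Ba1', 'Bd2', 'C11', 'C12', 'C13'),
                   stringsAsFactors = FALSE)

# Pattern matching: "C1" is a substring of all three C leaves.
hits_c1 <- dfrm[[3]][grepl("C1", dfrm[[3]])]
print(hits_c1)
print(any(grepl("C2", dfrm[[3]])))

# Grouping by row: each third-column entry goes with the second-column
# entry that appears in the same row.
by_row <- split(dfrm[[3]], dfrm[[2]])
str(by_row)
```

Whether row-wise grouping is what the original poster intended is an assumption; it is simply the reading under which C2 and C3 each get their own terminal node.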
sapply(unique(dfrm[[2]]), function(x) dfrm[[3]][grep(x, dfrm[[3]]) ])
$Aa
[1] "Aa1"

$Ab
[1] "Ab1" "Ab2"

$Ba
[1] "Ba1"

$Bd
[1] "Bd2"

$C1
[1] "C11" "C12" "C13"

$C2
character(0)

$C3
character(0)

lev1 <- sapply(unique(dfrm[[1]]), function(x) dfrm[[2]][grep(x, dfrm[[2]]) ])
lapply(lev1, function(ll) lapply(ll, function(lll) dfrm[[3]][grep(lll, dfrm[[3]]) ]) )
$A
$A[[1]]
[1] "Aa1"

$A[[2]]
[1] "Ab1" "Ab2"

$A[[3]]
[1] "Ab1" "Ab2"


$B
$B[[1]]
[1] "Ba1"

$B[[2]]
[1] "Bd2"


$C
$C[[1]]
[1] "C11" "C12" "C13"

$C[[2]]
character(0)

$C[[3]]
character(0)

-- David.

On Tue, Mar 12, 2013 at 2:18 PM, Bert Gunter <gunter.berton at gene.com> wrote:
So Mr. "not.tomiss" missed? :(
-- Bert

On Tue, Mar 12, 2013 at 1:08 PM, David Winsemius <dwinsemius at comcast.net> wrote:
On Mar 12, 2013, at 9:37 AM, Not To Miss wrote:
Thanks. Is there any more elegant solution? What if I don't know how many levels of nesting ahead of time?
It's even worse than what you now offer as a potential complication. You did not provide an example of a data object that would illustrate the complexity of the task, nor what you consider the correct procedure (i.e. the order of the columns to be used for splitting), nor the correct results. The task is woefully underspecified at the moment. It's a bit akin to asking "how do I do classification" without saying what you want to classify.
-- David.
On Tue, Mar 12, 2013 at 8:51 AM, Greg Snow <538280 at gmail.com> wrote:
You can use the lapply or rapply functions on the resulting list to break each piece into a list itself, then apply the lapply or rapply function to those resulting lists, ...

On Mon, Mar 11, 2013 at 3:41 PM, Not To Miss <not.to.miss at gmail.com> wrote:
Thanks. That's just a simple example - what if there are more columns and more rows? Is there any easy way to create a nested list?
Best,
Zech

On Mon, Mar 11, 2013 at 2:12 PM, MacQueen, Don <macqueen1 at llnl.gov> wrote:
You will have to decide what R data structure is a "tree structure". But maybe this will get you started:
foo <- data.frame(x=c('A','A','B','B'), y=c('Ab','Ac','Ba','Bd'))
split(foo$y, foo$x)
$A
[1] "Ab" "Ac"

$B
[1] "Ba" "Bd"

I suppose it is at least a little bit tree-like.

-- Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062

On 3/10/13 9:19 PM, "Not To Miss" <not.to.miss at gmail.com> wrote:
I have a data.frame object like:
data.frame(x=c('A','A','B','B'), y=c('Ab','Ac','Ba','Bd'))
  x  y
1 A Ab
2 A Ac
3 B Ba
4 B Bd
how could I create a tree structure object like this:
       |---Ab
  A---|
_|     |---Ac
 |
 |     |---Ba
  B---|
       |---Bd
Thanks,
Zech
[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-- Gregory (Greg) L. Snow Ph.D. 538280 at gmail.com
David Winsemius Alameda, CA, USA
-- Bert Gunter Genentech Nonclinical Biostatistics Internal Contact Info: Phone: 467-7374 Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm