Skip to content

`dendrapply` Enhancements

3 messages · Lakshman, Aidan H, Ivan Krylov

#
Just wanted to give an update on the status of this, since it?s been a couple days and I?ve had a chance to work on it a little more.

Improvements:
- Fixed a few bugs, added some more robust checking to ensure correct checking for leaf nodes
- Corrected references to ?in-order? traversals, I actually meant ?pre-order?
- Added new documentation, including some new examples to the ?Usage? section
- Cleaned up some names/variables/identifiers
- Added some additional code to have function accurately replicate a weird bug of `stats::dendrapply` that is used in CRAN packages. Full details are in my PR (linked below).


I?ve integrated this into the svn mirror at r-devel/r-svn, and put out a PR at https://github.com/r-devel/r-svn/pull/111. Current PR is passing all build checks aside from Windows, which is throwing the error `Sorry, but: Error response from server: 500` while installing Miktex. I?m not sure what?s causing this, but it seems to be something aside from my code because it?s also crashing builds for other PRs.

A link to the diff file is here: https://patch-diff.githubusercontent.com/raw/r-devel/r-svn/pull/111.diff

Happy to open a Bugzilla report as well; this is enough code that a discussion is probably warranted, and Bugzilla may be an easier place to discuss compared to here. Also happy to discuss on the PR itself.

Thank you to everyone that has taken a look at my code, I appreciate people taking the time to read through it.

Sincerely,
Aidan Lakshman

-----------------------
Aidan Lakshman (he/him)<https://www.ahl27.com/>
Doctoral Candidate, Wright Lab<https://www.wrightlabscience.com/>
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
ahl27 at pitt.edu
(724) 612-9940


From: Lakshman, Aidan H <AHL27 at pitt.edu>
Date: Friday, February 24, 2023 at 07:42
To: Toby Hocking <tdhock5 at gmail.com>
Cc: R-devel at r-project.org <R-devel at r-project.org>
Subject: Re: [Rd] `dendrapply` Enhancements
Hi Toby,

Thanks for your reply! I haven?t heard about the R project sprint, but I?ll definitely check it out. UK is going to be a little hard for me to get to funding-wise, but I?ll try to apply for funding.

I appreciate your other comments. As far as coding style, I did do everything I could think of to make sure it?s a drop-in replacement for the current version with the default settings, so all the user-exposed arguments/variables should be identical. I used the conventions in https://github.com/wch/r-source/wiki/Contributing for commenting and whitespace, so hopefully that all looks okay. I?m realizing there may be some differences in tab widths, but I can fix that later today.

-Aidan

-----------------------
Aidan Lakshman (he/him)<https://www.ahl27.com/>
Doctoral Candidate, Wright Lab<https://www.wrightlabscience.com/>
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
ahl27 at pitt.edu
(724) 612-9940


From: Toby Hocking <tdhock5 at gmail.com>
Date: Friday, February 24, 2023 at 06:57
To: Lakshman, Aidan H <AHL27 at pitt.edu>
Cc: R-devel at r-project.org <R-devel at r-project.org>
Subject: Re: [Rd] `dendrapply` Enhancements
Hi Aidan, I think you are on the right email list.
I'm not R-core, but this looks like an interesting/meaningful/significant contribution to base R.
I'm not sure what the original dendrapply looks like in terms of code style (variable names/white space formatting/etc) but in my experience it is important that your code contribution makes minimal changes in that area.
Did you hear about the R project sprint 2023? https://contributor.r-project.org/r-project-sprint-2023/<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcontributor.r-project.org%2Fr-project-sprint-2023%2F&data=05%7C01%7CAHL27%40pitt.edu%7C41b702f139034cd7165608db165e4530%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638128366414606277%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=oB%2BBivUsBjIgBtZNU8mh%2Fz2rujD3bv9MbWdxqNtyUFk%3D&reserved=0> Your work falls into the "new developments" category so I think you could apply for that funding to participate.
Toby
On Fri, Feb 24, 2023 at 3:47 AM Lakshman, Aidan H <AHL27 at pitt.edu<mailto:AHL27 at pitt.edu>> wrote:
Hi everyone,

My apologies if this isn?t the right place to submit this?I?m new to the R-devel community and still figuring out what is where.

If people want to skip my writeup and just look at the code, I?ve made a repository for it here: https://github.com/ahl27/new_dendrapply/tree/master<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fahl27%2Fnew_dendrapply%2Ftree%2Fmaster&data=05%7C01%7CAHL27%40pitt.edu%7C41b702f139034cd7165608db165e4530%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638128366414606277%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=QI%2B5t1C%2BJB15D8o8noZra4W87fgyITm12nluGN%2BFNoE%3D&reserved=0>. I?m not quite sure how to integrate it into a fork of R-devel; the package structure is different from what I?m used to.

I had written a slightly improved version of dendrapply for one of my research projects, and my advisor encouraged me to submit it to the R project. It took me longer than I expected, but I?ve finally gotten my implementation to be a drop-in replacement for `stats::dendrapply`. The man page for `stats::dendrapply` says ?The implementation is somewhat experimental and suggestions for enhancements (or nice examples of usage) are very welcome,? so I figured this had the potential to be a worthwhile contribution. I wanted to send it out to R-devel to see if this was something worth pursuing as an enhancement to R.

The implementation I have is based in C, which I understand implies an increased burden of maintenance over pure R code. However, it does come with the following benefits:

- Completely eliminates recursion, so no memory overhead from function calls or possibility of stack overflows (this was a major issue reported on some of the functions in one of our Bioconductor packages that previously used `dendrapply`).
- Modest runtime improvement, around 2x on my computer (2021 MBP, 32GB RAM). I?m relatively confident this could be optimized more.
- Seemingly significant reduction in memory reduction, still working on a robust benchmark. Suggestions for the best way to do that are welcome.
- Support for applying functions with an inorder traversal (as in `stats::dendrapply`) as well as using a postorder traversal.

This implementation was tested manually as well as running all the unit tests in `dendextend`, which comprises a lot of applications of `dendrapply`.

The postorder traversal would be a significant new functionality to dendrapply, as it would allow for functions that use the child nodes to correctly execute. A toy example of this is something like:
```
exFunc <- function(x){
  attr(x, 'newA') <- 'a'
  if(is.null(attr(x, 'leaf'))){
    cat(attr(x[[1]], 'newA'), attr(x[[2]], 'newA'))
    cat('\n')
  }
  x
})

dendrapply(dend, exFunc)
```

With the current version of dendrapply, this prints nothing, but the postorder traversal version will print ?a? twice for each internal branch. If this would be a worthwhile addition, I can refactor the code for brevity and add a `how=c("in.order", "post.order")`, with the default value ?in.order? to maintain backwards compatibility. A preorder traversal version should also be possible, I just haven?t gotten to it yet.

I think the runtime could be optimized more as well.

Thank you in advance for looking at my code and offering feedback; I?m excited at the possibility of helping contribute to the R project! I?m happy to discuss more either here, on GitHub, or on the R Contributors Slack.

Sincerely,
Aidan Lakshman

-----------------------
Aidan Lakshman (he/him)<https://www.ahl27.com/<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ahl27.com%2F&data=05%7C01%7CAHL27%40pitt.edu%7C41b702f139034cd7165608db165e4530%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638128366414606277%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=7KUpJpdulSIzSXbpDJlyUV8pMJm%2BSVFvDOJTlVs9lhc%3D&reserved=0>>
Doctoral Candidate, Wright Lab<https://www.wrightlabscience.com/<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.wrightlabscience.com%2F&data=05%7C01%7CAHL27%40pitt.edu%7C41b702f139034cd7165608db165e4530%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638128366414606277%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=JMgw%2BMiQ6xdp3OokToJ2nyyco%2BryiFH%2B9ap5iU3yJH8%3D&reserved=0>>
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
ahl27 at pitt.edu<mailto:ahl27 at pitt.edu>
(724) 612-9940



______________________________________________
R-devel at r-project.org<mailto:R-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstat.ethz.ch%2Fmailman%2Flistinfo%2Fr-devel&data=05%7C01%7CAHL27%40pitt.edu%7C41b702f139034cd7165608db165e4530%7C9ef9f489e0a04eeb87cc3a526112fd0d%7C1%7C0%7C638128366414606277%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=I88%2FQhGHXDRS2yHqvh53k3MSWHSd5z2KBgORUHIxfG0%3D&reserved=0>
#
Dear Aidan Lakshman,

To answer your implicit question, VECTOR_ELT() unclasses the nodes
because it doesn't go through the stats:::`[[.dendrogram` method,
instead dereferencing the data pointer directly.

Other *apply functions in base R create a call to the `[[` operator,
letting the language dispatch the generic call, allowing the method to
assign a class to the return value. The following example is taken from
src/main/apply.c:do_lapply():

// prepare a call to FUN(X[[i]], ...)

    SEXP isym = install("i");
    SEXP tmp = PROTECT(lang3(R_Bracket2Symbol, X, isym));
    SEXP R_fcall = PROTECT(lang3(FUN, tmp, R_DotsSymbol));
    MARK_NOT_MUTABLE(R_fcall);

// inside the loop: evaluate the call

        tmp = R_forceAndCall(R_fcall, 1, rho);

Not sure which way is faster, but it may make sense to try, and it's
probably more correct in (contrived) cases where unclass(node)[[i]] is
invalid because it relies on a hypothetical `[[.subclass-of-dendrogram`
to restore some invariants.

Would you mind telling me more about the following case?
If you're looking to improve the performance, there might be a way to
avoid the wrapper and this lapply(unclass(node), identity) call in it.
#
Thanks for your reply!
That?s roughly what I had suspected?I appreciate the clarification.

To your point on other *apply functions, I wasn?t actually aware of that implementation, but it?s definitely a smarter way to do it. I?ll try later today/tomorrow to incorporate that method; it seems much better and more future-proof than my approach. Definitely agree with you with respect to cases where unclass(node)[[i]] is invalid. It may be slightly slower due to having to rely on R method dispatch, but I think the benefits outweigh the drawbacks in this case.
This was a product of trying to get performance to be the same as in the current method?I agree that it?s probably not the best way to do this. The use-case is when you apply a function to the dendrogram that doesn?t return a dendrogram object. One example is the one from reg-tests-1c.R:

```
D <- as.dendrogram(hclust(dist(cbind(setNames(c(0,1,4), LETTERS[1:3])))))

dendrapply(D, labels))



# Expected result:

#
# [[1]]

# ?C?
#
# [[2]]
# [[2]][[1]]
# ?A?

#

# [[2]][[2]]

# ?B?

#

# [[3]]

# ?C?
```

Applying labels to the root node returns c(?C?, ?A?, ?B?), and if we convert that to a list, we get a length 3 list of length 1 character vectors. However, when traversing the dendrogram pre-order, this would break things, since then the first entry of the node is no longer a dendrogram object, it?s been replaced by a character vector. I had written it this way with the unclass so that I could replace entries that needed to be evaluated at child nodes with child nodes. For example, in this instance, after evaluating the function at the root, the tree would look like:

```
[[1]]
<unclassed D[[1]]>

[[2]]
<unclassed D[[2]]>

[[3]]
?B?
```

To answer the question on why there?s an lapply(?, identity) call, I think I ended up doing it this way because I was having some issues with not getting the elements to populate correctly from the dendrogram. Looking back on it now, there?s definitely an easier way to do this that isn?t so hard to understand code-wise?.
```
if(!is.leaf(node)){
      if(!is.list(res)){
        res <- as.list(res)
      }
      res[seq_along(node)] <- node
    }
```
That should perform almost identically and make more sense, with the added benefit that it doesn?t unclass the child nodes, so (when I also incorporate the other fix you suggested) we shouldn?t have any unexpected performance from functions relying on a hypothetical `subclass-of-dendrogram`. This implementation is also slightly faster due to no lapply call and is.list() over inherits(?).

Result after applying to root node with this approach:
```
[[1]]
D[[1]]

[[2]]
D[[2]]

[[3]]
?B?
```
Classes of D[[1]] and D[[2]] are preserved for future evaluations.

Thanks for pointing this out, I?ll incorporate this into the code when I check the `[[` case later. If you have any other questions/comments/suggestions I would love to hear them! Happy to clarify further as well if I didn?t answer your questions fully.

Sincerely,
Aidan

-----------------------
Aidan Lakshman (he/him)<https://www.ahl27.com/>
Doctoral Candidate, Wright Lab<https://www.wrightlabscience.com/>
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
ahl27 at pitt.edu
(724) 612-9940


From: Ivan Krylov <krylov.r00t at gmail.com>
Date: Thursday, March 2, 2023 at 09:47
To: Lakshman, Aidan H <AHL27 at pitt.edu>
Cc: R-devel at r-project.org <R-devel at r-project.org>
Subject: Re: [Rd] `dendrapply` Enhancements
Dear Aidan Lakshman,

To answer your implicit question, VECTOR_ELT() unclasses the nodes
because it doesn't go through the stats:::`[[.dendrogram` method,
instead dereferencing the data pointer directly.

Other *apply functions in base R create a call to the `[[` operator,
letting the language dispatch the generic call, allowing the method to
assign a class to the return value. The following example is taken from
src/main/apply.c:do_lapply():

// prepare a call to FUN(X[[i]], ...)

    SEXP isym = install("i");
    SEXP tmp = PROTECT(lang3(R_Bracket2Symbol, X, isym));
    SEXP R_fcall = PROTECT(lang3(FUN, tmp, R_DotsSymbol));
    MARK_NOT_MUTABLE(R_fcall);

// inside the loop: evaluate the call

        tmp = R_forceAndCall(R_fcall, 1, rho);

Not sure which way is faster, but it may make sense to try, and it's
probably more correct in (contrived) cases where unclass(node)[[i]] is
invalid because it relies on a hypothetical `[[.subclass-of-dendrogram`
to restore some invariants.

Would you mind telling me more about the following case?
If you're looking to improve the performance, there might be a way to
avoid the wrapper and this lapply(unclass(node), identity) call in it.

--
Best regards,
Ivan