Digest package - make digest generic?

7 messages · Roger D. Peng, Hadley Wickham, Henrik Bengtsson +1 more

#
Sorry, I forgot the 'reply-all'.

-roger

---------- Forwarded message ----------
From: Roger Peng <rdpeng at gmail.com>
Date: Oct 16, 2007 8:24 AM
Subject: Re: [Rd] Digest package - make digest generic?
To: Henrik Bengtsson <hb at stat.berkeley.edu>


Would it be possible to instead create a function with a name like
'digest0' which is the current function, and then create a generic
function with the name 'digest'?  In this case 'digest0' always
returns the digest of the "raw" object.

My one concern is that my current expectation is that 'digest' takes
an object and hashes the entire object, regardless of class.  So if
two objects are different (even in their internal representation),
they should return different digests.  I would be a little worried if
'digest' had a different (and perhaps unpredictable) behavior
depending on the class of the object where two objects that were in
fact different could lead to the same digest.

I can see why one might want class-specific behavior, but what a class
author wants from 'digest' may not be the same as what other users
of 'digest' on that object want.

A simple approach might be

digest0 <- function(x, ...) digest(unclass(x), ...)

although I don't think this works for S4 objects.

-roger
On 10/15/07, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
--
Roger D. Peng  |  http://www.biostat.jhsph.edu/~rpeng/
#
Hi Roger,
On 16 October 2007 at 08:25, Roger Peng wrote:
| Sorry, I forgot the 'reply-all'.
| 
| -roger
| 
| ---------- Forwarded message ----------
| From: Roger Peng <rdpeng at gmail.com>
| Date: Oct 16, 2007 8:24 AM
| Subject: Re: [Rd] Digest package - make digest generic?
| To: Henrik Bengtsson <hb at stat.berkeley.edu>
| 
| 
| Would it be possible to instead create a function with a name like
| 'digest0' which is the current function, and then create a generic
| function with the name 'digest'?  In this case 'digest0' always
| returns the digest of the "raw" object.
| 
| My one concern is that my current expectation is that 'digest' takes
| an object and hashes the entire object, regardless of class.  So if
| two objects are different (even in their internal representation),
| they should return different digests.  I would be a little worried if
| 'digest' had a different (and perhaps unpredictable) behavior
| depending on the class of the object where two objects that were in
| fact different could lead to the same digest.

But haven't the cryptographers taken care of that argument?  

To my layman's understanding, the consensus is that hash collisions are
possible but very, very unlikely. And we already have that problem with digest
as it stands: if collisions are possible, identical hashes could result
from two different inputs whether or not digest is generic.

Or am I missing what you were trying to get at?
 
| I can see why one might want class-specific behavior, but what a class
| author wants from 'digest' may not be the same as what other users
| of 'digest' on that object want.
| 
| A simple approach might be
| 
| digest0 <- function(x, ...) digest(unclass(x), ...)

Or, just for argument's sake, we go full circle: digest stays as it is and
Hadley implements his own generic, say, 'Digest()', around digest?  Naa....

I think I like the idea of making it generic, but I really would like to
know more about possible downsides.

Dirk
 
| although I don't think this works for S4 objects.
| 
| -roger
|
| On 10/15/07, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
| > On 10/15/07, hadley wickham <h.wickham at gmail.com> wrote:
| > > On 10/15/07, Henrik Bengtsson <hb at maths.lth.se> wrote:
| > > > [As agreed, CC:ing r-devel since others might be interested in this as well.]
| > > >
| > > > Hi.
| > > >
| > > > On 10/15/07, Dirk Eddelbuettel <edd at debian.org> wrote:
| > > > >
| > > > > Hi Hadley,
| > > > >
| > > > > On 15 October 2007 at 09:51, hadley wickham wrote:
| > > > > | Would you consider making digest a generic function?  That way I could
| > > > > | (e.g.) make a generic method for ggplot objects which didn't depend
| > > > > | (so much) on their internal representation.
| > > > >
| > > > > Well, generally speaking, I always take patches :)
| > > >
| > > > I see no problems in doing this.  The patch would be:
| > > >
| > > > digest <- function(...) UseMethod("digest");
| > > > digest.default <- <current digest function>.
| > > >
| > > > I think that should do, and I don't think it has any surprising side
| > > > effects so it could be added in the next release.  Dirk, can you do
| > > > that?
| > > >
| > > > >
| > > > > I have to admit that I am fairly weak on these aspects of the S language.
| > > > > One question is:  how do the current users of digest (i.e. Henrik's and
| > > > > Seth's caching mechanism, for example) use it on arbitrary objects _without_
| > > > > it being generic?
| > > >
| > > > I basically put everything I want into a list() and pass that to
| > > > digest::digest().
| > >
| > > Yes, that's what I'm doing too.
| > >
| > > > >
| > > > > | The reason I ask is that I'm using digest as a way of coming up with a
| > > > > | unique file name for each example graphic.  I want to be able to
| > > > > | easily compare the appearance of examples between versions, but
| > > > > | currently the digest depends on internal details, so it's hard to
| > > > > | match up graphics between versions.
| > > >
| > > > See loadCache(key) and saveCache(object, key) in R.cache, which
| > > > basically loads and saves results from and to a file cache based on a
| > > > key object - no need to specify paths or filenames.  You can specify
| > > > paths etc if you want to, but by default it is just transparent.
| > >
| > > The problem is I need to refer to the image from the documentation, so
| > > I do need to know its path.  I also want to be able to look at the
| > > image, so if the digests are different I can see what the difference
| > > is (I'm planning to automate this with the imagemagick compare command
| > > line tool).
| >
| > See ?findCache.  That will give you the pathname given a key.  It is
| > on purpose that I do not list this function in the HTML help index - I
| > want to keep the "public" API to a minimum.
| >
| > /Henrik
| >
| > >
| > > > However, I think Hadley is referring to a different problem.
| > > > Basically, he has an object containing a lot of fields, but for his
| > > > purposes it is only a subset of the fields that he wants to use to
| > > > generate a consistent hashcode.  If he passes any other field, that
| > >
| > > Yes, exactly.
| > >
| > > > will break the consistency.  In that case, the designer of the class
| > > > has to identify the fields that uniquely identify the state of
| > > > the object.  I do that for many of my objects and pass them down in a
| > > > list() structure to digest().  I agree, by making digest() generic,
| > > > one can make the code nicer.  [If there is a need to dispatch on
| > > > multiple arguments, we have to go for S4, but otherwise S3 gives the
| > > > minimal modification].
| > > >
| > > > Side comment: This basically comes down to how, for instance, Java deals
| > > > with hashCode() and equals() etc.  By default the object itself is used to
| > > > generate the hashcode (and can be used by equals() to compare objects).
| > >
| > > Yes, that's the model I was thinking of too.
| > >
| > > Hadley
| > >
| > > --
| > > http://had.co.nz/
| > >
| > > ______________________________________________
| > > R-devel at r-project.org mailing list
| > > https://stat.ethz.ch/mailman/listinfo/r-devel
| > >
| >
| > ______________________________________________
| > R-devel at r-project.org mailing list
| > https://stat.ethz.ch/mailman/listinfo/r-devel
| >
| 
| 
| --
| Roger D. Peng  |  http://www.biostat.jhsph.edu/~rpeng/
| 
| ______________________________________________
| R-devel at r-project.org mailing list
| https://stat.ethz.ch/mailman/listinfo/r-devel
#
My understanding was that Hadley wanted 'digest' to operate on part of
an object rather than on the entire object, which might contain
uninteresting or irrelevant details.  For example, if we had

a <- structure(list(x = 1, y = 2), class = "foo")
b <- structure(list(x = 2342342, y = 2), class = "foo")

digest.foo <- function(object, ...) digest(object$y)

Then 'digest(a)' and 'digest(b)' would return the same value in this
case, even though 'a' and 'b' are different objects.  I can see why
someone *might* want digest to return the same hash for 'a' and 'b'
but I would personally find this behavior a little surprising.
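
A minimal runnable sketch of the scenario above, assuming the digest package is installed. Since digest::digest() is treated here as an ordinary (non-generic) function, the sketch wraps it in a hypothetical S3 generic named hashOf, which is not part of the digest package and is used only for illustration.

```r
# Hypothetical generic (not part of digest); the default hashes the whole object.
hashOf <- function(object, ...) UseMethod("hashOf")
hashOf.default <- function(object, ...) digest::digest(object, ...)

# Class-specific method: only field 'y' contributes to the hash.
hashOf.foo <- function(object, ...) digest::digest(object$y, ...)

a <- structure(list(x = 1, y = 2), class = "foo")
b <- structure(list(x = 2342342, y = 2), class = "foo")

# The class-specific method ignores 'x', so the two hashes agree.
same_via_method <- identical(hashOf(a), hashOf(b))

# Hashing the full objects directly takes 'x' into account.
same_raw <- identical(digest::digest(a), digest::digest(b))
```

Here same_via_method should be TRUE while same_raw should be FALSE, which is the difference in expectations being discussed.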

-roger
On 10/16/07, Dirk Eddelbuettel <edd at debian.org> wrote:
#
On 10/16/07, Roger Peng <rdpeng at gmail.com> wrote:
Yes, that's exactly what I want, except in my case my objects contain
about 20 or 30 bits of information that are irrelevant (in my case,
documentation about the class and other functions), so it would be
surprising if p1 and p2 which produced identical plots gave different
digests.

If you want the default behaviour, you could always call
digest.default to digest the entire object.

Hadley
#
Calling 'digest.default' directly would not be possible if the method
were hidden in a namespace (without resorting to some maneuvering).
To force the default method I think you'd need to 'unclass' the
object.

I'm not against making 'digest' generic, but I'd prefer it if there
were a guaranteed way to compute the digest of the "raw"/full object
without having to wonder about class-specific behavior.  Something
like:

digest0 <- [[the current 'digest' function]]
digest <- function(object, ...) UseMethod("digest")
digest.default <- function(object, ...) digest0(object, ...)

As I think we've seen in this discussion already, what is surprising
to one person may not be surprising to another (and vice versa) so
having something like 'digest0' which is consistent across all R
objects would be useful.
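
As a rough sketch of how that layout could behave, assuming the digest package is installed: digest0 below is simply bound to digest::digest, and the digest.foo method is a made-up example of class-specific behavior, not anything that exists in the package.

```r
# 'digest0' keeps the current, class-agnostic behavior.
digest0 <- digest::digest

# Generic with a default that falls back to the raw digest.
digest <- function(object, ...) UseMethod("digest")
digest.default <- function(object, ...) digest0(object, ...)

# Illustrative class-specific method that ignores field 'x'.
digest.foo <- function(object, ...) digest0(object$y, ...)

a <- structure(list(x = 1, y = 2), class = "foo")
b <- structure(list(x = 2342342, y = 2), class = "foo")

generic_same <- identical(digest(a), digest(b))          # method ignores 'x'
raw_same     <- identical(digest0(a), digest0(b))        # full objects differ
default_same <- identical(digest("abc"), digest0("abc")) # default delegates
```

With this arrangement, a caller who wants the hash of the full object can always use digest0 regardless of which methods have been registered for the generic.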

-roger
On 10/16/07, hadley wickham <h.wickham at gmail.com> wrote:
#
Hi,

if there is a need for a digest0(), which there seems to be, we should
have one, but we should find a better name.

A better approach may be to keep digest() as is and introduce
hashCode() for the feature Hadley requested, e.g.

hashCode <- function(...) UseMethod("hashCode");
hashCode.default <- function(...) digest(...);

Personally, I think hashCode() is a more descriptive term of the
value/outcome whereas digest() describes the action.  Of course, some
of the arguments of digest() should be excluded from hashCode(), but
the above gives you the idea.

That would make the distinction clear, and it is very much in line with how
Java does it (sorry Dylan folks).  I think Java has a useful
setup with its hashCode() & equals() methods.  If you want the
details, here is one reference:

  http://www.geocities.com/technofundo/tech/java/equalhash.html

but the short story is that <quote>two objects that are "equal" must
produce the same hash code as long as they are equal, however unequal
objects need not produce distinct hash codes.</quote>.  The equals()
relationship should be reflexive, symmetric, transitive, consistent.
For details, see above URL.  These rules are very useful, but require
quite a bit of effort from the developer/maintainer in order to keep
them up to date and valid.
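
To make the contract concrete, here is a small sketch in R, assuming the digest package is installed. The generics hashCode and equals and the class "foo" are hypothetical names used only for illustration; nothing here is an existing API.

```r
# Hypothetical generics following the proposal above.
hashCode <- function(object, ...) UseMethod("hashCode")
hashCode.default <- function(object, ...) digest::digest(object, ...)

equals <- function(a, b, ...) UseMethod("equals")
equals.default <- function(a, b, ...) identical(a, b)

# For class "foo", assume only field 'y' defines the object's state.
# Both methods must look at the same fields to keep the contract.
hashCode.foo <- function(object, ...) digest::digest(object$y, ...)
equals.foo <- function(a, b, ...) inherits(b, "foo") && identical(a$y, b$y)

a <- structure(list(x = 1, y = 2), class = "foo")
b <- structure(list(x = 2342342, y = 2), class = "foo")
d <- structure(list(x = 1, y = 3), class = "foo")

# Contract: objects that are "equal" must produce the same hash code.
contract_ok <- !equals(a, b) || identical(hashCode(a), hashCode(b))
```

The maintenance burden mentioned above shows up in the pair hashCode.foo and equals.foo: if a new field becomes part of the object's state, both methods have to be updated together.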

Cheers

Henrik
On 10/16/07, Roger Peng <rdpeng at gmail.com> wrote:
#
On 16 October 2007 at 18:10, Henrik Bengtsson wrote:
| if there is a need for a digest0(), which there seems to be, we should
| have one, but we should find a better name.
|
| A better approach may be to keep digest() as is and introduce

Agreed.  It's better to keep the existing name and functionality.

| hashCode() for the feature Hadley requested, e.g.
| 
| hashCode <- function(...) UseMethod("hashCode");
| hashCode.default <- function(...) digest(...);
| 
| Personally, I think hashCode() is a more descriptive term of the
| value/outcome whereas digest() describes the action.  Of course, some
| of the arguments of digest() should be excluded from hashCode(), but
| the above gives you the idea.

Not sure I like the 'hashCode' name all that much.  How about some verbNoun
combination like 'createHash' ?
 
| That would make the distinction clear, and it is very much in line with how
| Java does it (sorry Dylan folks).  I think Java has a useful
| setup with its hashCode() & equals() methods.  If you want the
| details, here is one reference:
| 
|   http://www.geocities.com/technofundo/tech/java/equalhash.html
| 
| but the short story is that <quote>two objects that are "equal" must
| produce the same hash code as long as they are equal, however unequal
| objects need not produce distinct hash codes.</quote>.  The equals()
| relationship should be reflexive, symmetric, transitive, consistent.
| For details, see above URL.  These rules are very useful, but require
| quite a bit of effort from the developer/maintainer in order to keep
| them up to date and valid.

Yes, I am not sure I can guarantee that.  We can always try, though.

Thanks for the follow-up!

Dirk