Skip to content

Is there a hash data structure for R

13 messages · Eric Berger, Yonghua Peng, Duncan Murdoch +6 more

#
I know this is a newbie question. But how do I implement the hash structure
which is available in other languages (in python it's dict)?

I know there is the list, but list's names can be duplicated here.
$x

[1] 1 2 3 4 5


$y

 [1] "January"   "February"  "March"     "April"     "May"       "June"

 [7] "July"      "August"    "September" "October"   "November"  "December"


$x

[1] 3 4 5 6 7



Thanks a lot.
#
One choice is
new.env(hash=TRUE)
in the base package
On Tue, Nov 2, 2021 at 11:48 AM Yonghua Peng <yong at pobox.com> wrote:

            

  
  
#
True, but in a lot of cases where a python user might use a dict an R 
user will probably use a list; or when we are talking about arrays of 
dicts in python, the R solution will probably be a data.frame (with each 
dict field in a separate column).

Jan
On 02-11-2021 11:18, Eric Berger wrote:
#
But for data.frame the colnames can be duplicated. Am I right?

Regards.
On Tue, Nov 2, 2021 at 6:29 PM Jan van der Laan <rhelp at eoos.dds.nl> wrote:

            

  
  
#
On 02/11/2021 4:13 a.m., Yonghua Peng wrote:
As Eric said, the environment comes pretty close.  One difference 
between lists and environments is that environments are reference 
objects, so this changes e1$x:

   e1 <- new.env()
   e1$x <- 1
   e2 <- e1
   e2$x <- 2

Lists are not reference objects, so this doesn't change l1$x:

   l1 <- list()
   l1$x <- 1
   l2 <- l1
   l2$x <- 2

I don't know which behaviour you'd feel more comfortable with. 
Reference objects are pretty rare in R, so some R programmers would be 
surprised by the environment behaviour.  But in other languages they're 
more common, so maybe newbies would be more surprised by the list behaviour.

Another difference betwee environments and lists is how they handle 
assignments of NULL:

   e1$y <- NULL   # sets e1$y to NULL
   l1$y <- NULL   # does nothing here, would delete l1$y if it existed

I don't like the NULL handling in lists in R, but I'm used to it.

BTW, I don't understand why the possibility of duplicated names is an 
issue.  If you just avoid intentional duplication, you don't get it. 
For example, if instead of this:
you do this:

x <- list(x=1:5,y=month.name)
x$x <- 3:7

there is no duplication.  The only (?) ways to get duplication are in 
the initial construction of the list, or by explicit manipulation of the 
names attribute.  So be careful when you do those, and lists can work.

Duncan Murdoch
#
Yes. A data.frame is basically a list where all elements are vectors of 
the same length. So this issue also exists in a data.frame. However, the 
data.frame construction method will detect this and generate unique 
names (which also might not be what you want):

 > data.frame(a=1:3, a=1:3) 
 
 
 
 

   a a.1 
 
 
 
 
                                                                   1 1 
  1 
 
 
 
 
                                                             2 2   2 
 
 
 
 
 
                                                           3 3   3

But still with a little effort you can still create a data.frame with 
multiple columns with the same name. But as Duncan Murdoch mentions you 
can usually control for that.


Best,
Jan
On 02-11-2021 11:32, Yonghua Peng wrote:
#
On Tue, 2 Nov 2021 at 10:48, Yonghua Peng <yong at pobox.com> wrote:
structure
As other posters wrote then environments are the solution.
data.frames, vectors and lists are much slower and less useful to use as
key-value pairs.

Here are some code I somethings uses:

cache <- NULL

cache_set <- function(key, value) {
  assign(key, value, envir = cache)
}

cache_reset <- function() {
  cache <<- new.env(TRUE, emptyenv())
}

cache_get <- function(key) {
  get(key, envir = cache, inherits = FALSE)
}

cache_has_key <- function(key) {
  exists(key, envir = cache, inherits = FALSE)
}
cache_reset()
#
If you're thinking about using environments, I would suggest you initialize
them like


x <- new.env(parent = emptyenv())


Since environments have parent environments, it means that requesting a
value from that environment can actually return the value stored in a
parent environment (this isn't an issue for [[ or $, this is exclusively an
issue with assign, get, and exists)
Or, if you've already got your values stored in a list that you want to
turn into an environment:


x <- list2env(listOfValues, parent = emptyenv())


Hope this helps!
On Tue, Nov 2, 2021, 06:49 Yonghua Peng <yong at pobox.com> wrote:

            

  
  
#
Note that an environment carries a hash table with it, while a named list
does not.  I think that looking up an entry in a list causes a hash table
to be created and thrown away.  Here are some timings involving setting and
getting various numbers of entries in environments and lists.  The times
are roughly linear in n for environments and quadratic for lists.
FUN.VALUE=NA_real_)
[1] 0.00 0.00 0.00 0.02 0.03 0.06 0.15
[1]  0.01  0.03  0.15  0.53  2.66 13.66 56.05
function(n, L, V = sprintf("V%07d", sample(n, replace=TRUE))) {
    system.time(for(v in V)L[[v]]<-c(L[[v]],v))["elapsed"]
}

Note that environments do not allow an element named "" (the empty string).

Elements named NA_character_ are treated differently in environments and
lists, neither of which is great.  You may want your hash table functions
to deal with oddball names explicitly.

-Bill
On Tue, Nov 2, 2021 at 8:52 AM Andrew Simmons <akwsimmo at gmail.com> wrote:

            

  
  
#
I have several things I considered about this topic.

It is, in general, not possible to do some things in one language or another
even if you find a bridge. Python lets you place all kinds of things into a
dictionary including many static objects like tuples or even other
dictionaries. What is allowed for keys is quite broad. If you use an R
environment or list, there are restrictions on what names are allowed that
are not necessarily the same. 

Now on to key uniqueness. Yes, R allows multiple named entries to share the
same name. But  when I made such a structure, the FIRST instance hides any
later ones when accessing something like list.name$A or setting it. If you
remove an exiting entry by something like setting the above to NULL, though,
the second instance of that name nor becomes the first. So anyone wanting to
fully remove all instances might need to loop till sure all are gone. Not
sure about environments but they may behave better.

Third, multiple parties have built R packages to support hashing including
some no longer available:

https://cran.r-project.org/web/packages/hash/hash.pdf
https://www.rdocumentation.org/packages/Dict/versions/0.10.0

So  if one of those works, why reinvent it?

In any case, if you roll your own, as has been shown by others, you may have
to provide getter and setter functionality and so on to make sure of things
that you want for compatibility and be Careful as any other programs that
can play with your data may go around you.

Finally, someone mentioned how creating a data.frame with duplicate names
for columns is not a problem as it can automagically CHANGE them to be
unique. That is a HUGE problem for using that as a dictionary as the new
name will not be known to the system so all kinds of things will fail.

And there are also packages for many features like sets as well as functions
to manipulate these things.

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Bill Dunlap
Sent: Tuesday, November 2, 2021 1:26 PM
To: Andrew Simmons <akwsimmo at gmail.com>
Cc: R Help <r-help at r-project.org>
Subject: Re: [R] Is there a hash data structure for R

Note that an environment carries a hash table with it, while a named list
does not.  I think that looking up an entry in a list causes a hash table to
be created and thrown away.  Here are some timings involving setting and
getting various numbers of entries in environments and lists.  The times are
roughly linear in n for environments and quadratic for lists.
FUN.VALUE=NA_real_)
[1] 0.00 0.00 0.00 0.02 0.03 0.06 0.15
[1]  0.01  0.03  0.15  0.53  2.66 13.66 56.05
function(n, L, V = sprintf("V%07d", sample(n, replace=TRUE))) {
    system.time(for(v in V)L[[v]]<-c(L[[v]],v))["elapsed"] }

Note that environments do not allow an element named "" (the empty string).

Elements named NA_character_ are treated differently in environments and
lists, neither of which is great.  You may want your hash table functions to
deal with oddball names explicitly.

-Bill
On Tue, Nov 2, 2021 at 8:52 AM Andrew Simmons <akwsimmo at gmail.com> wrote:

            
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
#
On 03-11-2021 00:42, Avi Gross via R-help wrote:

            
I think you are referring to my remark which was:

 > However, the data.frame construction method will detect this and
 > generate unique names (which also might not be what you want):

I didn't say this means that duplicate names aren't a problem; I just 
mentioned the the behaviour is different. Personally, I would actually 
prefer the behaviour of list (keep the duplicated name) with a warning.

Most of the responses seem to assume that the OP actually wants a hash 
table. Yes, he did ask for that and for a hash table an environment 
(with some work) would be a good option. But in many cases, where other 
languages would use a hash-table-like object (such as a dict) in R you 
would use other types of objects. Furthermore, for many operations where 
you might use hash tables to implement the operation, R has already 
built in options, for example %in%, match, duplicated. These are also 
vectorised; so two vectors: one with keys and one with values might 
actually be faster than an environment in some use cases.

Best,
Jan
#
Jack, I was agreeing with you and pointing out that although changing names
of columns to be unique has a positive side, it makes it hard to use for
anything that needs to look like either a set or a bag and of course a
dictionary/hash. All the above want to put things in using some identifier
and expect to get back the same.

R actually has other places names can be changed or dynamically created
often with defaults. This can be convenient or annoying as you need to look
at the resulting names to be able to use them. A feature that is nice in
some programs and parts of the tidyverse packages is the ability to specify
things like suffixes to be used.

I use lots of computer languages and keep running into people who expect
another language to just support what the first does. If that were true, we
might not create so many. If your other language supports indefinite sized
integers, good for you. Many others do not and perhaps may not easily do so
even if you try to create your own emulation if some other code gets your
version and does not work on it.

So assuming you use an available package or roll your own and can now make
some kind of hash data structure. As you pointed out, hashing may not be
required if your implementation is already fast and hashing can use lots of
memory. What are the allowed keys in your implementation? Will an integer of
1 be distinct from a floating point of 1.0? Can you hash objects of the
half-dozen or so kinds R seems to have? Whatever your answers are to many
questions like these, you may not get quite the same answers in Python or
PERL or ...

R is not really meant to be a general-purpose language although it can be.
It is not really fully-blown Object Oriented  but in many ways it can be
using newer grafts. So if you need or want the things R is designed for and
in particular there are good packages available for your needs, then use R.
If your needs include things not easily part of R, and you have a language
that works for you, why switch?

I will note that there is an intermediate path. I often run programs in
RSTUDIO that include code for both R and Python. The data structures used
sort of convert back and forth as needed so you can begin in R and read in
data and make some changes and then hand it over to Python for more, perhaps
multiple times, then generate a graph within R or whatever.

So if you want a dictionary, you can sort of keep it on the Python side and
use Python commands to create it and add to it or access contents. The
results may be handed over to the R side as needed but not as a dict but
instead to a pairlist or named list/vector or whatever and when needed, you
can have python take your results. The same applies to lots of other things
Python does that R may not do quite the same or at all. I mean generators
and all kinds of object-oriented programs including multiple inheritance and
so much more. You can use each language for what it is good at or that work
with the way you think and with some overhead get the best of both worlds.

Of course, this comes with costs and any programs you send out to be used by
others would require both languages to be installed properly and ...

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Jan van der Laan
Sent: Wednesday, November 3, 2021 5:47 AM
To: r-help at r-project.org
Subject: Re: [R] Is there a hash data structure for R
On 03-11-2021 00:42, Avi Gross via R-help wrote:

            
fail.

I think you are referring to my remark which was:

 > However, the data.frame construction method will detect this and  >
generate unique names (which also might not be what you want):

I didn't say this means that duplicate names aren't a problem; I just
mentioned the the behaviour is different. Personally, I would actually
prefer the behaviour of list (keep the duplicated name) with a warning.

Most of the responses seem to assume that the OP actually wants a hash
table. Yes, he did ask for that and for a hash table an environment (with
some work) would be a good option. But in many cases, where other languages
would use a hash-table-like object (such as a dict) in R you would use other
types of objects. Furthermore, for many operations where you might use hash
tables to implement the operation, R has already built in options, for
example %in%, match, duplicated. These are also vectorised; so two vectors:
one with keys and one with values might actually be faster than an
environment in some use cases.

Best,
Jan
environments and quadratic for lists.
string).
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
#
Here's an interesting article:
    Collections in R: Review and Proposal
    Timothy Barry
        The R Journal
        doi: 10.32614/RJ-2018-037
    https://journal.r-project.org/archive/2018/RJ-2018-037/RJ-2018-037.pdf
On Tue, Nov 2, 2021 at 10:48 PM Yonghua Peng <yong at pobox.com> wrote: