nmax parameter in factor function

4 messages · Ramnik Bansal, Bert Gunter, Peter Dalgaard

#
I have been trying to understand how the argument 'nmax' works in
the 'factor' function. The R documentation states: "Since factors typically
have quite a small number of levels, for large vectors x it is helpful
to supply nmax as an upper bound on the number of unique values."

In the code below, what is the reason for the error when the value of nmax is
24? Why did the same error not occur with nmax = 25, and how come
there are 26 levels when nmax = 25?
[1] a b c d e f g h i j k l m n o p q r s t u v w x y z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
[1] a b c d e f g h i j k l m n o p q r s t u v w x y z
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
Error in unique.default(x, nmax = nmax) : hash table is full
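
The calls that produced the output above are missing from this archived copy. A plausible reconstruction is sketched here, assuming the input was the built-in `letters` vector and that the first call used nmax = 26 (only the nmax = 25 and nmax = 24 cases are stated in the text):

```r
# Assumed reconstruction; the original commands are not in the archive.
x <- letters                      # 26 unique values (assumption)

f26 <- factor(x, nmax = 26)       # assumed first call
f25 <- factor(x, nmax = 25)       # reported above to succeed with 26 levels
nlevels(f25)

# Reported above to fail with "hash table is full".
# tryCatch captures the message so the script keeps running.
res24 <- tryCatch(factor(x, nmax = 24),
                  error = function(e) conditionMessage(e))
res24
```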
#
Well, you won't like this, but it is kind of wimpily (is that a word?)
documented:

If you check the code of factor(), you will see that nmax appears as
an argument in a call to unique(). ?unique says for nmax, "... see
duplicated". And ?duplicated says:

"If nmax is set too small there is liable to be an error: nmax = 1 is
silently ignored."

So sometimes, when nmax is too small, you get the hash table error
message; and sometimes the nmax argument apparently just gets
ignored:
[1] TRUE
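
The command behind the `[1] TRUE` above did not survive in the archive, so what follows is not necessarily what was run. One check consistent with the quoted ?duplicated text, assuming nmax = 1 really is silently ignored, would be:

```r
# ?duplicated says nmax = 1 is silently ignored, so unique() should
# return every letter rather than raising an error.
u <- unique(letters, nmax = 1)
identical(u, letters)
```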

and that, to paraphrase what Rodgers and Hammerstein said about Kansas City,
is about "as fer as I can go."

(http://lyricsplayground.com/alpha/songs/e/everythingsuptodateinkansascity.shtml)

Cheers,
Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Sat, Jun 3, 2017 at 6:14 PM, Ramnik Bansal <ramnik.bansal at gmail.com> wrote:
#
I'll go just a bit "fer-er." It appears the anomaly -- I hesitate to
call it a bug -- is in the C code for duplicated.default():
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Error in duplicated.default(letters[1:10], nmax = 8) : hash table is full
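
Only the failing call is visible (in the error message); the two calls before it were stripped from the archive. A sketch that would give output of this shape, assuming those two calls used nmax = 10 and nmax = 9:

```r
x <- letters[1:10]                        # 10 unique values, as in the error message

d10 <- duplicated(x, nmax = 10)           # assumed first call
d9  <- duplicated(x, nmax = 9)            # assumed second call: one less than the unique count
d8  <- tryCatch(duplicated(x, nmax = 8),  # the call shown in the error message
                error = function(e) conditionMessage(e))

d10
d9
d8
```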

Cleverer folks than I must now explain (and possibly correct me).

Cheers,
Bert


On Sat, Jun 3, 2017 at 9:11 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
#
No anomaly, it is just that you need to know what it is for, before trying to use it. 

Basically, duplicated() works by looking up entries in a hash table (for which there is a substantial literature, just google it). This will be somewhat more efficient if you know the number of unique values in advance (otherwise the table is the same size as the input vector), so you have the option of setting nmax. If you set nmax too small, you get to keep both pieces (here, the "hash table is full" error).

nmax is directly linked to a variable in C code, and I expect that 0-based indexing is the reason that nmax can be one less than the actual number of unique values.
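
To make that guess concrete, here is a toy model in R. It is not R's actual C code: `toy_duplicated` and `budget` are hypothetical names, and the countdown-to-zero check is only an assumption about how a 0-based counter could let nmax + 1 distinct values through.

```r
# Toy model only; assumes non-NA, non-empty atomic values.
toy_duplicated <- function(x, nmax) {
  seen   <- new.env(hash = TRUE)   # stands in for the hash table
  budget <- nmax                   # plays the role of the C variable tied to nmax
  out    <- logical(length(x))
  for (i in seq_along(x)) {
    key <- as.character(x[i])
    if (exists(key, envir = seen, inherits = FALSE)) {
      out[i] <- TRUE               # already in the table: a duplicate
    } else {
      # Check first, then count down from nmax to 0, so nmax + 1
      # distinct values fit before the check fails.
      if (budget < 0) stop("hash table is full")
      budget <- budget - 1
      assign(key, TRUE, envir = seen)
    }
  }
  out
}

toy_duplicated(letters[1:10], nmax = 9)   # 10 unique values fit
try(toy_duplicated(letters[1:10], nmax = 8))
```

Under this model, nmax = 9 accepts ten unique values and nmax = 8 fails on the tenth, which matches the pattern reported earlier in the thread.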

-pd