
4-int indexing limit of R {Re: [R] allocMatrix limits}

4 messages · Vadim Kutsyy, Martin Maechler, Brian Ripley

#
[[Topic diverted from R-help]]

        
VK> Martin Maechler wrote:
>> 
      VK> The problem is in array.c, where allocMatrix checks for
      VK> "if ((double)nrow * (double)ncol > INT_MAX)".  But why
      VK> is int used and not long int for indexing?  (max int is
      VK> 2147483647, max long int is 9223372036854775807)

    >> Well, Brian gave you all the info:
    >>  (  ?Memory-limits )

    VK> exactly, and given that most modern systems used for
    VK> computations (i.e. 64-bit systems) have a long int which is
    VK> much larger than int, I am wondering why long int is not
    VK> used for indexing (I don't think that 4-byte vs 8-byte
    VK> storage is an issue).

Well, fortunately, reasonable compilers have indeed kept 
'long' == 'long int'  to mean 32-bit integers
((less reasonable compiler writers have not, AFAIK: which leads
  of course to code that no longer compiles correctly when
  originally it did))
But of course you are right that  64-bit integers
(typically == 'long long', and really == 'int64') are very
natural on 64-bit architectures.
But see below.


    >> Did you really carefully read ?Memory-limits ??

    VK> Yes, it is specified that a 4-byte int is used for indexing
    VK> in all versions of R, but why?  I think 2147483647
    VK> elements for a single vector is OK, but not as the total
    VK> number of elements for a matrix.  I am running out of
    VK> indexing at a mere 10% memory consumption.

If you have too large a numeric matrix, it would be larger than
2^31 * 8 bytes = 2^34 bytes ~= 16'000 Megabytes  (2^34 / 2^20 = 16'384).
If that is only 10% for you, you'd have around 160 GB of
RAM.  That's quite impressive.
I agree that it is at least in the "ball park" of what is
available today.
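
For concreteness, a minimal standalone C sketch of that kind of check
(illustrative values only, not R's actual source):

------------------------------------------------------------------------------------
#include <limits.h>
#include <stdio.h>

int main(void)
{
    int nrow = 50000, ncol = 50000;   /* 2.5e9 elements: more than INT_MAX */

    /* the same style of guard as allocMatrix in array.c: do the product in
       double so the multiplication itself cannot overflow int arithmetic */
    if ((double) nrow * (double) ncol > INT_MAX)
        printf("too many elements for 32-bit int indexing (INT_MAX = %d)\n",
               INT_MAX);

    /* a full numeric (double) matrix at the limit needs roughly 16'000 MB */
    double bytes = (double) INT_MAX * sizeof(double);
    printf("INT_MAX doubles ~= %.0f Megabytes\n", bytes / (1024.0 * 1024.0));
    return 0;
}
------------------------------------------------------------------------------------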


[........]

    VK> PS: I have no problem going in and modifying the C code, but I am
    VK> just wondering what the reasons are for having such a
    VK> limitation.

Compatibility, for one:

Note that R objects are (pointers to) C structs that are
"well-defined" platform-independently, and I'd say that this
should remain so.
Consequently, 64-bit ints (or another "longer int") would have to be
there "in R" also on 32-bit platforms.  That may well be feasible,
but it would double the size of quite a few objects.

I think what you are implicitly proposing is that
we'd want a 64-bit integer as an R-level type, and that R
would use it (and/or coerce to it from 'int32') for indexing
everywhere.

But more importantly, all (or very much of) the currently
existing C and Fortran code (called via .Call(), .C(), .Fortran())
would also have to be able to deal with the "longer ints".

One of the last times this topic came up (within R-core),
we found that for all the matrix/vector operations,
we really would need versions of BLAS / LAPACK that would also
work with these "big" matrices, i.e. such a BLAS/LAPACK would
also have to internally use "longer int" for indexing.
At that point in time, we had decided we would at least wait to
hear about the development of such BLAS/LAPACK libraries.
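
To see where the 32-bit limit enters on that side: in the usual (LP64)
C interface to the reference BLAS, every dimension and leading-dimension
argument is a plain 32-bit int.  A sketch of the conventional dgemm
prototype (the exact declaration differs between BLAS builds; an "ILP64"
build would use 64-bit integers here instead):

------------------------------------------------------------------------------------
/* conventional C prototype for the Fortran reference-BLAS dgemm:
   m, n, k and lda, ldb, ldc are 32-bit ints, so no dimension (or leading
   dimension) larger than 2^31 - 1 can even be passed in */
extern void dgemm_(const char *transa, const char *transb,
                   const int *m, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb,
                   const double *beta, double *c, const int *ldc);
------------------------------------------------------------------------------------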

Interested to hear other opinions / get more info on this topic.
I do agree that it would be nice to get over this limit within a
few years.

Martin
#
Martin Maechler wrote:
Well, on 64-bit Ubuntu, /usr/include/limits.h defines:

/* Minimum and maximum values a `signed long int' can hold.  */
#  if __WORDSIZE == 64
#   define LONG_MAX     9223372036854775807L
#  else
#   define LONG_MAX     2147483647L
#  endif
#  define LONG_MIN      (-LONG_MAX - 1L)

and using simple test code 
(http://home.att.net/~jackklein/c/inttypes.html#int), my desktop, which 
is a standard Intel computer, does show:

Signed long min: -9223372036854775808 max: 9223372036854775807
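
(A minimal test along those lines -- a sketch, not the linked program
itself -- would be something like:)

------------------------------------------------------------------------------------
#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* print the range of `signed long int' on this platform */
    printf("Signed long min: %ld max: %ld\n", LONG_MIN, LONG_MAX);
    printf("sizeof(int) = %zu, sizeof(long) = %zu\n",
           sizeof(int), sizeof(long));
    return 0;
}
------------------------------------------------------------------------------------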
>  cat /proc/meminfo | grep MemTotal
MemTotal:     145169248 kB

We have a "smaller" SGI NUMAflex to play with, where the memory can be 
increased to 512 GB (the "larger" version doesn't have this "limitation").  
But even with commodity hardware you can easily get 128 GB for a reasonable 
price (e.g. a Dell PowerEdge R900).
I forgot that R stores a two-dimensional array in a single-dimensional C 
array.  Now I understand why there is a limitation on the total number of 
elements.  But this is a big limitation.
BLAS supports a two-dimensional matrix definition, so if we stored a 
matrix as a two-dimensional object, we would be fine.  But then all R code, 
as well as all packages, would have to be modified.
#

        
VK> Martin Maechler wrote:
>> [[Topic diverted from R-help]]
    >> 
    >> Well, fortunately, reasonable compilers have indeed kept
    >> 'long' == 'long int' to mean 32-bit integers ((less
    >> reasonable compiler writers have not, AFAIK: which leads
    >> of course to code that no longer compiles correctly when
    >> originally it did)) But of course you are right that
    >> 64-bit integers (typically == 'long long', and really ==
    >> 'int64') are very natural on 64-bit architectures.  But
    >> see below.

... I wrote complete rubbish, 
and I am embarrassed ...

    >> 
    VK> Well, on 64-bit Ubuntu, /usr/include/limits.h defines:

    VK> /* Minimum and maximum values a `signed long int' can hold.  */
    VK> #  if __WORDSIZE == 64
    VK> #   define LONG_MAX     9223372036854775807L
    VK> #  else
    VK> #   define LONG_MAX     2147483647L
    VK> #  endif
    VK> #  define LONG_MIN      (-LONG_MAX - 1L)

    VK> and using simple test code 
    VK> (http://home.att.net/~jackklein/c/inttypes.html#int), my desktop, which 
    VK> is a standard Intel computer, does show:

    VK> Signed long min: -9223372036854775808 max: 9223372036854775807

yes.  I am really embarrassed.

What I was trying to say was that
the definition of int / long / ... should not change when going
from a 32-bit architecture to a 64-bit one,
and that the R internal structures consequently should also be
the same on 32-bit and 64-bit platforms.

    >> If you have too large a numeric matrix, it would be larger than
    >> 2^31 * 8 bytes = 2^34 bytes ~= 16'000 Megabytes  (2^34 / 2^20 = 16'384).
    >> If that is only 10% for you, you'd have around 160 GB of
    >> RAM.  That's quite impressive.
    >> 
    >> cat /proc/meminfo | grep MemTotal
    VK> MemTotal:     145169248 kB

    VK> We have a "smaller" SGI NUMAflex to play with, where the memory can be 
    VK> increased to 512 GB (the "larger" version doesn't have this "limitation").  
    VK> But even with commodity hardware you can easily get 128 GB for a reasonable 
    VK> price (e.g. a Dell PowerEdge R900).

    >> Note that R objects are (pointers to) C structs that are
    >> "well-defined" platform independently, and I'd say that this
    >> should remain so.

    >> 
    VK> I forgot that R stores a two-dimensional array in a single-dimensional C 
    VK> array.  Now I understand why there is a limitation on the total number of 
    VK> elements.  But this is a big limitation.

Yes, maybe.

    >> One of the last times this topic came up (within R-core),
    >> we found that for all the matrix/vector operations,
    >> we really would need versions of BLAS / LAPACK that would also
    >> work with these "big" matrices, i.e. such a BLAS/LAPACK would
    >> also have to internally use "longer int" for indexing.
    >> At that point in time, we had decided we would at least wait to
    >> hear about the development of such BLAS/LAPACK libraries.

    VK> BLAS supports a two-dimensional matrix definition, so if we stored a 
    VK> matrix as a two-dimensional object, we would be fine.  But then all R code, 
    VK> as well as all packages, would have to be modified.

Exactly.  And that was what I meant when I said "Compatibility".

But rather than changing the
"matrix = column-wise stored as a long vector" paradigm, we should
rather change from 32-bit indexing to a longer one.

The hope is that we eventually come up with a scheme
which would basically allow all packages to simply be recompiled:

In src/include/Rinternals.h,
we have had the following three lines for several years now:
------------------------------------------------------------------------------------
/* type for length of vectors etc */
typedef int R_len_t; /* will be long later, LONG64 or ssize_t on Win64 */
#define R_LEN_T_MAX INT_MAX
------------------------------------------------------------------------------------

and you are right that it may be time to experiment a bit more
with replacing 'int' with 'long' (and adjusting the corresponding _MAX
setting) there;
and indeed, in the array.c code you cited, INT_MAX should be
replaced by R_LEN_T_MAX.
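
As a sketch only (the *_EXPERIMENTAL names below are made up for
illustration; the real definitions live in src/include/Rinternals.h and
src/main/array.c), the experiment would look roughly like:

------------------------------------------------------------------------------------
#include <limits.h>

/* sketch: widen R's length type and use the matching _MAX in the
   allocMatrix-style check, instead of hard-coding INT_MAX */
typedef long R_len_t_experimental;            /* was: typedef int R_len_t         */
#define R_LEN_T_MAX_EXPERIMENTAL LONG_MAX     /* was: #define R_LEN_T_MAX INT_MAX */

static int too_many_elements(R_len_t_experimental nrow, R_len_t_experimental ncol)
{
    /* same guard as in array.c, but against the wider limit */
    return (double) nrow * (double) ncol > (double) R_LEN_T_MAX_EXPERIMENTAL;
}
------------------------------------------------------------------------------------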

This still does not solve the problem that we'd have to get to
a BLAS / LAPACK version that correctly works with "long indices" ...
which may (or may not) be easier than I had thought.

Martin
1 day later
#
There are several issues here, and a good knowledge of the R Internals 
manual seems a prerequisite (and, considering where this started, of the 
relevant help pages!).

R uses its integer type for indexing vectors and arrays (which can be 
indexed as vectors or via an appropriate number of indices).  So if we 
allow more than 2^31-1 elements we need a way to index them in R.  One 
idea would be to make R's integer type int64 on suitable platforms, but 
that would have really extensive ramifications (in the R sources and in 
packages).  Another idea (I think suggested by Luke Tierney) is to allow 
double() indices which might be less disruptive.  Adding a new type would 
be a very serious amount of work (speaking as someone who has done it).
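
The attraction of double() indices is that an IEEE-754 double represents
every integer up to 2^53 exactly, far beyond 2^31 - 1; a small
illustration in C (a sketch, not R code):

------------------------------------------------------------------------------------
#include <stdio.h>

int main(void)
{
    /* doubles hold every integer exactly up to 2^53, so they could index
       vectors far beyond the 2^31 - 1 limit of a 32-bit int */
    double exact = 9007199254740992.0;   /* 2^53: still represented exactly    */
    double next  = exact + 1.0;          /* 2^53 + 1: rounds back down to 2^53 */
    printf("2^53     = %.0f\n", exact);
    printf("2^53 + 1 = %.0f   (precision is lost only here, not at 2^31)\n", next);
    return 0;
}
------------------------------------------------------------------------------------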

Another issue is the use of Fortran for matrix algebra.  That is likely to 
cause portability issues, and there's no point in doing it unless one has 
an efficient implementation of e.g. BLAS/LAPACK, the reference 
implementation being slow at even 100 million elements.  (That's probably not an 
empty set, as I see ACML has an int64 BLAS.)

There are lots of portability issues -- e.g. the current save() format is 
the same on all systems and we have complete interoperability.  (That 
cannot be the case if we allow big vectors to be saved.)

But at present I don't see a significant number of applications any time 
soon.  2 billion items in a homogeneous group is a *lot* of data.  I know 
there are applications with 2 billion items of data already, but is it 
appropriate to store them in a single vector or matrix rather than, say, a 
data frame or DBMS tables?  And will there 'ever' be more than a tiny 
demand for such applications?  (I am excluding mainly-zero vectors, and 
Martin has already pointed out that we have ways to deal with those.)

It is three or four years since we first discussed some of the options, 
and at the time we thought it would be about five years before suitably 
large machines became available to more than a handful of R users.  That 
still seems about right: >=128GB systems (which is about what you need 
for larger than 16GB objects) may start to become non-negligible in a year 
or two.

R is a volunteer project with limited resources -- there are AFAIK fewer 
than a handful of people with the knowledge of R internals to tackle these 
issues.  Only if one of them has a need to work with larger datasets is 
this likely to be worked on.