
Large vector support in data.frames

4 messages · Ivan Krylov, Simon Urbanek, Jan van der Laan +1 more

On Wed, 19 Jun 2024 09:52:20 +0200, Jan van der Laan <rhelp at eoos.dds.nl> wrote:
Apologies if you've already received a better answer off-list.
The second point is not really an issue - R already uses numerics for larger-than-32-bit indexing at R level and it works just fine for objects up to ca. 72 petabytes.
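A quick sketch of why double-based indexing reaches that far (the 72-petabyte figure assumes 8-byte elements; everything here is base R):

```r
# Doubles represent every integer exactly up to 2^53, so they can serve
# as indices well past the 2^31 - 1 limit of R's integer type.
.Machine$integer.max   # 2147483647, i.e. 2^31 - 1
2^53 == 2^53 + 1       # TRUE: beyond 2^53 doubles lose integer exactness
# 2^53 addressable elements * 8 bytes each = 2^56 bytes, ca. 72 PB.
x <- c(10, 20, 30)
x[2.0]                 # a double index works at R level
```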

However, the first one is a bit more relevant than one would think. At one point I experimented with allowing data frames with more than 2^31 rows, but it breaks in many places - some quite unexpected. Besides dim() there is also the issue with (non-expanded) row names. Overall, it is a lot more work - some would have to be done in R but some would require changes to packages as well.
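The row-names issue is visible on any small data frame: the compact (non-expanded) representation stores an integer row count, which is exactly what caps the number of rows (a sketch using base R's `.row_names_info`):

```r
# Automatic row names are stored compactly as c(NA_integer_, -n):
# an *integer* count, so n cannot exceed .Machine$integer.max.
df <- data.frame(x = 1:5)
.row_names_info(df, type = 2L)  # the row count, here 5
# dim() has the same constraint: it returns integer counts, so a
# data frame with > 2^31 - 1 rows has no representable dim() either.
dim(df)                         # c(5L, 1L)
```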

(In practice I use sharded data frames for large data which removes the limit and allows parallel processing - but requires support from the methods that will be applied to them).
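A minimal illustration of the sharding idea (the `shard_df` helper is hypothetical, not an existing package API): keep the data as a list of ordinary data frames and let each operation map over the shards.

```r
# Hypothetical helper: split a data.frame's rows into n_shards pieces.
shard_df <- function(df, n_shards) {
  idx <- rep(seq_len(n_shards), length.out = nrow(df))
  split(df, idx)
}
shards <- shard_df(data.frame(x = 1:10), 2)
# Operations are applied shard by shard (and could run in parallel);
# per-shard results are then combined.
total <- sum(vapply(shards, function(s) sum(s$x), numeric(1)))
total  # same as sum(1:10)
```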

Cheers,
Simon
Ivan, Simon,

Thanks for the replies.

I can work around the limitation. I currently either divide the data 
into shards or use a list with (long) vectors depending on what I am 
trying to do. But I have to transform between the two representations 
which takes time and memory and often need more code than I would have 
if I could have used data.frames.

Being able to create large (> 2^31-1 rows) data.frames and doing some 
basic things like selecting rows and columns, would already be really 
nice. That would also allow package maintainers to start supporting 
these data.frames. I imagine getting large data.frames working in 
functions like lm is not trivial, and lm might not support this any 
time soon. However, a package like biglm might.
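The chunked approach that biglm takes can be sketched in base R: accumulate the normal-equation pieces X'X and X'y over row chunks, so no single object ever needs to hold all rows (a sketch of the idea, not biglm's actual interface):

```r
# Simulate three row chunks of y = 2 + 3x + noise.
set.seed(1)
chunks <- replicate(3, {
  x <- rnorm(100)
  data.frame(x = x, y = 2 + 3 * x + rnorm(100))
}, simplify = FALSE)

# Accumulate X'X and X'y chunk by chunk.
xtx <- matrix(0, 2, 2)
xty <- numeric(2)
for (ch in chunks) {
  X <- cbind(1, ch$x)
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, ch$y)
}
coef_hat <- solve(xtx, xty)  # estimates close to c(2, 3)
```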

But from what you are saying, I get the impression that this is not 
something that is being actively worked on. I must say, my hands are 
kind of itching to try.

Best,
Jan
On 03-07-2024 09:22, Simon Urbanek wrote:
Unfortunately, as has been noted, some changes require many parties to change at once, and things can break badly when an unchanged component is reached. If integers have a fixed size, the implementation can be straightforward, and you can reuse libraries and components that are already in use and tested, including ones written in languages like C.

Python is an example of another approach: its built-in integer type has arbitrary precision. But that can hurt efficiency, so extensions that provide their versions of data frames often let you choose among several fixed-length integer types for performance.

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel