stopping finalizers
The subset table isn't a copy of the subset; it contains the unique key and an indicator column showing whether each element is in the subset. I need this even if the subset is never modified, so that I can join it to the main table and use it in SQL 'where' conditions to get computations for the right subset of the data.
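To make the key-plus-indicator layout concrete, here is a minimal sketch of the idea, not the actual sqlsurvey code. It uses Python's built-in sqlite3 module as a stand-in for MonetDB, and every table and column name (main_data, subset_ind, in_subset, wt) is a hypothetical chosen for illustration.

```python
import sqlite3

# SQLite stands in for MonetDB here; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_data (id INTEGER PRIMARY KEY, age INTEGER, wt REAL);
    CREATE TABLE subset_ind (id INTEGER PRIMARY KEY, in_subset INTEGER);
""")
conn.executemany(
    "INSERT INTO main_data VALUES (?, ?, ?)",
    [(1, 25, 1.5), (2, 40, 2.0), (3, 67, 1.0), (4, 70, 3.0)],
)

# One row per key in the main table: 1 if the row is in the subset, else 0.
# The subset condition (age >= 65) is only an example.
conn.execute("INSERT INTO subset_ind SELECT id, age >= 65 FROM main_data")


def subset_weight_total(conn):
    """Sum the weights of rows flagged as in the subset, entirely in SQL."""
    row = conn.execute("""
        SELECT SUM(m.wt)
        FROM main_data AS m
        JOIN subset_ind AS s ON m.id = s.id
        WHERE s.in_subset = 1
    """).fetchone()
    return row[0]


total = subset_weight_total(conn)
n_indicator_rows = conn.execute("SELECT COUNT(*) FROM subset_ind").fetchone()[0]
```

The indicator table has one row per key whether or not that row is in the subset, so the main table is never copied and the aggregation stays in the database.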
Cool. Is that faster than storing a column that just contains the indices of the included rows?
The whole point of this new sqlsurvey package is that most of the aggregation operations happen in the database rather than in R, which is faster for very large data tables. The use case is things like the American Community Survey and the Nationwide Emergency Department Subsample, with millions or tens of millions of records and quite a lot of variables. At this scale, loading stuff into memory isn't feasible on commodity desktops and laptops, and even on computers with enough memory, the database (MonetDB) is faster.
Have you done any comparisons of MonetDB vs SQLite? I'm interested to know how much faster it is. I'm working on a package (https://github.com/hadley/dplyr) that compiles R data manipulation expressions into (e.g.) SQL, and have been wondering if it's worth considering a column-store like MonetDB.
Hadley
Chief Scientist, RStudio http://had.co.nz/