allocVector bug ?
On Thursday 09 November 2006 12:21 pm, Luke Tierney wrote:
On Wed, 8 Nov 2006, Vladimir Dergachev wrote:
On Wednesday 08 November 2006 12:56 pm, Luke Tierney wrote:
On Mon, 6 Nov 2006, Vladimir Dergachev wrote:
Hi Luke,
Yes, I gladly concede the point that for a heuristic algorithm the
notion of what is a "bug" is murky (besides crashes, etc, which is not
what I am talking about).
Here is why I called this a bug:
1. My understanding is that each time gc() needs to increase memory
it performs a full garbage collection run. Right ?
The allocation process does not call gc before every call to malloc. It only calls gc if the allocation would cross a threshold level. Those threshold levels are adjusted in an effort to compromise between keeping the memory footprint low and not calling gc too often. The code you quote below is part of this adjustment process. If this process is working properly, then as memory use grows there will initially be more gc activity and then less as the thresholds adjust.
Well, I was seeing it call gc for every large vector. This probably happens only for those larger than R_VGrowIncrFrac * R_NSize. On my system R_NSize is never more than 1e6, so this would explain the problems when using vectors of size 1e6 and larger.
2. This is not a problem with small memory sizes as they imply
(presumably) small number of objects.
3. However, if one wants to allocate many objects (say columns in a
data frame or just vectors) this results in large penalty
Example 1: This simulates allocation of a data.frame with some character
columns which are assumed to be factors. On my system first assignment is
nearly instantaneous, while subsequent assignments take slightly less than
0.1 seconds each.
I'm not sure these are quite doing what you intend. You define Chars but don't use it. Also, system.time by default calls gc() before doing the evaluation. Giving FALSE as the second argument may give you a more realistic picture.
The Chars are defined to create lots of ncells and make gc() run time more realistic. It also mimics having a data.frame with a few factor columns. As for system.time - thank you, I missed that ! Setting gcFirst=FALSE changes behavior in the first example to be 2 times faster and makes all the allocations in the second example faster. I guess that extra call to gc() caused R_VSize to shrink too fast.
I looked more carefully at your code in src/main/memory.c, function
AdjustHeapSize:
    R_VSize = VNeeded;
    if (vect_occup > R_VGrowFrac) {
        R_size_t change = R_VGrowIncrMin + R_VGrowIncrFrac * R_NSize;
        if (R_MaxVSize - R_VSize >= change)
            R_VSize += change;
    }
Could it be that R_NSize should be R_VSize ? This would explain why I see
a problem in case R_VSize>>R_NSize.
That does indeed look like a bug and that R_NSize should be R_VSize -- well spotted, thanks. I will need to experiment with this a bit more to see if it can safely be changed. It will increase the memory footprint a bit. Probably not by enough to matter, but if it does we may need to adjust some of the tuning constants.
Would there be something I can help you with ? Is there a script to run
through common usage patterns ?
thank you !
Vladimir Dergachev
Best, luke