
[R-pkg-devel] Unused data is silently kept in the environment of a function

4 messages · Samuel Granjeaud, Duncan Murdoch

#
Dear all,

I want to compute processing functions to apply to the data.
I apply the functions to the data in a second step.
proc_0 increases the memory, proc_1 is safe.
reprex below.

If this behavior is known, could you tell me a workaround before I try 
to guess the best one?

Best,
Samuel

``` r
# for memory tracking
library(pryr)

# a class
setClass(
    "fb",
    slots = list(d = "numeric", f = "list"),
    prototype=list(d = NULL, f = NULL)
)

# memory increased: keep dat somewhere and link it back to the returned value
proc_0 <- function(x) {
    dat = sample(x@d)
    cofactors = c(mean(dat), median(dat), IQR(dat))
    model = sapply(cofactors, function(cofactor) function(z) z / cofactor)
    x@f = list(model)
    x
}

# init data
mem_used()
#> 47 MB
a = new("fb")
a@d = sample(rnorm(1e7))
a@f = list()
mem_used()
#> 127 MB
# memory increased by 80 MB
# process
b = proc_0(a)
mem_used()
#> 207 MB
# memory increased by 80 MB again
rm(a)
mem_used()
#> 207 MB
# memory didn't decrease
b@d = b@d + 1
mem_used()
#> 287 MB
# memory increased
# b@d was really pointing to a@d before increment
sapply(1:3, function(i) ls(environment(b@f[[1]][[i]])))
#> [1] "cofactor" "cofactor" "cofactor"
sapply(1:3, function(i) get("cofactor", environment(b@f[[1]][[i]])))
#> [1] -0.0003085559  0.0001107148  1.3485980291
# environments look fine
rm(b)
mem_used()
#> 47.5 MB
# memory released back


# memory safe
proc_1 <- function(x) {
    cofactors = c(mean(x@d), median(x@d), IQR(x@d))
    model = sapply(cofactors, function(cofactor) function(z) z / cofactor)
    x@f = list(model)
    x
}

# init data
mem_used()
#> 47.5 MB
a = new("fb")
a@d = sample(rnorm(1e7))
a@f = list()
mem_used()
#> 128 MB
b = proc_1(a)
mem_used()
#> 128 MB
# memory didn't increase; b@d points to a@d; functions weigh a few KB
rm(a)
mem_used()
#> 128 MB
sapply(1:3, function(i) ls(environment(b@f[[1]][[i]])))
#> [1] "cofactor" "cofactor" "cofactor"
sapply(1:3, function(i) get("cofactor", environment(b@f[[1]][[i]])))
#> [1] -0.0003133312 -0.0002510665  1.3491459433

rm(b)
mem_used()
#> 47.5 MB

```

<sup>Created on 2022-07-08 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>

<details style="margin-bottom:10px;">
<summary>
Session info
</summary>

``` r
sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=French_France.utf8 LC_CTYPE=French_France.utf8
#> [3] LC_MONETARY=French_France.utf8 LC_NUMERIC=C
#> [5] LC_TIME=French_France.utf8
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets methods   base
#>
#> other attached packages:
#> [1] pryr_0.1.5
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8.3     codetools_0.2-18 digest_0.6.29 withr_2.5.0
#>  [5] magrittr_2.0.3   reprex_2.0.1     evaluate_0.15 highr_0.9
#>  [9] stringi_1.7.6    rlang_1.0.3      cli_3.3.0 rstudioapi_0.13
#> [13] fs_1.5.2         lobstr_1.1.2     rmarkdown_2.14 tools_4.2.1
#> [17] stringr_1.4.0    glue_1.6.2       xfun_0.31 yaml_2.3.5
#> [21] fastmap_1.1.0    compiler_4.2.1   htmltools_0.5.2 knitr_1.39
```

</details>
#
I accidentally replied privately to this message.  Here is the reply 
that I intended to send to the list, along with an addition based on 
Samuel's reply to me.
On 08/07/2022 9:50 a.m., Samuel Granjeaud wrote:
> > Dear all,
> >
> > I want to compute processing functions to apply to the data.
> > I apply the functions to the data in a second step.
> > proc_0 increases the memory, proc_1 is safe.
> > reprex below.
> >
> > If this behavior is known, could you tell me a workaround before I try
> > to guess the best one?

When a function is called, it creates an environment that holds the
arguments and all local variables.  If the function returns that
environment, or a value that references it, all the local variables will
still be there.

In your function I believe the anonymous functions you create in `model`
are capturing the environment.  Since those functions are created as part
of the evaluation of proc_0, each of them will have the evaluation
environment attached.
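
A minimal sketch of that capture, using hypothetical names (`make_scaler`, `big`, `k`) that are not from the reprex:

```r
# Hypothetical example: a function that returns a closure.
make_scaler <- function(n) {
  big <- numeric(n)       # large local variable that the closure never uses
  k <- 2                  # the only value the closure actually needs
  function(z) z / k       # created here, so it keeps this call's evaluation environment
}

f <- make_scaler(1e6)

# The evaluation environment of make_scaler() is still reachable through f,
# together with every local variable and argument it held.
captured <- ls(environment(f))
print(captured)
#> [1] "big" "k"   "n"
print(f(10))
#> [1] 5
```

The closure only needs `k`, yet `big` and `n` stay alive for as long as `f` does.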

NEW addition:  In R, every function has an associated environment, which
becomes the parent of the evaluation environment mentioned above.  It is
called "the environment of the function", and can be retrieved from a
function fn using `environment(fn)`.  For top-level functions like proc_0,
environment(proc_0) would be the global environment, but for functions
created within another function, it would be the evaluation environment
active at the time of creation.
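
These relationships can be checked directly. The sketch below uses a hypothetical helper, `outer_fn`, that hands back its own evaluation environment so it can be compared afterwards:

```r
# Hypothetical helper: returns its own evaluation environment together with
# a function created inside it.
outer_fn <- function() {
  eval_env <- environment()      # evaluation environment of this call
  inner_fn <- function() NULL    # created while outer_fn() is running
  list(eval_env = eval_env, inner_fn = inner_fn)
}

res <- outer_fn()

# When this is run at top level, it shows the global environment.
print(environment(outer_fn))

# The function created inside outer_fn() has that call's evaluation
# environment as its environment ...
inner_is_eval <- identical(environment(res$inner_fn), res$eval_env)

# ... and the parent of that evaluation environment is environment(outer_fn).
parent_is_outer <- identical(parent.env(res$eval_env), environment(outer_fn))

print(c(inner_is_eval, parent_is_outer))
```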

Your code has

   sapply(cofactors, function(cofactor) function(z) z / cofactor)

This creates the function with definition

   function(cofactor) function(z) z / cofactor

The environment of that function will be the evaluation environment of 
proc_0.  When that function is called by sapply(), it will create an 
evaluation environment holding cofactor, and that environment will be 
used by the function returned, i.e. the result of

   function(z) z / cofactor

So you'll end up with this chain of environments:

   environment(function(z) z / cofactor) is the evaluation environment
   of function(cofactor) function(z) z / cofactor;

   its parent is the evaluation environment of proc_0, containing dat;

   its parent is environment(proc_0), which is the global environment.
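
The chain can be walked with `parent.env()`. This is a sketch on a cut-down stand-in for proc_0 (the name `proc_demo` and its plain-vector argument are assumptions; it uses lapply() instead of sapply() so the result is certain to be a list):

```r
# Cut-down, hypothetical stand-in for proc_0: no S4 object, just a vector.
proc_demo <- function(values) {
  dat <- values * 2      # local variable that the returned functions never use
  lapply(c(1, 2), function(cofactor) function(z) z / cofactor)
}

fs <- proc_demo(1:5)

env1 <- environment(fs[[1]])   # evaluation env of function(cofactor) ...
env2 <- parent.env(env1)       # evaluation env of proc_demo(), holding dat
env3 <- parent.env(env2)       # environment(proc_demo)

print(ls(env1))
#> [1] "cofactor"
print(ls(env2))
#> [1] "dat"    "values"
print(identical(env3, environment(proc_demo)))
#> [1] TRUE
```

Looking only at `env1`, as the reprex does, hides the fact that `dat` sits one level up in `env2`.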

The global environment isn't captured, but the others are, so you save a 
copy of dat every time you call proc_0.

But none of those functions need access to dat, so there's no need to 
keep it, and after your last use of it in proc_0, just run rm(dat) to 
get rid of it.
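
A sketch of that workaround on simplified, hypothetical versions of the function (`proc_leaky`, `proc_clean` and `keeps_dat` are illustrative names; plain vectors stand in for the S4 object):

```r
# Same structure as proc_0, without the S4 object.
proc_leaky <- function(values) {
  dat <- sample(values)
  cofactors <- c(mean(dat), median(dat), IQR(dat))
  lapply(cofactors, function(cofactor) function(z) z / cofactor)
}

# Identical, except dat is removed after its last use.
proc_clean <- function(values) {
  dat <- sample(values)
  cofactors <- c(mean(dat), median(dat), IQR(dat))
  rm(dat)   # drop the local copy before the closures are returned
  lapply(cofactors, function(cofactor) function(z) z / cofactor)
}

# Does the captured evaluation environment still contain dat?
keeps_dat <- function(fs) {
  exists("dat", envir = parent.env(environment(fs[[1]])), inherits = FALSE)
}

set.seed(1)
leaky <- proc_leaky(rnorm(1000))
clean <- proc_clean(rnorm(1000))

print(keeps_dat(leaky))
#> [1] TRUE
print(keeps_dat(clean))
#> [1] FALSE
```

Note that in this sketch the argument `values` is still held by the captured environment; only the extra copy `dat` is released.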

OLD part again:

By the way, mem_used() isn't a great way to measure memory use, because
it will count things that will be cleaned up in a future garbage
collection.  When I added "rm(dat)" to your function, I saw this:

  > a = new("fb")
  > a@d = sample(rnorm(1e7))
  > a@f = list()
  > mem_used()
363 MB
  > b = proc_0(a)
  > mem_used()
283 MB

i.e. *less* memory was used after b was created, presumably because a gc
happened.

It's better to use object.size() or pryr::object_size() to measure the
size of individual objects.  Neither one is perfect: they use different
rules to decide what to include, and in some cases, memory used in one
object is counted again as part of another.  The way R allocates memory
means there is *no* perfect definition of the size of an object.
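
As one illustration of such a rule, base R's object.size() does not include the environment of a function, so a closure that keeps a large vector alive still looks tiny. A sketch with a hypothetical `make_closure`:

```r
# Hypothetical closure that keeps a large, unused vector alive.
make_closure <- function() {
  dat <- numeric(1e6)    # one million doubles, about 8 MB
  function(z) z
}

f <- make_closure()

# object.size() measures the function itself, not environment(f).
size_fn <- object.size(f)

# Measuring the captured vector directly shows where the memory is.
size_dat <- object.size(environment(f)$dat)

print(format(size_fn, units = "Kb"))
print(format(size_dat, units = "Mb"))
```

So when hunting for this kind of retention, it helps to inspect the objects in the captured environments themselves, not only the functions that point to them.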

Duncan Murdoch


 > > ______________________________________________
 > > R-package-devel at r-project.org mailing list
 > > https://stat.ethz.ch/mailman/listinfo/r-package-devel
#
On 08/07/2022 11:01 a.m., Duncan Murdoch wrote:
... lots deleted
This may be misleading.  The reference to the global environment will be 
there; what I meant is that there's no extra copy of the global 
environment.  If we didn't have the references in the chain above, the 
evaluation environments would all have been garbage collected after 
proc_0 finished, but because we still have references to them, they are 
"captured", so we keep a copy until those references go away when the 
result of the proc_0 call is deleted.

Duncan Murdoch
#
Great answer. I think I missed the point about the parent environment,
although I did my best to pay attention to the environments.

What Duncan removed from my reply is the following, and I share it with you: