Dear all,
I want to compute processing functions to apply to the data.
I apply the functions to the data in a second step.
proc_0 increases memory use, whereas proc_1 is safe.
A reprex is below.
If this behavior is known, could you tell me a workaround before I try
to guess the best one?
Best,
Samuel
``` r
# for memory tracking
library(pryr)
# a class
setClass(
  "fb",
  slots = list(d = "numeric", f = "list"),
  prototype = list(d = NULL, f = NULL)
)
# memory increased: dat is kept somewhere and linked back to the returned value
proc_0 <- function(x) {
  dat = sample(x@d)
  cofactors = c(mean(dat), median(dat), IQR(dat))
  model = sapply(cofactors, function(cofactor) function(z) z / cofactor)
  x@f = list(model)
  x
}
# init data
mem_used()
#> 47 MB
a = new("fb")
a@d = sample(rnorm(1e7))
a@f = list()
mem_used()
#> 127 MB
# memory increased by 80 MB
# process
b = proc_0(a)
mem_used()
#> 207 MB
# memory increased by 80 MB again
rm(a)
mem_used()
#> 207 MB
# memory didn't decrease
b@d = b@d + 1
mem_used()
#> 287 MB
# memory increased
# b@d was really pointing to a@d before the increment
sapply(1:3, function(i) ls(environment(b@f[[1]][[i]])))
#> [1] "cofactor" "cofactor" "cofactor"
sapply(1:3, function(i) get("cofactor", environment(b@f[[1]][[i]])))
#> [1] -0.0003085559  0.0001107148  1.3485980291
# environments look fine
rm(b)
mem_used()
#> 47.5 MB
# memory released back
# memory safe
proc_1 <- function(x) {
  cofactors = c(mean(x@d), median(x@d), IQR(x@d))
  model = sapply(cofactors, function(cofactor) function(z) z / cofactor)
  x@f = list(model)
  x
}
# init data
mem_used()
#> 47.5 MB
a = new("fb")
a@d = sample(rnorm(1e7))
a@f = list()
mem_used()
#> 128 MB
b = proc_1(a)
mem_used()
#> 128 MB
# memory didn't increase; b@d points to a@d; the functions weigh a few KB
rm(a)
mem_used()
#> 128 MB
sapply(1:3, function(i) ls(environment(b@f[[1]][[i]])))
#> [1] "cofactor" "cofactor" "cofactor"
sapply(1:3, function(i) get("cofactor", environment(b@f[[1]][[i]])))
#> [1] -0.0003133312 -0.0002510665  1.3491459433
rm(b)
mem_used()
#> 47.5 MB
```
<sup>Created on 2022-07-08 by the [reprex
package](https://reprex.tidyverse.org) (v2.0.1)</sup>
<details style="margin-bottom:10px;">
<summary>
Session info
</summary>
``` r
sessionInfo()
#> R version 4.2.1 (2022-06-23 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19044)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=French_France.utf8 LC_CTYPE=French_France.utf8
#> [3] LC_MONETARY=French_France.utf8 LC_NUMERIC=C
#> [5] LC_TIME=French_France.utf8
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> other attached packages:
#> [1] pryr_0.1.5
#>
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.8.3     codetools_0.2-18 digest_0.6.29    withr_2.5.0
#>  [5] magrittr_2.0.3   reprex_2.0.1     evaluate_0.15    highr_0.9
#>  [9] stringi_1.7.6    rlang_1.0.3      cli_3.3.0        rstudioapi_0.13
#> [13] fs_1.5.2         lobstr_1.1.2     rmarkdown_2.14   tools_4.2.1
#> [17] stringr_1.4.0    glue_1.6.2       xfun_0.31        yaml_2.3.5
#> [21] fastmap_1.1.0    compiler_4.2.1   htmltools_0.5.2  knitr_1.39
```
</details>
[R-pkg-devel] Unused data is silently kept in the environment of a function
4 messages · Samuel Granjeaud, Duncan Murdoch
I accidentally replied privately to this message. Here is the reply that I intended to send to the list, along with an addition based on Samuel's reply to me.
On 08/07/2022 9:50 a.m., Samuel Granjeaud wrote:
> > Dear all,
> >
> > I want to compute processing functions to apply to the data.
> > I apply the functions to the data in a second step.
> > proc_0 increases the memory, proc_1 is safe.
> > reprex below.
> >
> > If this behavior is known, could you tell me a workaround before I try
> > to guess the best one?
When a function is called, it creates an environment that holds the
arguments and all local variables. If the function returns that
environment, or a value that references it, all the local variables will
still be there.
In your function, I believe the anonymous functions you create in `model`
are capturing the environment. Since those functions are created as part
of the evaluation of proc_0, each of them will have the evaluation
environment attached.
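A minimal sketch of that effect, using hypothetical names (make_getter, big, small) that are not part of the original post: the closure only ever uses `small`, yet every local variable of the call that created it remains reachable through its environment.

```r
# Hypothetical example; none of these names come from the original reprex.
make_getter <- function() {
  big <- numeric(1e6)   # local variable the closure never uses
  small <- 42           # local variable the closure does use
  function() small      # closure created during the call
}

g <- make_getter()
env <- environment(g)   # evaluation environment of the make_getter() call

ls(env)                 # both locals are still present
length(get("big", envir = env))
g()
```

As long as `g` exists, `big` cannot be garbage collected, even though nothing will ever read it.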
NEW addition: In R, every function has an associated environment, which
becomes the parent of the evaluation environment mentioned above. It is
called "the environment of the function", and can be retrieved from a
function fn using `environment(fn)`. For top-level functions like proc_0,
environment(proc_0) would be the global environment, but for functions
created within another function, it would be the evaluation environment
active at the time of creation.
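These relationships can be checked directly. The sketch below uses hypothetical names (outer_fn, inner_fn) and assumes it is run at top level, so that the enclosing environment of outer_fn is the global environment.

```r
# Hypothetical example illustrating environment(fn) and parent.env().
outer_fn <- function() {
  eval_env <- environment()     # evaluation environment of this call
  inner_fn <- function() NULL   # function created inside outer_fn()
  list(eval_env = eval_env, inner_fn = inner_fn)
}

res <- outer_fn()

# TRUE when outer_fn was defined at top level:
identical(environment(outer_fn), globalenv())

# The nested function's environment is the evaluation environment
# that was active when it was created:
identical(environment(res$inner_fn), res$eval_env)

# The parent of the evaluation environment is the environment of outer_fn:
identical(parent.env(res$eval_env), environment(outer_fn))
```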
Your code has
    sapply(cofactors, function(cofactor) function(z) z / cofactor)
This creates the function with definition
    function(cofactor) function(z) z / cofactor
The environment of that function will be the evaluation environment of
proc_0. When that function is called by sapply(), it will create an
evaluation environment holding cofactor, and that environment will be
used by the function returned, i.e. the result of
    function(z) z / cofactor
So you'll end up with this chain of environments:
    environment(function(z) z / cofactor) is the evaluation environment
    of function(cofactor) function(z) z / cofactor;
    its parent is the evaluation environment of proc_0, containing dat;
    its parent is environment(proc_0), which is the global environment.
The global environment isn't captured, but the others are, so you save a
copy of dat every time you call proc_0.
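The chain can be walked with parent.env(). This is a simplified sketch, not the original code: proc_sketch is a hypothetical stand-in for proc_0 that takes a plain numeric vector instead of the S4 object, and it uses lapply() rather than sapply() so the result is unambiguously a list.

```r
# Hypothetical stand-in for proc_0, without the S4 class.
proc_sketch <- function(d) {
  dat <- sample(d)
  cofactors <- c(mean(dat), median(dat), IQR(dat))
  lapply(cofactors, function(cofactor) function(z) z / cofactor)
}

fs <- proc_sketch(rnorm(1e5))

e1 <- environment(fs[[1]])  # evaluation env of function(cofactor) ...
e2 <- parent.env(e1)        # evaluation env of the proc_sketch() call
e3 <- parent.env(e2)        # environment(proc_sketch)

ls(e1)                      # only "cofactor"
ls(e2)                      # includes "dat"
length(get("dat", envir = e2))
identical(e3, environment(proc_sketch))
```

So listing only the innermost environment, as in the reprex, looks fine; the shuffled copy sits one level up.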
But none of those functions need access to dat, so there's no need to
keep it, and after your last use of it in proc_0, just run rm(dat) to
get rid of it.
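A sketch of that suggestion, again with a hypothetical simplified function (proc_rm) that takes a plain numeric vector and uses lapply() in place of sapply(). Note that in this simplified form the argument `d` itself still lives in the captured environment; only the extra shuffled copy is dropped.

```r
# Hypothetical simplified variant of proc_0 with rm(dat) added.
proc_rm <- function(d) {
  dat <- sample(d)
  cofactors <- c(mean(dat), median(dat), IQR(dat))
  rm(dat)   # last use of dat is on the line above; drop it before returning
  lapply(cofactors, function(cofactor) function(z) z / cofactor)
}

fs <- proc_rm(rnorm(1e5))

# The evaluation environment of proc_rm() is still captured ...
captured <- parent.env(environment(fs[[1]]))
ls(captured)

# ... but it no longer holds dat:
exists("dat", envir = captured, inherits = FALSE)

# The returned functions still work, since they only need cofactor:
fs[[3]](1)
```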
OLD part again:
By the way, mem_used() isn't a great way to measure memory use, because
it will count things that will be cleaned up in a future garbage
collection. When I added "rm(dat)" to your function, I saw this:
> a = new("fb")
> a@d = sample(rnorm(1e7))
> a@f = list()
> mem_used()
363 MB
> b = proc_0(a)
> mem_used()
283 MB
i.e. *less* memory was used after b was created, presumably because a gc
happened.
It's better to use object.size() or pryr::object_size() to measure the
size of individual objects. Neither one is perfect: they use different
rules to decide what to include, and in some cases, memory used in one
object is counted again as part of another. The way R allocates memory
means there is *no* perfect definition of the size of an object.
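One concrete instance of those differing rules, shown with base R only: object.size() does not follow the environment of a function, so a closure that keeps a large vector alive still reports a small size. The names below (make_closure, big) are hypothetical.

```r
# Hypothetical example: a closure whose environment holds about 8 MB.
make_closure <- function() {
  big <- numeric(1e6)   # 1e6 doubles, roughly 8 MB
  function() NULL
}

f <- make_closure()

# Size of the function object alone; its environment is not included:
object.size(f)

# Size of the vector that the closure keeps alive:
object.size(get("big", envir = environment(f)))

# If process-wide totals are compared anyway, forcing a collection first
# removes garbage that has not been reclaimed yet:
invisible(gc())
```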
Duncan Murdoch
On 08/07/2022 11:01 a.m., Duncan Murdoch wrote:
... lots deleted
> So you'll end up with this chain of environments:
>     environment(function(z) z / cofactor) is the evaluation environment
>     of function(cofactor) function(z) z / cofactor;
>     its parent is the evaluation environment of proc_0, containing dat;
>     its parent is environment(proc_0), which is the global environment.
> The global environment isn't captured, but the others are, so you save a
> copy of dat every time you call proc_0.
This may be misleading. The reference to the global environment will be there; what I meant is that there's no extra copy of the global environment. If we didn't have the references in the chain above, the evaluation environments would all have been garbage collected after proc_0 finished, but because we still have references to them, they are "captured", so we keep a copy until those references go away when the result of the proc_0 call is deleted.
Duncan Murdoch
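One way to avoid holding such references at all, offered here only as a sketch and not something proposed in the thread, is to build the closures with a helper defined at top level. The helper make_scaler and the function proc_factory are hypothetical names; the input is a plain numeric vector rather than the S4 object.

```r
# Hypothetical top-level factory: its closures' parent environment is
# environment(make_scaler), not the evaluation environment of the caller.
make_scaler <- function(cofactor) {
  force(cofactor)           # evaluate the argument now
  function(z) z / cofactor
}

proc_factory <- function(d) {
  dat <- sample(d)
  cofactors <- c(mean(dat), median(dat), IQR(dat))
  lapply(cofactors, make_scaler)
}

fs <- proc_factory(rnorm(1e5))

e1 <- environment(fs[[1]])
ls(e1)                                               # only "cofactor"
identical(parent.env(e1), environment(make_scaler))  # chain skips proc_factory()
fs[[2]](2)
```

With nothing referring to the evaluation environment of proc_factory() after it returns, that environment and its `dat` become eligible for garbage collection.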
Great answer. I think I missed the point about the parent environment, although I did my best to take care of the environment. What Duncan removed from my reply is the following, which I share with you:
Great to have feedback from people like you.
Samuel