Hi all,
I don't have a solution yet, but here is a bit more detail:
@7f913826d590 14 REALSXP g0c0 [REF(1)] wrapper [srt=-2147483648,no_na=0]
@7f9137500320 14 REALSXP g0c7 [REF(2),ATT] (len=100, tl=0)
0.45384,0.926371,0.838637,-1.71485,-0.719073,...
ATTRIB:
@7f913826dc20 02 LISTSXP g0c0 [REF(1)]
TAG: @7f91378538d0 01 SYMSXP g0c0 [MARK,REF(460)] "data"
@7f9118310000 14 REALSXP g0c7 [REF(2)] (len=1000000, tl=0)
0.66682,0.480576,-1.13229,0.453313,-0.819498,...
attr(x2b, "data") <- "small"
@7f913826d590 14 REALSXP g0c0 [REF(1),ATT] wrapper [srt=-2147483648,no_na=0]
@7f9137500320 14 REALSXP g0c7 [REF(2),ATT] (len=100, tl=0)
0.45384,0.926371,0.838637,-1.71485,-0.719073,...
ATTRIB:
@7f913826dc20 02 LISTSXP g0c0 [REF(1)]
TAG: @7f91378538d0 01 SYMSXP g0c0 [MARK,REF(461)] "data"
@7f9118310000 14 REALSXP g0c7 [REF(2)] (len=1000000, tl=0)
0.66682,0.480576,-1.13229,0.453313,-0.819498,...
ATTRIB:
@7f913826c870 02 LISTSXP g0c0 [REF(1)]
TAG: @7f91378538d0 01 SYMSXP g0c0 [MARK,REF(461)] "data"
@7f9120580850 16 STRSXP g0c1 [REF(3)] (len=1, tl=0)
@7f91205808c0 09 CHARSXP g0c1 [REF(3),gp=0x60] [ASCII] [cached]
"small"
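So we can see that the assignment of attr(x2b, "data") IS doing something,
but it isn't doing the right thing: the wrapper now carries its own "data"
attribute ("small"), while the wrapped vector still holds the original
1,000,000-element one. The original example assigned NULL rather than a
value, which was hiding this.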
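For anyone who wants to reproduce dumps like the ones above, here is a minimal sketch. It assumes they come from R's internal inspector called on x2b from the quoted example below; the exact call is my assumption, and addresses and reference counts will differ between sessions and R versions.

```r
# Rebuild x2b as in the quoted example.
x1b <- rnorm(100)
attr(x1b, "data") <- rnorm(1000000)
x2b <- x1b
attr(x2b, "data") <- NULL

# Assumption: the dumps were produced with the internal inspector.
# It prints the object, any wrapped vector, and their attribute lists.
.Internal(inspect(x2b))

# Assign a non-NULL value and inspect again to compare the two layers.
attr(x2b, "data") <- "small"
.Internal(inspect(x2b))
```

Comparing the two dumps side by side is what shows whether the large attribute is still reachable from x2b.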
I will dig into this further if nobody fixes it before I do, but not until
after useR: I'm preparing multiple talks for it, and it is this coming
week.
Best,
~G
On Fri, Jul 2, 2021 at 9:15 PM Zafer Barutcuoglu <zafer.barutcuoglu at gmail.com> wrote:
Hi all,
Setting names/dimnames (including to NULL) on vectors/matrices of
length >= 64 returns an ALTREP wrapper that internally still contains the
old names/dimnames, and calling base::serialize on the result writes them
out. They are unserialized the same way, with the names/dimnames hidden
inside the ALTREP wrapper, so the problem shows up only as wasted time,
bandwidth, or disk space.
Example:
v1 <- setNames(rnorm(64), paste("element name", 1:64))
v2 <- unname(v1)
names(v2)
# NULL
length(serialize(v1, NULL))
# [1] 2039
length(serialize(v2, NULL))
# [1] 2132
length(serialize(v2[TRUE], NULL))
# [1] 543
con <- rawConnection(raw(), "w")
serialize(v2, con)
v3 <- unserialize(rawConnectionValue(con))
names(v3)
# NULL
length(serialize(v3, NULL))
# [1] 2132
# Similarly for matrices:
m1 <- matrix(rnorm(64), 8, 8, dimnames=list(paste("row name", 1:8),
paste("col name", 1:8)))
m2 <- unname(m1)
dimnames(m2)
# NULL
length(serialize(m1, NULL))
# [1] 918
length(serialize(m2, NULL))
# [1] 1035
length(serialize(m2[TRUE, TRUE], NULL))
# [1] 582
Previously discussed here, too:
https://r.789695.n4.nabble.com/Invisible-names-problem-td4764688.html
This happens with other attributes as well, but less predictably:
x1 <- structure(rnorm(100), data=rnorm(1000000))
x2 <- structure(x1, data=NULL)
length(serialize(x1, NULL))
# [1] 8000952
length(serialize(x2, NULL))
# [1] 924
x1b <- rnorm(100)
attr(x1b, "data") <- rnorm(1000000)
x2b <- x1b
attr(x2b, "data") <- NULL
length(serialize(x1b, NULL))
# [1] 8000863
length(serialize(x2b, NULL))
# [1] 8000956
This is pretty severe: it is very hard to track down why serializing a
small object kills the network when the cause is a large attribute that the
object once carried somewhere in the codebase and that is still secretly
tagging along.
Is there a plan to resolve this? Any suggestions for maybe a C++
workaround until then? Or an alternative performant serialization solution?
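One R-level stopgap suggested by the v2[TRUE] result above is to force a plain copy just before serializing. This is only a minimal sketch: strip_wrapper is a hypothetical helper (not part of base R), it is restricted to atomic vectors without dimensions, and it relies on the subsetting behaviour observed above rather than on any documented guarantee. Note that subsetting keeps names but drops most other attributes.

```r
# Hypothetical helper (not part of base R): force a fresh copy of a plain
# atomic vector so serialize() sees the data rather than an ALTREP wrapper.
# Caveat: `[` keeps names but drops most other attributes.
strip_wrapper <- function(x) {
  stopifnot(is.atomic(x), is.null(dim(x)))
  x[TRUE]
}

v1 <- setNames(rnorm(64), paste("element name", 1:64))
v2 <- unname(v1)
v2_clean <- strip_wrapper(v2)

# Compare the serialized sizes before and after stripping.
length(serialize(v2, NULL))
length(serialize(v2_clean, NULL))
```

For matrices the same idea would need m2[TRUE, TRUE], as in the example above, since single-index subsetting drops the dimensions.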
Best,
--
Zafer