Object size for characters in R - How does R's global string pool work?

Submitted by 核能气质少年 on 2019-12-04 19:27:11


I am reading Hadley's Advanced R Programming, and when it discusses the memory size of character vectors, it says this:

R has a global string pool. This means that each unique string is only stored in one place, and therefore character vectors take up less memory than you might expect.

The example the book gives is this:

library(pryr)  # provides object_size()
object_size("banana")
#> 96 B
object_size(rep("banana", 10))
#> 216 B

One of the exercises in this section is to compare these two character vectors:

vec <- lapply(0:50, function(i) c("ba", rep("na", i)))
str <- lapply(vec, paste0, collapse = "")

object_size(vec)
#> 13.4 kB
object_size(str)
#> 8.74 kB

Now, since the passage states that R has a global string pool, and since vec is composed mainly of repetitions of two strings ("ba" and "na"), I would intuitively expect vec to be smaller than str.

So my question is: how could you most accurately estimate the size of those vectors beforehand?


The key difference comes from the pointers in vec: each of the short scalar strings (CHARSXPs) has to be pointed to from the corresponding string vector (STRSXP). There are 1326 such string pointers inside vec (1 + 2 + ... + 51), but only 51 in str, and a pointer is probably 8 bytes on your platform. The pool only covers scalar strings (it is also known as the CHARSXP cache), so it saves the string data but not the pointers to it. Another non-obvious factor is internal fragmentation: on my system, for example, a scalar string takes the same amount of memory whether it has zero or up to 7 characters, takes more only once it reaches 8 characters, and so on. See the repeated sizes in the following:

unlist(sapply(str, object.size))
 [1]  96  96  96 104 104 104 104 120 120 120 120 120 120 120 120 136 136 136 136
[20] 136 136 136 136 152 152 152 152 152 152 152 152 216 216 216 216 216 216 216
[39] 216 216 216 216 216 216 216 216 216 216 216 216 216
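To check where these size steps fall on another system, something like the following probe might help. It is only a sketch: the names n_chars and sizes are made up for illustration, and the numbers it reports depend on the R version and platform, so none are shown here.

```r
# Illustrative probe (names are hypothetical): reported size of a
# length-1 character vector holding a string of 0 to 16 characters.
n_chars <- 0:16
sizes <- vapply(
  n_chars,
  function(n) as.numeric(object.size(strrep("a", n))),
  numeric(1)
)
names(sizes) <- n_chars
sizes

# String lengths at which the reported size increases,
# i.e. where a new allocation size class starts on this build.
n_chars[-1][diff(sizes) != 0]
```

Runs of equal values in sizes correspond to the internal fragmentation described above: strings of several different lengths are all reported at the same size.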

These are, however, implementation details of R's memory manager that could change, and one should not depend on them in any way in user programs. With another object layout or memory manager, str could use more space than vec.
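With that caveat in mind, the pointer counts mentioned above can be reproduced with a few lines of base R. This is a rough sketch of the pointer part of the estimate only, not a full size model: it ignores vector headers and allocation rounding, and the variable names are introduced here for illustration.

```r
vec <- lapply(0:50, function(i) c("ba", rep("na", i)))
str <- lapply(vec, paste0, collapse = "")

# Number of CHARSXP pointers stored in the inner character vectors.
n_ptr_vec <- sum(lengths(vec))  # 1 + 2 + ... + 51
n_ptr_str <- sum(lengths(str))  # one string per list element

# Pointer width on this build (assumed to be 8 bytes on 64-bit platforms).
ptr_bytes <- .Machine$sizeof.pointer

# Bytes spent on string pointers alone, before headers and rounding.
n_ptr_vec * ptr_bytes
n_ptr_str * ptr_bytes

# Distinct strings that the pool actually has to store.
n_unique_vec <- length(unique(unlist(vec)))  # only "ba" and "na"
n_unique_str <- length(unique(unlist(str)))  # every string is different
```

So vec stores far fewer distinct strings than str, but pays for many more pointers to them, which is why the pool alone does not make vec the smaller object here.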