I would like to construct a dataframe row-by-row in R. I\'ve done some searching, and all I came up with is the suggestion to create an empty list, keep a list index scalar,
The reason I like Rcpp so much is that I don't always get how R Core thinks, and with Rcpp, more often than not, I don't have to.
Speaking philosophically, you're in a state of sin with regards to the functional paradigm, which tries to ensure that every value appears independent of every other value; changing one value should never cause a visible change in another value, the way you get with pointers sharing representation in C.
The problems arise when functional programming signals the small craft to move out of the way, and the small craft replies "I'm a lighthouse". Making a long series of small changes to a large object which you want to process on in the meantime puts you square into lighthouse territory.
In the C++ STL, push_back()
is a way of life. It doesn't try to be functional, but it does try to accommodate common programming idioms efficiently.
With some cleverness behind the scenes, you can sometimes arrange to have one foot in each world. Snapshot based file systems are a good example (which evolved from concepts such as union mounts, which also ply both sides).
If R Core wanted to do this, underlying vector storage could function like a union mount. One reference to the vector storage might be valid for subscripts 1:N
, while another reference to the same storage is valid for subscripts 1:(N+1)
. There could be reserved storage not yet validly referenced by anything but convenient for a quick push_back()
. You don't violate the functional concept when appending outside the range that any existing reference considers valid.
Eventually appending rows incrementally, you run out of reserved storage. You'll need to create new copies of everything, with the storage multiplied by some increment. The STL implementations I've use tend to multiply storage by 2 when extending allocation. I thought I read in R Internals that there is a memory structure where the storage increments by 20%. Either way, growth operations occur with logarithmic frequency relative to the total number of elements appended. On an amortized basis, this is usually acceptable.
As tricks behind the scenes go, I've seen worse. Every time you push_back()
a new row onto the dataframe, a top level index structure would need to be copied. The new row could append onto shared representation without impacting any old functional values. I don't even think it would complicate the garbage collector much; since I'm not proposing push_front()
all references are prefix references to the front of the allocated vector storage.