Why aren't FingerTrees used enough to have a stable implementation?

☆樱花仙子☆ 提交于 2019-12-04 10:13:55

问题


A while ago, I ran across an article on FingerTrees (See Also an accompanying Stack Overflow Question) and filed the idea away. I have finally found a reason to make use of them.

My problem is that the Data.FingerTree package seems to have a little bit rot around the edges. Moreover, Data.Sequence in the Containers package which makes use of the data structure re-implements a (possibly better) version, but doesn't export it.

As theoretically useful as this structure seems to be, it doesn't seem to get a lot of actual use or attention. Have people found that FingerTrees are not useful as a practical matter, or is this a case not enough attention?


further explanation:

I'm interested in building a data structure holding text that has good concatenation properties. Think about building an HTML document from assorted fragments. Most pre-built solutions use bytestrings, but I really want something that deals with Unicode text properly. My plan at the moment is to layer Data.Text fragments into a FingerTree.

I would also like to borrow the trick from Data.Vector of taking slices without copying using (offset,length) manipulation. Data.Text.Text has this built in to the data type, but only uses it for efficient uncons and unsnoc opperations. In FingerTree this information could very easily becomes the v or annotation of the tree.


回答1:


To answer your question about finger trees in particular, I think the problem is that they have relatively high constant costs compared to arrays, and are more complex than other ways of achieving efficient concatenation. A Builder has a more efficient interface for just appending chunks, and they're usually readily available (see the links in @informatikr's answer). Suppose that Data.Text.Lazy is implemented with a linked list of chunks, and you're creating a Data.Text.Lazy from a builder. Unless you have a lot of chunks (probably more than 50), or are accessing data near the end of the list repeatedly, the high constant cost of a finger tree probably isn't worth it.

The Data.Sequence implementation is specialized for performance reasons, and isn't as general as the full interface provided by the fingertree package. That's why it isn't exported; it's not really possible to use it for anything other than a Sequence.

I also suspect that many programmers are at a loss as to how to actually use the monoidal annotation, as it's behind a fairly significant abstraction barrier. So many people wouldn't use it because they don't see how it can be useful compared to other data types.

I didn't really get it until I read Chung-chieh Shan's blog series on word numbers (part2, part3, part4). That's proof that the idea can definitely be used in practical code.

In your case, if you need to both inspect partial results and have efficient appends, using a fingertree may be better than a builder. Depending on the builder's implementation, you may end up doing a lot of repeated work as you convert to Text, add more stuff to the builder, convert to Text again, etc. It would depend on your usage pattern though.

You might be interested in my splaytree package, which provides splay trees with monoidal annotations, and several different structures build upon them. Other than the splay tree itself, the Set and RangeSet modules have more-or-less complete API's, the Sequence module is mostly a skeleton I used for testing. It's not a "batteries included" solution to what you're looking for (again, @informatikr's answer provides those), but if you want to experiment with monoidal annotations it may be more useful than Data.FingerTree. Be aware that a splay tree can get unbalanced if you traverse all the elements in sequence (or continually snoc onto the end, or similar), but if appends and lookups are interleaved performance can be excellent.




回答2:


In addition to John Lato's answer, I'll add some specific details about the performance of finger trees, since I spent some time looking at that in the past.

The broad summary is:

  • Data.Sequence has great constant factors and asymptotics: it is almost as fast as [] when accessing the front of the list (where both data structures have O(1) asymptotics), and much faster elsewhere in the list (where Data.Sequence's logarithmic asymptotics trounce []'s linear asymptotics).

  • Data.FingerTree has the same asymptotics as Data.Sequence, but is about an order of magnitude slower.

Just like lists, finger trees have high per-element memory overheads, so they should be combined with chunking for better memory and cache use. Indeed, a few packages do this (yi, trifecta, rope). If Data.FingerTree could be brought close to Data.Sequence in performance, I would hope to see a Data.Text.Sequence type, which implemented a finger tree of Data.Text values. Such a type would lose the streaming behaviour of Data.Text.Lazy, but benefit from improved random access and concatenation performance. (Similarly, I would want to see Data.ByteString.Sequence and Data.Vector.Sequence.)

The obstacle to implementing these now is that no efficient and generic implementation of finger trees exists (see below where I discuss this further). To produce efficient implementations of Data.Text.Sequence one would have to completely reimplement finger trees, specialised to Text - just as Data.Text.Lazy completely reimplements lists, specialised to Text. Unfortunately, finger trees are much more complex than lists (especially concatenation!), so this is a considerable amount of work.

So as I see it the answer is:

  • specialised finger trees are great, but a lot of work to implement
  • chunked finger trees (e.g. Data.Text.Sequence) would be great, but at present the poor performance of Data.FingerTree means they are not a viable alternative to chunked lists in the common case
  • builders and chunked lists achieve many of the benefits of chunked finger trees, and so they suffice for the common case
  • in the uncommon case where builders and chunked lists don't suffice, we grit our teeth and put up with the poor constant factors of chunked finger trees (e.g. in yi and trifecta).

Obstacles to an efficient and generic finger tree

Much of the performance gap between Data.Sequence and Data.FingerTree is due to two optimisations in Data.Sequence:

  • The measure type is specialised to Int, so measure manipulations will compile down to efficient integer arithmetic rather

  • The measure type is unpacked into the Deep constructor, which saves pointer dereferences in the inner loops of the tree operations.

It is possible to apply these optimisations in the general case of Data.FingerTree by using data families for generic unpacking and by exploiting GHC's inliner and specialiser - see my fingertree-unboxed package, which brings generic finger tree performance almost up to that of Data.Sequence. Unfortunately, these techniques have some significant problems:

  • data families for generic unpacking is unpleasant for the user, because they have to define lots of instances. There is no clear solution to this problem.

  • finger trees use polymorphic recursion, which GHC's specialiser doesn't handle well (1, 2). This means that, to get sufficient specialisation on the measure type, we need lots of INLINE pragmas, which causes GHC to generate huge amounts of code.

Due to these problems, I never released the package on Hackage.




回答3:


Ignoring your Finger Tree question and only responding to your further explanation: did you look into Data.Text.Lazy.Builder or, specifically for building HTML, blaze-html?

Both allow fast concatenation. For slicing, if that is important for solving your problem, they might not have ideal performance.



来源:https://stackoverflow.com/questions/11151732/why-arent-fingertrees-used-enough-to-have-a-stable-implementation

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!