optimizing byte-pair encoding

广开言路 2020-12-30 10:45

Noticing that byte-pair encoding (BPE) is sorely lacking from the Large Text Compression Benchmark, I very quickly made a trivial literal implementation of

9 answers
  •  一个人的身影
    2020-12-30 11:34

    I don't believe this can be done in a single pass unless you find a way to predict, given a byte-pair replacement, whether the newly created byte-pair (after the replacement) will itself be a good candidate for replacement.

    Here are my thoughts at first sight; maybe you already do, or have already considered, all of this.

    I would try the following.

    Two adjustable parameters:

    1. The number of occurrences of a byte-pair in a chunk of data before it is worth replacing. (So that the dictionary doesn't grow faster than the chunk shrinks.)
    2. The number of replacements in a pass below which it's probably not worth replacing anymore. (So that the algorithm stops wasting time when there's maybe only 1 or 2 % left to gain.)

    I would do passes, as long as it is still worth compressing one more level (according to parameter 2). During each pass, I would keep a count of byte-pairs as I go.

    I would play with the two parameters a little and see how they influence compression ratio and speed. They should probably change dynamically, according to the length of the chunk to compress (and maybe one or two other things). A rough sketch of the whole pass loop is given below.
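    To make the idea concrete, here is a minimal C sketch of that multi-pass loop. The threshold names (MIN_PAIR_COUNT, MIN_GAIN_PER_PASS), the function bpe_compress and the rule table are all my own illustrative names, and nothing is tuned: each pass counts every adjacent byte pair, replaces the most frequent one with a byte value that does not occur in the chunk, and stops when a pass no longer gains enough.

        /* Illustrative multi-pass BPE sketch (assumed names, not from the question). */
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define MIN_PAIR_COUNT     4   /* parameter 1: occurrences needed to justify a rule      */
        #define MIN_GAIN_PER_PASS 16   /* parameter 2: stop when a pass replaces fewer than this */

        /* One dictionary rule: `sym` stands for the pair (a, b). */
        typedef struct { uint8_t sym, a, b; } bpe_rule;

        /* Compress `buf` of length `len` in place; returns the new length and
         * fills `rules` with up to `max_rules` replacement rules. */
        static size_t bpe_compress(uint8_t *buf, size_t len,
                                   bpe_rule *rules, int max_rules, int *nrules)
        {
            static uint32_t count[65536];   /* flat table: one slot per possible byte pair */
            *nrules = 0;

            for (;;) {
                int used[256] = {0};
                memset(count, 0, sizeof count);
                for (size_t i = 0; i < len; i++) used[buf[i]] = 1;
                for (size_t i = 0; i + 1 < len; i++)
                    count[((uint32_t)buf[i] << 8) | buf[i + 1]]++;

                /* Find the most frequent pair and a byte value unused in the chunk. */
                uint32_t best = 0, best_count = 0;
                for (uint32_t p = 0; p < 65536; p++)
                    if (count[p] > best_count) { best = p; best_count = count[p]; }
                int free_sym = -1;
                for (int s = 0; s < 256; s++) if (!used[s]) { free_sym = s; break; }

                if (best_count < MIN_PAIR_COUNT || free_sym < 0 || *nrules >= max_rules)
                    break;                  /* parameter 1: not worth a new rule */

                /* Replace every occurrence of the pair with the free symbol. */
                uint8_t a = (uint8_t)(best >> 8), b = (uint8_t)(best & 0xFF);
                size_t out = 0, replaced = 0;
                for (size_t i = 0; i < len; ) {
                    if (i + 1 < len && buf[i] == a && buf[i + 1] == b) {
                        buf[out++] = (uint8_t)free_sym; i += 2; replaced++;
                    } else {
                        buf[out++] = buf[i++];
                    }
                }
                len = out;
                rules[(*nrules)++] = (bpe_rule){ (uint8_t)free_sym, a, b };

                if (replaced < MIN_GAIN_PER_PASS)
                    break;                  /* parameter 2: this pass gained too little */
            }
            return len;
        }

        int main(void)
        {
            uint8_t data[] = "abcabcabcabcabcabcabcabcabcabcabcabc";
            bpe_rule rules[64];
            int n;
            size_t newlen = bpe_compress(data, strlen((char *)data), rules, 64, &n);
            printf("compressed to %zu bytes using %d rule(s)\n", newlen, n);
            return 0;
        }

    The two constants are where the dynamic tuning would go: instead of fixed values, derive them from the chunk length (and whatever else turns out to matter) before each pass.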

    Another thing to consider is the data structure used to store the count of each byte-pair during a pass. There is very likely a way to write a custom one that would be faster than a generic data structure; a small sketch follows.
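    One obvious candidate for such a custom structure, since a byte pair fits in 16 bits, is a flat 65536-slot array indexed by (first << 8) | second, with the running maximum kept up to date while counting so that no second scan over the table is needed. The names below (pair_counter, pc_reset, pc_count) are just illustrative:

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        typedef struct {
            uint32_t count[65536];   /* one slot per possible byte pair          */
            uint32_t best_pair;      /* pair with the highest count seen so far  */
            uint32_t best_count;
        } pair_counter;

        static void pc_reset(pair_counter *pc)
        {
            memset(pc, 0, sizeof *pc);
        }

        /* Count every adjacent pair in one sweep, tracking the maximum as we go. */
        static void pc_count(pair_counter *pc, const uint8_t *buf, size_t len)
        {
            for (size_t i = 0; i + 1 < len; i++) {
                uint32_t p = ((uint32_t)buf[i] << 8) | buf[i + 1];
                if (++pc->count[p] > pc->best_count) {
                    pc->best_count = pc->count[p];
                    pc->best_pair  = p;
                }
            }
        }

        static pair_counter pc;   /* 256 KiB, so keep it off the stack */

        int main(void)
        {
            const char *text = "hello hello hello hello";
            pc_reset(&pc);
            pc_count(&pc, (const uint8_t *)text, strlen(text));
            printf("most frequent pair: 0x%04X, count %u\n",
                   (unsigned)pc.best_pair, (unsigned)pc.best_count);
            return 0;
        }

    Whether this beats a generic hash map depends on chunk size: the 256 KiB table has to be cleared between passes, but every update is a single array access with no hashing or probing.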

    Keep us posted if you try something and get interesting results!
