optimizing byte-pair encoding

后端 未结 9 1102
广开言路
广开言路 2020-12-30 10:45

Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of

9条回答
  •  离开以前
    2020-12-30 11:45

    You can also optimize the dictionary so that:

    AA1BB2CC3DD4EE5FF6GG7HH8 is a sequential run of 8 token.

    Rewrite that as:

    AA1<255>BBCCDDEEFFGGHH<255> where the <255> tells the program that each of the following byte pairs (up to the next <255>) are sequential and incremented by one. Works great for text files and any where there are at least 4 sequential tokens.

    save 175 bytes on recent test.

提交回复
热议问题