optimizing byte-pair encoding


广开言路 2020-12-30 10:45

Noticing that byte-pair encoding (BPE) is sorely lacking from the large text compression benchmark, I very quickly made a trivial literal implementation of

9 answers

离开以前 (OP)

2020-12-30 11:45

You can also optimize the dictionary so that:

AA1BB2CC3DD4EE5FF6GG7HH8 is a sequential run of 8 tokens.

Rewrite that as:

AA1<255>BBCCDDEEFFGGHH<255> where the <255> marker tells the program that the byte pairs that follow (up to the next <255>) are sequential, each token incremented by one from the previous one. This works well for text files and anywhere there are at least 4 sequential tokens.

It saved 175 bytes on a recent test.
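
As a rough illustration of the run trick above, here is a minimal Python sketch. The function names (`pack_dictionary`, `unpack_dictionary`), the `(pair, token)` entry layout, and the `min_run` parameter are all hypothetical and not taken from the original program. It also assumes each token fits in one byte and that byte 255 never appears as the first byte of a pair, since otherwise the marker would be ambiguous.

```python
MARKER = 0xFF  # assumption: reserved, never the first byte of a pair


def pack_dictionary(entries, min_run=4):
    """Serialize (pair, token) entries, collapsing runs of consecutive tokens.

    entries: list of (2-byte bytes object, int token in 0..255).
    """
    out = bytearray()
    i = 0
    while i < len(entries):
        # Find how far the tokens keep increasing by exactly one.
        j = i + 1
        while j < len(entries) and entries[j][1] == entries[j - 1][1] + 1:
            j += 1
        pair, token = entries[i]
        out += pair + bytes([token])  # the first entry is always explicit
        if j - i >= min_run:
            # Bare pairs between two markers; their tokens are implied.
            out.append(MARKER)
            for next_pair, _ in entries[i + 1:j]:
                out += next_pair
            out.append(MARKER)
        else:
            j = i + 1  # run too short to pay off: emit one entry, move on
        i = j
    return bytes(out)


def unpack_dictionary(data):
    """Inverse of pack_dictionary under the same assumptions."""
    entries = []
    i = 0
    while i < len(data):
        token = data[i + 2]
        entries.append((data[i:i + 2], token))
        i += 3
        if i < len(data) and data[i] == MARKER:
            i += 1  # skip the opening marker
            while data[i] != MARKER:
                token += 1  # each following pair gets the next token
                entries.append((data[i:i + 2], token))
                i += 2
            i += 1  # skip the closing marker
    return entries
```

With this layout a run of n entries costs 2n + 3 bytes instead of 3n, so it saves n - 3 bytes and only pays off from 4 sequential tokens upward. The 8-token example shrinks from 24 bytes to 19.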
