I've done work optimizing an LZF compression implementation, and some of the same principles I used to improve performance apply here.
To speed up byte-pair encoding:
- Limit the block size to less than 64 KB (8-16 kB will probably be optimal). This makes it very likely that some byte values go unused in the block, leaving codes free for pair replacements, and it allows you to hold all intermediate processing info in RAM.
- Use a hashtable, or a simple lookup table indexed by the pair value as a short integer (more RAM, but faster), to hold the counts for the byte pairs. There are 65,536 possible 2-byte pairs, and with a max block size of 64k each count fits in 16 bits, so the whole count table is only 128k (a sketch of such a table follows this list).
- Allocate and reuse data structures capable of holding a full compression block, the replacement table, the byte-pair counts, and the output bytes in memory (a second sketch after this list shows one possible layout). This sounds wasteful of RAM, but when you consider how small your block size is, it's worth it: your entire working set should sit in CPU L2 or (worst case) L3 cache. This gives a BIG speed boost.
- Do one fast pass over the data to collect counts, THEN worry about creating your replacement table.
- Pack bytes into integers or short ints whenever possible (applicable mostly to C/C++). A single entry in the counting table can be represented by one 32-bit integer: a 16-bit count plus the 2-byte pair.
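
As a rough illustration of the count table, the single counting pass, and the packed entries described above, here is a minimal C sketch. It assumes the 64 KB block limit from the first bullet; the function names (`count_pairs`, `pack_entry`) are just placeholders, not from any particular implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_PAIRS (256 * 256)   /* 65,536 possible 2-byte pairs */

/* One fast pass over the block: bump a 16-bit counter per byte pair.
 * The table is 65,536 entries x 2 bytes = 128 KB, indexed directly by
 * the pair packed into 16 bits, so no hashing is needed. With blocks
 * capped at 64 KB, a count can never overflow 16 bits.                */
static void count_pairs(const uint8_t *block, size_t len,
                        uint16_t counts[NUM_PAIRS])
{
    memset(counts, 0, NUM_PAIRS * sizeof(uint16_t));
    for (size_t i = 0; i + 1 < len; i++) {
        uint16_t pair = (uint16_t)((block[i] << 8) | block[i + 1]);
        counts[pair]++;
    }
}

/* Packing a table entry into a single 32-bit integer, as suggested in
 * the last bullet: high 16 bits = byte pair, low 16 bits = count.
 * Handy when sorting candidate pairs by frequency.                    */
static inline uint32_t pack_entry(uint16_t pair, uint16_t count)
{
    return ((uint32_t)pair << 16) | count;
}
```

Once the counts are collected you can pick the most frequent pairs and only then build the replacement table, which is the "count first, replace later" ordering from the list.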
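And here is a sketch of a reusable per-block working set, in the spirit of the "allocate and reuse" bullet. The struct layout and names (`bpe_ctx`, the output and replacement array sizes) are assumptions for illustration, not a prescribed format.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_BLOCK (64 * 1024)   /* block size limit from the first bullet */
#define NUM_PAIRS (256 * 256)

/* Hypothetical per-block working set: allocated once and reused for every
 * block, so nothing is allocated or freed in the hot path. At roughly a
 * quarter of a megabyte it has a good chance of staying in L2/L3 cache.  */
typedef struct {
    uint8_t  block[MAX_BLOCK];          /* input block                         */
    uint8_t  output[MAX_BLOCK + 1024];  /* encoded data plus replacement table */
    uint16_t counts[NUM_PAIRS];         /* byte-pair counts (128 KB)           */
    uint8_t  replacement[256][2];       /* replacement byte -> original pair   */
    int      num_replacements;
} bpe_ctx;

/* Allocate once up front; feed every block through the same context. */
bpe_ctx *bpe_ctx_create(void)
{
    return calloc(1, sizeof(bpe_ctx));
}
```

The caller would read each block into `ctx->block`, run the counting pass, build the replacement table, and encode into `ctx->output`, freeing the context only once at the end of the whole stream.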