Optimize an array of tribools for space

你说的曾经没有我的故事 提交于 2019-12-10 03:07:06

问题


Let me start with some background:

By "tribool" I understand a variable which can hold one of the following values: true, false or null.

In question Copying array of ints vs pointers to bools , the OP wanted to have an array of tribools (more or less) which would be as small as possible.

With "a bit of" most basic bit-fu I came up a solution which used 2 bits per tribool and allowed to store the OP's array of 64 tribools in 16 bytes, which is OK.

The tribool mechanics I used were simple, like:

  • boolean A means "null or not null",
  • boolean B means "true or false if not null".

But then I thought... An algorithmical definition of a "bit" is:

A bit is the amount of information which specifies which of two equally probable events shall occur.

Clearly a true/false value is 1 bit big. Two true-false values as a whole are 2 bit big.

And what about our conceptual tribool?

My point is: In terms of the size of contained information, a tribool is bigger than 1 bit but smaller than 2 bits.

  • Justification 1: Assume we implement our if boolean as described above. If boolean A is "null", the value of boolean B is redundant and doesn't carry any relevant information.
  • Justification 2: It's impossible to store information from 2 independent boolean values in one tribool, so it has

(None of the above is a formal proof, but I believe that we can agree that about the "size" of the tribool being strictly bigger than 1 bit and strictly smaller than 2.)


My question is:

How to programatically take advantage of the fact that a tribool has less information than 2 bits, and implement in software (c, c++?) an array of N tribools which would have the memory footprint smaller than N/4 bytes for some N?

Yes, I do understand that such an implementation isn't really hardware-friendly and would perform slower than any common solution with redundance (as those presented in the OP's question). Let's just optimize for space, not for efficiency.

Clearly this implementation needs a different representation of a tribool than a pair of bools (which is by itself redundant, as described before). The theory says it's possible to achieve that goal and I like to see an actual implementation. Any ideas?


回答1:


Your intuition is correct, this is certainly possible. This is basically a form of arithmetic coding, or at least a simple instance of it.

The easiest way to think of it is to imagine encoding your array of "tribools" as a number in base 3 - e.g. 0=FALSE, 1=TRUE, 2=NULL. Then the following array:

{TRUE, FALSE, NULL, NULL, FALSE, FALSE, TRUE}

encodes to the number

1022001

which you can then convert to decimal in the normal way:

(1*3^0)+(0*3^1)+(0*3^2)+(2*3^3)+(2*3^4)+(0*3^5)+(1*3^6) = 946

Each tribool takes up ln(3)/ln(2) bits (about 1.58), so using this method you can store 20 tribools in 32 bits - so you can store an N=20 array in 4 bytes (where N/4 is 5).




回答2:


You can theoretically pack X N-state variables in

ln(N^X) / ln M

M-state (or log_M (N^X) in LaTeX-like notation) variables. For storing tri-state variables in binary digits the formula above becomes:

ln(3^N) / ln 2

In an 8-bit byte, for example you could fit 5 tri-state variables.

Unpacking/Modifying those values would be a lot harder and slower as you pack variables more densely. In the example above you would have to recalculate the whole byte in order to change a single tri-state variable.

It should be noted that a byte for 5 tri-state variables is quite space-efficient. The density remains the same per-byte, until you have a pack of 22 bytes, which can fit 111 tri-state values, instead of 110. Handling that kind of packing would a mess, though.

Is any of this worth the extra work in comparison to directly storing 4 tri-state values in a byte?




回答3:


This solution requires you to know up front how many "non-null" values you're going to have (i.e. during compile time, or if you could start counting how many non-nulls there are before making the space available).

You could then encode it the following way:

0 for null 1 for non-null, followed by 1 or 0 for true or false.

This would result in a max of 2 bits per tribool, and just 1 bit if they're all null.




回答4:


@psmears is right, for the case where all 3 values are equally likely. However, if they were not equally likely, or were not independent, if you had a long enough string of them, you could just use your 2-bit or any other coding and run gzip on it. That should compress it down to about the theoretical limit. Like in the limit where all the values were 0, it should come out being not much more than the log of the length of the string.

BTW: We're talking about entropy here. A simple definition in this case is -P(0)logP(0) - P(1)logP(1) - P(null)logP(null). So, for example, if P(0) = P(1) = 1/2, and P(null) = 0, then the entropy is 1 bit. If P(0) = 1/2, P(1) = 1/4, P(null) = 1/4, then the entropy is 1/2 * 1 + 1/4 * 2 + 1/4 * 2 also = 1 bit. If the probabilities are 1022/1024, 1/1024, 1/1024, then the entropy is (almost 1)*(almost 0) + 10/1024 + 10/1024 which is about equal to 20/1024 or about 2 hundredths of a bit! The more certain something is, the less it tells you when it occurs, so the less storage it needs.




回答5:


I like the solution proposed by @psmears, but its drawback is that it's slower that the direct approach. You can use a slightly modified version, that should also be fast:

3**5 == 243, that is almost 256. This means that you can easily squeeze 5 tribool values in a byte. It has the same compression ratio, but because each byte is independent, it can be implemented using LUTs:

unsigned char get_packed_tribool(unsigned char pk, int num)
{ // num = (0..4), pk = (0..242)
    return LUT[num][pk];    // 5*243 bytes of LUTs
};

unsigned char update_packed_tribool(unsigned char old_pk, int num, int new_val)
{ // new_val = 0..2
    return old_pk + (new_val - LUT[num][old_pk])*POW3_LUT[num];
};


来源:https://stackoverflow.com/questions/4416744/optimize-an-array-of-tribools-for-space

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!