Converting float to UInt32 - which expression is more precise

Posted by 邮差的信 on 2019-12-10 17:38:41

Question


I have a float x that should be in the <0,1> range, but it undergoes several numerical operations, so the result may end up slightly outside <0,1>.

I need to convert this result to a uint y using the entire range of UInt32. Of course, I need to clamp x to the <0,1> range and scale it.

But which order of operations is better?

y = (uint)round(min(max(x, 0.0F), 1.0F) * UInt32.MaxValue)

or

y = (uint)round(min(max(x * UInt32.MaxValue, 0.0F), UInt32.MaxValue))

In other words, is it better to scale first and then clamp, or to clamp and then scale? I am not very well versed in IEEE floating-point representation, but I believe the order of computation in the above expressions makes a difference.


Answer 1:


Because the multiplication to get from [0.0f .. 1.0f] to [0 .. UInt32.MaxValue] can itself be approximate, the order of operations that most obviously has the property you desire is multiply, then clamp, then round.

The maximum value to clamp to is the float immediately below 2^32, that is, 4294967040.0f. Although this number is several units below UInt32.MaxValue, allowing any larger value would mean overflowing the conversion to UInt32.
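To see why 4294967040.0f is the right bound, here is a quick C# check of my own (not part of the answer; MathF.BitDecrement requires .NET Core 3.0 or later):

float belowTwoPow32 = MathF.BitDecrement(4294967296.0f);     // the float immediately below 2^32
Console.WriteLine(belowTwoPow32 == 4294967040.0f);           // True
Console.WriteLine((float)UInt32.MaxValue == 4294967296.0f);  // True: UInt32.MaxValue itself rounds up to 2^32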

Either of the lines below should work:

y = (uint)round(min(max(x * 4294967040.0F, 0.0F), 4294967040.0F))

In this first version, you have the option to multiply by UInt32.MaxValue instead. The choice is between having very slightly larger results overall (and thus rounding to 4294967040 a few more values that were close to 1.0f but below it), or only sending to 4294967040 the values 1.0f and above.


You can also clamp to [0.0f .. 1.0f] if you do not multiply by too large a number afterwards, so that there is no risk of making the value larger than the largest float that can be converted:

y = (uint)round(min(max(x, 0.0F), 1.0F) * 4294967040.0F)
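For reference, a compilable C# sketch of this clamp-then-scale variant (the method name ScaleToUInt32 and the use of Math.Round are my own illustrative choices, not from the answer):

static uint ScaleToUInt32(float x)
{
    // Clamp to [0, 1] first, then scale by the largest float below 2^32,
    // so the cast below can never overflow.
    float clamped = Math.Min(Math.Max(x, 0.0f), 1.0f);
    return (uint)Math.Round(clamped * 4294967040.0f);
}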

Suggestion for your comment below, about crafting a conversion that goes up to UInt32.MaxValue:

if (x <= 0.0f) y = 0
else if (x < 0.5f) y = (uint) round (x * 4294967296.0F)
else if (x >= 1.0f) y = UInt32.MaxValue
else y = UInt32.MaxValue - (uint) round ((1.0f - x) * 4294967296.0F)

This computation considered as a function from x to y is increasing (including around 0.5f) and it goes up to UInt32.MaxValue. You can re-order the tests according to what you think will be the most likely distribution of values. In particular, assuming that few values are actually below 0.0f or above 1.0f, you can compare to 0.5f first, and then only compare to the bound that is relevant:

if (x < 0.5f)
{
  if (x <= 0.0f) y = ...
  else y = ...
}
else
{
  if (x >= 1.0f) y = ...
  else y = ...
}
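Put together as a single C# method, the branch ordering above might look like this (a sketch of my own; Math.Round and the method name ToFullRangeUInt32 are illustrative choices, not from the answer):

static uint ToFullRangeUInt32(float x)
{
    // Common case first: 0.0f < x < 1.0f needs only two comparisons.
    if (x < 0.5f)
    {
        if (x <= 0.0f) return 0;
        return (uint)Math.Round(x * 4294967296.0f);          // x * 2^32 stays below 2^31
    }
    else
    {
        if (x >= 1.0f) return UInt32.MaxValue;
        // Mirror the lower half so the function still reaches UInt32.MaxValue;
        // (1.0f - x) * 2^32 is at most 2^31, so the subtraction cannot underflow.
        return UInt32.MaxValue - (uint)Math.Round((1.0f - x) * 4294967296.0f);
    }
}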



Answer 2:


The three essential attributes of correct color format conversion are:

  • black must map to black and white must map to white (meaning 0.0 → 0 and 1.0 → 2^32-1 in this case)
  • the intervals in the source format which map to each value in the destination format must have widths that are as equal as possible.
  • evenly spaced inputs should map to outputs that are as evenly spaced as possible in the destination format.

A corollary of the second point is that color format conversions that use round are almost always incorrect, because the bins that map to the minimum and maximum results are usually too small by half. This isn’t as critical with high precision formats like uint32, but it’s still good to get right.
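To make that concrete, here is a small 8-bit illustration of my own (not from the answer): with round, the bins for the extreme outputs are half the width of the interior ones, while truncating against 256 equal bins avoids the problem.

// Hypothetical 8-bit example; x is assumed to lie in [0.0f, 1.0f].
static byte ByRound(float x)    => (byte)Math.Round(x * 255.0f);            // bins for 0 and 255 are only 1/510 wide
static byte ByTruncate(float x) => (byte)Math.Min((int)(x * 256.0f), 255);  // 256 equal bins of width 1/256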

You mentioned in a comment that your C# code is being translated to OpenCL. OpenCL has by far the nicest set of conversions of any language I’ve encountered (seriously, if you’re designing a compute-oriented language and you don’t copy what OpenCL did here, you’re doing it wrong), which makes this pretty easy:

convert_uint_sat(x * 0x1.0p32f)

However, your question is actually about C#; I’m not a C# programmer, but the approach there should look something like this:

if (x <= 0.0F) y = UInt32.MinValue;
else if (x >= 1.0F) y = UInt32.MaxValue;
else y = (uint)Math.Truncate(x * 4294967296.0F);
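A quick sanity check of why that cast cannot overflow (my own snippet, not from the answer; MathF.BitDecrement requires .NET Core 3.0 or later): the largest float strictly below 1.0f, scaled by 2^32, lands exactly on 4294967040, safely below 2^32.

float largestBelowOne = MathF.BitDecrement(1.0f);                   // 1 - 2^-24
Console.WriteLine(largestBelowOne * 4294967296.0f == 4294967040.0f); // True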



Answer 3:


Given that x might be slightly outside [0,1], the second approach is not as straightforward as the first, because clamping in UInt32 value space is awkward: every number in UInt32 is a valid value. The first approach is also easier to understand: clamp the number to an interval, then scale.

That is:

var y = (UInt32) (Math.Min(Math.Max(x, 0f), 1f) * UInt32.MaxValue);

Also, I tested both with a couple of million values and they give the same result, so it doesn't matter which one you use.




Answer 4:


Single can't provide enough precision to hold the interim result, so you'll need to scale and then clamp, but you can't clamp to UInt32.MaxValue because it can't be represented by a Single. The maximum UInt32 value you can safely clamp to is 4294967167.

That value comes from this code:

        // Walk accurateValue downward from (Single)UInt32.MaxValue (which is 2^32)
        // until converting it to Single and back yields a value below UInt32.MaxValue;
        // the loop ends with accurateValue == 4294967167.
        Single maxUInt32 = (Single)UInt32.MaxValue;
        Double accurateValue = maxUInt32;
        while (true)
        {
            accurateValue -= 1;
            Single temp = (Single)accurateValue;
            Double temp2 = (Double)temp;
            if (temp2 < (Double)UInt32.MaxValue)
                break;
        }

See this test...

        Double val1 = UInt32.MaxValue;
        Double val2 = val1 + 1;

        Double valR = val2 / val1;

        Single sValR = (Single)valR;

        //Straight Scale and Cast
        UInt32 NewValue = (UInt32)(sValR * UInt32.MaxValue);
        //Result = 0;

        //Clamp Then Scale Then Cast
        UInt32 NewValue2 = (UInt32)(Math.Min(sValR, 1.0f) * UInt32.MaxValue);
        //Result = 0;

        //Scale Then Clamp Then Cast
        UInt32 NewValue3 = (UInt32)(Math.Min(sValR * UInt32.MaxValue, UInt32.MaxValue));
        //Result = 0;

        //Using Doubles
        //Straight Scale and Cast
        UInt32 NewValue4 = (UInt32)(valR * UInt32.MaxValue);
        //Result = 0;

        //Clamp Then Scale Then Cast
        UInt32 NewValue5 = (UInt32)(Math.Min(valR, 1.0f) * UInt32.MaxValue);
        //Result = 4294967295;

        //Scale Then Clamp Then Cast
        UInt32 NewValue6 = (UInt32)(Math.Min(valR * UInt32.MaxValue, UInt32.MaxValue));
        //Result = 4294967295;

        //Comparing to 4294967167
        //Straight Scale and Cast
        UInt32 NewValue7 = (UInt32)(sValR * UInt32.MaxValue);
        //Result = 0;

        //Clamp Then Scale Then Cast
        UInt32 NewValue8 = (UInt32)(Math.Min(sValR, 1.0f) * UInt32.MaxValue);
        //Result = 0;

        //Scale Then Clamp Then Cast
        UInt32 NewValue9 = (UInt32)(Math.Min(sValR * UInt32.MaxValue, 4294967167));
        //Result = 4294967040;


Source: https://stackoverflow.com/questions/24360347/converting-float-to-uint32-which-expression-is-more-precise
