precision | 易学教程

Decimal places in SQL

Question: I am calculating percentages. One example comes down to 38589/38400, so the percentage is 100*(38589/38400), which equals 100.4921875, but the result shows up as 100. How can I get it displayed with x decimal places? Similarly, will the same approach work if I'd like 2 to be displayed as 2.000000? Thanks!

Answer 1: You can cast the result to a specific data type, which fixes the type and rounds to a given precision. The division itself must not be done on integers, otherwise 38589/38400 is truncated to 1 before the cast; a decimal literal such as 100.0 avoids that:

    select cast(100.0*38589/38400 as decimal(10,4))

FYI
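The same two effects (integer division dropping the fraction, and a fixed number of displayed decimals) can be illustrated outside SQL. The Python snippet below is only a sketch: the helper name `percent` is made up here, and the `decimal` module stands in for a `decimal(10,4)`-style column type.

```python
from decimal import Decimal

def percent(numerator, denominator, places):
    """Return 100 * numerator / denominator rounded to `places` decimals."""
    quantum = Decimal(10) ** -places          # places=4 -> Decimal('0.0001')
    value = Decimal(100) * Decimal(numerator) / Decimal(denominator)
    return value.quantize(quantum)

# Integer (floor) division loses the fraction, as in the question.
truncated = 100 * (38589 // 38400)

print(truncated)                  # 100
print(percent(38589, 38400, 4))   # 100.4922
print(percent(2, 100, 6))         # 2.000000
```

`quantize` both rounds and pads with trailing zeros, which is why a value of 2 prints as 2.000000.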

Why does AVX512-IFMA support only 52-bit ints?

Question: From the value we can infer that it uses the same components as the double-precision floating-point hardware. But double has 53 bits of mantissa, so why is AVX512-IFMA limited to 52 bits?

Answer 1: IEEE-754 double precision only has 52 explicitly stored mantissa bits; the 53rd bit (the most significant one) is an implicit 1 for normalized values.

Source: https://stackoverflow.com/questions/28862012/why-does-avx512-ifma-support-only-52-bit-ints
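The split between 52 stored bits and one implicit bit can be checked by looking at the raw encoding of a double. The following Python sketch uses the standard `struct` module; the helper name `stored_mantissa_bits` is made up for this illustration.

```python
import struct
import sys

def stored_mantissa_bits(x):
    """Return the 52 explicitly stored fraction bits of a double as an int."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits & ((1 << 52) - 1)

# 1.0 has an all-zero stored fraction; its leading 1 is implicit.
print(stored_mantissa_bits(1.0))               # 0
# 1.5 is 1.1 in binary: only the bit after the point is stored.
print(stored_mantissa_bits(1.5) == 1 << 51)    # True
# Counting the implicit bit, the effective precision is 53 bits.
print(sys.float_info.mant_dig)                 # 53
```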

output to stream float numbers with precision

Question: I have a problem with the precision of floating-point output:

    #include <iomanip>
    #include <iostream>

    int main(void) {
        double b = 106.829599;
        float a = b;
        std::cerr << std::setprecision(6) << "a = " << a << "; b = " << b << std::endl;
        std::cerr << std::setprecision(7) << "a = " << a << "; b = " << b << std::endl;
    }

The result is:

    a = 106.83; b = 106.83
    a = 106.8296; b = 106.8296

So my question is why the numbers in the first line are so short (I was expecting to see 106.829). This is with gcc 4.1.2; I also ran a test at LWS.

Answer 1: Actually, 106.829599 rounded to 6 digits
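The output in the question can be reproduced with any formatter that counts significant digits rather than decimal places. The Python sketch below is an analogue, not the C++ stream code itself: the `g` format type behaves like the default (non-fixed) stream notation, and the helper `to_float32` is a made-up name that emulates the `float` variable via `struct`.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

b = 106.829599
a = to_float32(b)

# 'g' counts significant digits and drops trailing zeros:
# 106.829599 -> 106.830 (6 significant digits) -> 106.83
print(format(a, ".6g"), format(b, ".6g"))   # 106.83 106.83
print(format(a, ".7g"), format(b, ".7g"))   # 106.8296 106.8296
# Fixed notation counts digits after the decimal point instead.
print(format(b, ".3f"))                     # 106.830
```

In C++ the fixed behaviour corresponds to inserting `std::fixed` before `std::setprecision`.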

tf.round() to a specified precision

Question: tf.round(x) rounds the values of x to integer values. Is there any way to round to, say, 3 decimal places instead?

Answer 1: You can do it easily like this, as long as you don't risk reaching numbers that are too large:

    import tensorflow as tf

    def my_tf_round(x, decimals=0):
        # Scale up, round to the nearest integer, then scale back down.
        multiplier = tf.constant(10**decimals, dtype=x.dtype)
        return tf.round(x * multiplier) / multiplier

Note: the value of x * multiplier should not exceed 2^32, so this method should not be used to round very large numbers.

Source: https://stackoverflow.com/questions
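The scale, round, unscale idea does not depend on TensorFlow. As a plain-Python analogue (the name `round_to` is made up here, and Python's built-in `round` is used in place of `tf.round`), it might look like this:

```python
def round_to(x, decimals=0):
    """Round x to `decimals` places by scaling, rounding, and unscaling."""
    multiplier = 10 ** decimals
    return round(x * multiplier) / multiplier

print(round_to(3.14159, 3))   # 3.142
print(round_to(2.5))          # 2.0 (Python's round sends ties to the even integer)
```

The result is still a binary floating-point number, so it is the nearest representable value to the rounded decimal, not an exact decimal.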

Better approximation of e with Java

Question: I would like to approximate the value of e to any desired precision. What is the best way to do this? The most I've been able to get is e = 2.7182818284590455. Any examples of a modification of the following code would be appreciated.

    public static long fact(int x) {
        long prod = 1;
        for (int i = 1; i <= x; i++)
            prod = prod * i;
        return prod;
    } // fact

    public static void main(String[] args) {
        double e = 1;
        for (int i = 1; i < 50; i++)
            e = e + 1 / (double) (fact(i));
        System.out.print("e = " + e);
    } // main
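Two things limit the code in the question: a `double` carries only about 16 significant decimal digits, and a `long` factorial overflows past 20!. One possible way around both, sketched here in Python with the standard `decimal` module rather than in Java, is to sum the series 1/k! in arbitrary-precision decimal arithmetic and update each term from the previous one instead of recomputing factorials. The function name `approx_e` and the choice of 10 guard digits are assumptions made for this illustration.

```python
from decimal import Decimal, localcontext

def approx_e(digits, guard=10):
    """Approximate e to `digits` significant digits via the series sum 1/k!."""
    with localcontext() as ctx:
        ctx.prec = digits + guard              # work with extra guard digits
        threshold = Decimal(10) ** -(digits + guard)
        total = Decimal(1)                     # the k = 0 term
        term = Decimal(1)
        k = 1
        while term > threshold:
            term /= k                          # term is now 1/k!
            total += term
            k += 1
        ctx.prec = digits
        return +total                          # unary plus rounds to ctx.prec

print(approx_e(30))   # 2.71828182845904523536028747135
```

In Java the analogous building block would be `java.math.BigDecimal` with a `MathContext`.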

Can float be round tripped via double without losing precision?

Question: If I have a C# float, can I convert it to double without losing any precision? If that double were converted back to float, would it have exactly the same value?

Answer 1: Yes. IEEE 754 floating point (which is what C# must use) guarantees this. Converting a float to a double preserves exactly the same value, and converting that double back to a float recovers exactly the original float. The set of doubles is a superset of the set of floats. Note that this also applies to NaN, +Infinity, and -Infinity.
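The round-trip property can be spot-checked on raw bit patterns. The sketch below uses Python rather than C#: Python's `float` is a double, and the `struct` module converts to and from the 32-bit encoding. The helper name and the sample list are made up for this illustration, and NaN payload bits are deliberately left out of the check.

```python
import struct

def float32_bits_roundtrip(bits):
    """Decode a float32 bit pattern to a double, re-encode it, return the bits."""
    as_double = struct.unpack("<f", struct.pack("<I", bits))[0]   # float -> double
    return struct.unpack("<I", struct.pack("<f", as_double))[0]   # double -> float

samples = [
    0x00000000,  # +0.0
    0x80000000,  # -0.0
    0x00000001,  # smallest positive subnormal
    0x3F800000,  # 1.0
    0x3DCCCCCD,  # nearest float32 to 0.1
    0x7F7FFFFF,  # largest finite float32
    0x7F800000,  # +Infinity
    0xFF800000,  # -Infinity
]
print(all(float32_bits_roundtrip(b) == b for b in samples))   # True
```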

Typecasting std::complex<double> to __complex128

Question: I'm trying to use the quadmath library in GCC. I have a complex double value I'd like to typecast into the corresponding quad-precision complex number, __complex128. The following is a minimal (non-)working example:

    #include <quadmath.h>
    #include <complex>
    #include <stdio.h>

    using namespace std::complex_literals;

    int main() {
        std::complex<double> x = 1 + 2i;
        std::printf("x = %5.5g + %5.5g\n", x.real(), x.imag());
        __complex128 y = 2 + 2i;
        y = x;
        return 0;
    }

When I try compiling this code with g+

Compare a 32 bit float and a 32 bit integer without casting to double, when either value could be too large to fit the other type exactly

Question: I have a 32-bit floating-point number f (known to be positive) that I need to convert to a 32-bit unsigned integer. Its magnitude might be too large to fit. Furthermore, there is downstream computation that requires some headroom. I can compute the maximum acceptable value m as a 32-bit integer. How do I efficiently determine, in C++11 on a constrained 32-bit machine (ARM M4F), whether f <= m holds mathematically? Note that the types of the two values don't match. The following three approaches each have

How to use “%f” to populate a double value into a string with the right precision

Question: I am trying to populate a string with a double value using sprintf like this: sprintf(S, "%f", val); But the output is cut off at six decimal places. I need about 10 decimal places. How can that be achieved?

Answer 1: The format is %[width].[precision]f. The width is the minimum total field width and includes the decimal point. %8.2f means at least 8 characters wide: up to 5 characters before the point, 1 for the point, and 2 digits after it (5 + 1 + 2 = 8).

Answer 2: What you want is a precision modifier: sprintf(S, "%.10f", val); man
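Python's `%` string formatting follows the same width and precision rules as the C printf family for these conversions, so it can serve as a quick illustration of both answers. The value of `val` below is just an example.

```python
val = 3.141592653589793

print("%f" % val)              # 3.141593 (default precision is 6)
print("%.10f" % val)           # 3.1415926536
print("%8.2f" % val)           # '    3.14', padded on the left to 8 characters
print("%8.2f" % 1234567.891)   # 1234567.89 (the width is only a minimum)
```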

Interpreting a 32bit unsigned long as Single Precision IEEE-754 Float in C

Question: I am using the XC32 compiler from Microchip, which is based on the standard C compiler. I am reading a 32-bit value from a device on an RS485 network and storing it in an unsigned long that I have typedef'ed as DWORD, i.e. typedef unsigned long DWORD; As it stands, when I typecast this value to a float, the value I get is basically the floating-point version of its integer representation and not the proper IEEE-754 interpreted float, i.e. DWORD dword_value = readValueOnRS485(); float temp =
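The difference between converting the integer value and reinterpreting its bits can be shown in a few lines. The sketch below is in Python rather than C and uses the `struct` module; the function name `bits_to_float` is made up, and it assumes the 32-bit word has already been assembled in the correct byte order. In C the same reinterpretation is commonly done by copying the bytes with `memcpy` or through a union.

```python
import struct

def bits_to_float(dword_value):
    """Reinterpret a 32-bit unsigned integer as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<I", dword_value))[0]

print(bits_to_float(0x3F800000))   # 1.0
print(bits_to_float(0xC0000000))   # -2.0
# A plain cast converts the integer's value instead of its bit pattern.
print(float(0x3F800000))           # 1065353216.0
```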