I want to design a synthesizable module in Verilog which will take only one cycle in calculating square root of given input of 32 bit.
[Edit1] repaired code
Recently found the results where off even if tests determine all was OK so I dig deeper and found out that I had a silly bug in my equation and due to name conflicts with my pgm environment the tests got false positives so I overlooked it before. Now it work in all cases as it should.
The best thing I can think of (except approximation or large LUT) is binary search without multiplication, here C++ code:
//---------------------------------------------------------------------------
WORD u32_sqrt(DWORD xx) // 16 T
{
DWORD x,m,a0,a1,i;
const DWORD lut[16]=
{
// m*m
0x40000000,
0x10000000,
0x04000000,
0x01000000,
0x00400000,
0x00100000,
0x00040000,
0x00010000,
0x00004000,
0x00001000,
0x00000400,
0x00000100,
0x00000040,
0x00000010,
0x00000004,
0x00000001,
};
for (x=0,a0=0,m=0x8000,i=0;m;m>>=1,i++)
{
a1=a0+lut[i]+(x<<(16-i));
if (a1<=xx) { a0=a1; x|=m; }
}
return x;
}
//---------------------------------------------------------------------------
Standard binary search sqrt(xx)
is setting bits of x
from MSB to LSB so that result of x*x <= xx
. Luckily we can avoid the multiplication by simply rewrite the thing as incrementing multiplicant... in each iteration the older x*x
result can be used like this:
x1 = x0+m
x1*x1 = (x0+m)*(x0+m) = (x0*x0) + (2*m*x0) + (m*m)
Where x0
is value of x
from last iteration and x1
is actual value. The m
is weight of actual processed bit. The (2*m)
and (m*m)
are constant and can be used as LUT and bit-shift so no need to multiply. Only addition is needed. Sadly the iteration is bound to sequential computation forbid paralelisation so the result is 16T
at best.
In the code a0
represents last x*x
and a1
represents actual iterated x*x
As you can see the sqrt
is done in 16 x (BitShiftLeft,BitShiftRight,OR,Plus,Compare)
where the bit shift and LUT can be hardwired.
If you got super fast gates for this in comparison to the rest you can multiply the input clock by 16
and use that as internal timing for SQRT module. Something similar to the old days when there was MC clock as Division of source CPU clock in old Intel CPU/MCUs ... This way you can get 1T
timing (or multiple of it depends on the multiplication ratio).
I got the code here it is
module sqrt(
input[31:0]a,
output[15:0]out
);
reg [31:0]temp;
reg[14:0]x;
always@(a)
begin
if(a<257)x=4;
if(a>256 && a<65537)x=80;
if(a>65536 && a<16777217)x=1000;
if(a>16777216 && a<=4294967295)x=20000;
temp=(x+(a/x))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
end
assign out=temp;
endmodule
The usual means of doing this in hardware is using a CORDIC. A general implementation allows the calculation of a variety of transcendental functions (cos/sin/tan) and... square roots depending on how you initialize and operate the CORDIC.
It's an iterative algorithm so to do it in a single cycle you'd unroll the loop into as many iterations as you require for your desired precision and chain the instances together.
Specifically if you operated the CORDIC in vectoring mode, initialize it with [x, 0] and rotate to 45 degrees the [x', y'] final output will be a multiplicative constant away. i.e. sqrt(x) = x' * sqrt(2) * K
There is conversion to a logarithm, halving, and converting back.
For an idea how to implement "combinatorial" log and antilog, see Michael Dunn's EDN article showing priority encoder, barrel shifter & lookup table, with three log variants in System Verilog for down-load.
(Priority encoder, barrel shifter & lookup table look promising for "one-step-Babylonian/Heron/Newton/-Raphson. But that would probably still need a 128K by 9 bits lookup table.)
While not featuring "verilog",
Tole Sutikno: "An Optimized Square Root Algorithm for Implementation in FPGA Hardware" shows a combinatorial implementation of a modified (binary) digit-by-digit algorithm.
来源:https://stackoverflow.com/questions/34652001/how-to-get-a-square-root-for-32-bit-input-in-one-clock-cycle-only