This page presents a minimum-error rounded 64-bit integer square-root algorithm and its C implementation, designed for 18?24-bit digital signal processing systems.
With hardware implementation in mind, the algorithm performs a digit-recurrence approximation in 2-bit units (radix-4). After all digits have been generated, the LSB is adjusted so that the final error is confined within ±1/2 LSB.
Since the maximum correctly rounded square root of a 64-bit unsigned integer is 232, the return value requires 33 bits.
Here An denotes the prefix (upper bits) of the radicand used for evaluation at that radix-4 step.
In the code,
an = a >> (4 * i - 4)
corresponds to An.
(xs << 4) + ((x << 3) + k) * k
implements this exactly.
Likewise, the comparison expression
(xs << 4) + ((x * k) << 3) + (k * k)
is algebraically identical.
Therefore this loop literally implements the radix-4 square-root definition:
¡ÈAt each step, take two more bits of the radicand and choose the largest digit such that the square remains no greater than the prefix.¡É
if (a - x > xs)
x++;
comes from the rounding analysis.
At the end of the digit-recurrence loop we do not have
if (a - x > xs)
which corresponds to
floor(sqrt(a) + 0.5)
i.e., correct round-to-nearest, not merely half-LSB truncation.
For example, a full enumeration over all 38-bit inputs 0 to 238 ¡Ý 1 (238 samples, maximum 274877906943 = 0x3fffffffff) yields the following histogram:
error(-1/4..-1/2 LSB) = 68719476736 error(-1/4..+1/4 LSB) = 137438953472 error(+1/4..+1/2 LSB) = 68719476736
Out of 238 total samples, the rounding-to-nearest error intervals have widths 1/4, 1/2, and 1/4 LSB, respectively, resulting in the expected 1:2:1 distribution.
sqrt64q returns the integer square root of a 64-bit unsigned integer with minimum rounding error.
/* sqrt64q - calculate the square root of a given unsigned integer * Rev.1.0 (2026-01-12) (c) 2026 Takayuki HOSODA * SPDX-License-Identifier: BSD-3-Clause */ #include <inttypes.h> uint64_t sqrt64(uint64_t a); uint64_t sqrt64(uint64_t a) { uint64_t x = 0; // current approximation uint64_t xs = 0; // square of the current approximation uint64_t k ; // trial digit in radix-4 (0..3) uint64_t an ; // A_n: current prefix of the radicand for (int i = 2 * sizeof(a); i > 0; i--) { if ((an = a >> (4 * i - 4))) { k = (((xs << 4) + x * 24 + 9) <= an) ? 3 : (((xs << 4) + x * 16 + 4) <= an) ? 2 : (((xs << 4) + x * 8 + 1) <= an) ? 1 : 0; xs = (xs << 4) + ((x << 3) + k) * k; x = (x << 2) + k; } // skip if upper digits are zero } if (a - x > xs) // residual = a - x^2 is in [0, 2x] x++; // round to the nearest integer sqrt return x; }
Note:
The square root of UINT64_MAX (0xffffffffffffffff) is
0x100000000 (= 4294967296),
which exceeds UINT32_MAX (0xffffffff);
therefore the return type cannot be uint32_t.
k = (… ? 3 : … ? 2 : … ? 1 : 0);
uses three cascaded comparisons, which map directly to three magnitude comparators in hardware¡½the minimum required for radix-4.
The comparison formula is
16x2 is common,8x is common,k2 is a small LUT.x * 3, x * 2, x * 1k * k (k is 2-bit)xsxs (the current square) instead of recomputing it is the most important speed optimization in this class of algorithms.
Compared with 1-bit radix subtraction methods, radix-4 with stored squares leads to a comparator-centric architecture that supports higher clock rates.
This algorithm is not a ¡Ènumerical computation¡É but a ¡Ècomparison problem.¡É The inequality16x2 + 8xk + k2 ≤ Ais evaluated only three times per step, which translates to very low CPU cost.Newton–Raphson method requires FPUs and multipliers/dividers, whereas this approach uses only integer ALUs and comparators.
In hardware, each step requires only:And it always finishes in 32 steps for 64-bit inputs.
- one 64-bit adder,
- three 64-bit comparators,
- a shifter,
- a small 2-bit LUT for k.
This is a textbook CORDIC / digit-recurrence structure with high speed, low power, small area, and deterministic behavior.



© 2000 Takayuki HOSODA.