Integer Square Root Algorithm

sqrt64q, radix-4 digit-recurrence
2026-01-10
Takayuki HOSODA

Overview

This page presents a minimum-error rounded 64-bit integer square-root algorithm and its C implementation, designed for 18?24-bit digital signal processing systems.

With hardware implementation in mind, the algorithm performs a digit-recurrence approximation in 2-bit units (radix-4). After all digits have been generated, the LSB is adjusted so that the final error is confined within ±1/2 LSB.

Since the maximum correctly rounded square root of a 64-bit unsigned integer is 232, the return value requires 33 bits.

Algorithm

This algorithm is a radix-4 digit-recurrence square root defined by
xn+1 = 4xn + k,   k ∈ {0, 1, 2, 3}
and at each step selects the largest k satisfying
(xn+1)2An

Here An denotes the prefix (upper bits) of the radicand used for evaluation at that radix-4 step. In the code, an = a >> (4 * i - 4) corresponds to An.

The square update is given by
(4x + k)2 = 16x2 + 8xk + k2
and the code
(xs << 4) + ((x << 3) + k) * k
implements this exactly. Likewise, the comparison expression
(xs << 4) + ((x * k) << 3) + (k * k)
is algebraically identical. Therefore this loop literally implements the radix-4 square-root definition:
¡ÈAt each step, take two more bits of the radicand and choose the largest digit such that the square remains no greater than the prefix.¡É

LSB Adjustment

The final correction
if (a - x > xs)
    x++;
comes from the rounding analysis. At the end of the digit-recurrence loop we do not have
x2a < (x + 1)2
but instead, due to truncation,
a - x2 ∈ [0, 2x].
Since
(x + 1)2 = x2 + 2x + 1,
rounding to the nearest integer requires
a > x2 + x.
This condition is exactly implemented by
if (a - x > xs)
which corresponds to
floor(sqrt(a) + 0.5)
i.e., correct round-to-nearest, not merely half-LSB truncation.

Error Distribution

For example, a full enumeration over all 38-bit inputs 0 to 238 ¡Ý 1 (238 samples, maximum 274877906943 = 0x3fffffffff) yields the following histogram:

error(-1/4..-1/2 LSB) =  68719476736
error(-1/4..+1/4 LSB) = 137438953472
error(+1/4..+1/2 LSB) =  68719476736

Out of 238 total samples, the rounding-to-nearest error intervals have widths 1/4, 1/2, and 1/4 LSB, respectively, resulting in the expected 1:2:1 distribution.

C Implementation

sqrt64q returns the integer square root of a 64-bit unsigned integer with minimum rounding error.

/* sqrt64q - calculate the square root of a given unsigned integer
 * Rev.1.0 (2026-01-12) (c) 2026 Takayuki HOSODA
 * SPDX-License-Identifier: BSD-3-Clause
 */
#include <inttypes.h>
uint64_t sqrt64(uint64_t a);
uint64_t sqrt64(uint64_t a) {
    uint64_t x       = 0;        // current approximation
    uint64_t xs      = 0;        // square of the current approximation
    uint64_t k          ;        // trial digit in radix-4 (0..3)
    uint64_t an         ;        // A_n: current prefix of the radicand
    for (int i = 2 * sizeof(a); i > 0; i--) {
        if ((an = a >> (4 * i - 4))) {
            k    = (((xs << 4) +   x * 24  + 9) <= an) ? 3 :
                   (((xs << 4) +   x * 16  + 4) <= an) ? 2 :
                   (((xs << 4) +   x *  8  + 1) <= an) ? 1 : 0;
            xs   =   (xs << 4) + ((x << 3) + k) * k;
            x    =   (x  << 2) + k;
        }                        // skip if upper digits are zero
    }
    if (a - x > xs)              // residual = a - x^2 is in [0, 2x]
        x++;                     // round to the nearest integer sqrt
    return x;
}

Note:
The square root of UINT64_MAX (0xffffffffffffffff) is 0x100000000 (= 4294967296), which exceeds UINT32_MAX (0xffffffff); therefore the return type cannot be uint32_t.

Hardware-Oriented Optimization

k Selection Logic

The current implementation
k = (… ? 3 : … ? 2 : … ? 1 : 0);
uses three cascaded comparisons, which map directly to three magnitude comparators in hardware¡½the minimum required for radix-4. The comparison formula is
16x2 + 8xk + k2A,
so This form is highly synthesis-friendly.

Multipliers

The only multiplications required are All of these reduce to shifts, adds, and a tiny 2-bit LUT. No general-purpose multiplier is required, making the design ideal for ASIC and FPGA.

Keeping xs

Keeping xs (the current square) instead of recomputing it is the most important speed optimization in this class of algorithms. Compared with 1-bit radix subtraction methods, radix-4 with stored squares leads to a comparator-centric architecture that supports higher clock rates.

Summary

This algorithm is not a ¡Ènumerical computation¡É but a ¡Ècomparison problem.¡É The inequality
16x2 + 8xk + k2A
is evaluated only three times per step, which translates to very low CPU cost.

Newton–Raphson method requires FPUs and multipliers/dividers, whereas this approach uses only integer ALUs and comparators.

In hardware, each step requires only: And it always finishes in 32 steps for 64-bit inputs.
This is a textbook CORDIC / digit-recurrence structure with high speed, low power, small area, and deterministic behavior.

REFERENCES

SEE ALSO


www.finetune.co.jp [Mail] © 2000 Takayuki HOSODA.