Integer Square Root Algorithm

sqrt64q, radix-4 digit-recurrence

2026-01-10

Takayuki HOSODA

Overview

This page presents a minimum-error rounded 64-bit integer square-root algorithm and its C implementation, designed for 18–24-bit digital signal processing systems.

With hardware implementation in mind, the algorithm performs a digit-recurrence approximation in 2-bit units (radix-4). After all digits have been generated, the LSB is adjusted so that the final error is confined within ±1/2 LSB.

Since the maximum correctly rounded square root of a 64-bit unsigned integer is 2³², the return value requires 33 bits.
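
As a worked check of the 33-bit claim, the following sketch applies the round-up condition a > x² + x (derived under LSB Adjustment) to the largest truncated root x = 2³² - 1:

```latex
\begin{aligned}
x &= \left\lfloor \sqrt{2^{64}-1} \right\rfloor = 2^{32}-1, \\
x^2 + x &= (2^{32}-1)\cdot 2^{32} = 2^{64} - 2^{32}, \\
a > x^2 + x &\iff a \ge 2^{64} - 2^{32} + 1 .
\end{aligned}
```

So the 2³² - 1 largest inputs all round up to 2³² = 0x100000000, which sets bit 32 and therefore needs 33 bits.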

Algorithm

This algorithm is a radix-4 digit-recurrence square root defined by

x_{n+1} = 4x_n + k, k ∈ {0, 1, 2, 3}

and at each step selects the largest k satisfying

(x_{n+1})² ≤ A_n

Here A_n denotes the prefix (upper bits) of the radicand used for evaluation at that radix-4 step. In the code, an = a >> (4 * i - 4) corresponds to A_n.

The square update is given by

(4x + k)² = 16x² + 8xk + k²

and the code

(xs << 4) + ((x << 3) + k) * k

implements this exactly. Likewise, the comparison expression

(xs << 4) + ((x * k) << 3) + (k * k)

is algebraically identical. Therefore this loop literally implements the radix-4 square-root definition:

“At each step, take four more bits of the radicand (two more bits of the root) and choose the largest digit such that the square remains no greater than the prefix.”
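
The definition can be sketched in C by simply trying the digits from 3 down to 0. This is a minimal illustration rather than the page's implementation: the names select_digit and isqrt_floor_radix4 are hypothetical, zero prefixes are not skipped, and only the truncated root is returned (no LSB adjustment).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: largest digit k in {0,1,2,3} such that
 * (4x + k)^2 <= an, given xs == x * x. */
static unsigned select_digit(uint64_t x, uint64_t xs, uint64_t an)
{
    for (unsigned k = 3; k > 0; k--) {
        uint64_t cand = (xs << 4) + ((x << 3) + k) * k; /* (4x + k)^2 */
        if (cand <= an)
            return k;
    }
    return 0;
}

/* Hypothetical helper: truncated root floor(sqrt(a)), one radix-4 digit
 * (two root bits, four radicand bits) per iteration. */
static uint64_t isqrt_floor_radix4(uint64_t a)
{
    uint64_t x = 0, xs = 0;
    for (int shift = 60; shift >= 0; shift -= 4) {
        uint64_t an = a >> shift;            /* prefix A_n of the radicand */
        unsigned k = select_digit(x, xs, an);
        xs = (xs << 4) + ((x << 3) + k) * k; /* keep xs == x^2 */
        x = (x << 2) + k;
        assert(xs == x * x);
    }
    return x;
}
```

For a = 1000 (0x3e8), the non-zero prefixes are 3, 62, and 1000; the selected digits are 1, 3, and 3, giving x = 1, 7, 31 and a truncated root of 31.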

LSB Adjustment

The final correction

if (a - x > xs)
    x++;

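comes from the rounding analysis. At the end of the digit-recurrence loop, x is the truncated root floor(sqrt(a)), so

x² ≤ a < (x + 1)².

Since

(x + 1)² = x² + 2x + 1,

this is equivalent to

a - x² ∈ [0, 2x].

The midpoint between x and x + 1 satisfies

(x + 1/2)² = x² + x + 1/4,

and because a is an integer, a lies above the midpoint, so that the root must be rounded up, exactly when

a > x² + x.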

This condition is exactly implemented by

if (a - x > xs)

which corresponds to

floor(sqrt(a) + 0.5)

i.e., correct round-to-nearest, not merely half-LSB truncation.
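
The equivalence can be checked by brute force. The sketch below is an illustration only; the function name count_rounding_mismatches is hypothetical, and it assumes limit is small enough (at most about 2³²) that 4a and (2x + 1)² cannot overflow. It compares the page's test a - x > xs with the exact midpoint comparison sqrt(a) > x + 1/2, written in integers as 4a > (2x + 1)².

```c
#include <stdint.h>

/* Hypothetical check: number of inputs a < limit for which the page's
 * test (a - x > xs) disagrees with the exact comparison 4a > (2x + 1)^2.
 * x is the truncated root, tracked incrementally as a grows. */
static uint64_t count_rounding_mismatches(uint64_t limit)
{
    uint64_t x = 0, mismatches = 0;
    for (uint64_t a = 0; a < limit; a++) {
        while ((x + 1) * (x + 1) <= a)  /* keep x == floor(sqrt(a)) */
            x++;
        uint64_t xs = x * x;
        int page_rule  = (a - x > xs);
        int exact_rule = (4 * a > (2 * x + 1) * (2 * x + 1));
        if (page_rule != exact_rule)
            mismatches++;
    }
    return mismatches;
}
```

As a small worked case, a = 6 gives x = 2, xs = 4, and a - x = 4, which is not greater than 4, so the result stays 2; a = 7 gives a - x = 5 > 4, so the result becomes 3.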

Error Distribution

For example, a full enumeration over all 38-bit inputs 0 to 2³⁸ - 1 (2³⁸ samples, maximum 274877906943 = 0x3fffffffff) yields the following histogram:

error(-1/4..-1/2 LSB) =  68719476736
error(-1/4..+1/4 LSB) = 137438953472
error(+1/4..+1/2 LSB) =  68719476736

Out of 2³⁸ total samples, the rounding-to-nearest error intervals have widths 1/4, 1/2, and 1/4 LSB, respectively, resulting in the expected 1:2:1 distribution.
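
A scaled-down version of this enumeration can be sketched as follows. This is an assumption-laden illustration, not the program that produced the table: error_histogram is a hypothetical helper, the error is taken as e = r - sqrt(a) for the rounded result r, and the bin tests are done in exact integer arithmetic. The sqrt64 body follows the listing in the C Implementation section.

```c
#include <stdint.h>

/* Rounded root, following the page's listing. */
static uint64_t sqrt64(uint64_t a)
{
    uint64_t x = 0, xs = 0, k, an;
    for (int i = 2 * (int)sizeof(a); i > 0; i--) {
        if ((an = a >> (4 * i - 4))) {
            k  = (((xs << 4) + x * 24 + 9) <= an) ? 3 :
                 (((xs << 4) + x * 16 + 4) <= an) ? 2 :
                 (((xs << 4) + x *  8 + 1) <= an) ? 1 : 0;
            xs = (xs << 4) + ((x << 3) + k) * k;
            x  = (x << 2) + k;
        }
    }
    if (a - x > xs)
        x++;
    return x;
}

/* Hypothetical helper: classify e = r - sqrt(a) for all a < 2^bits.
 *   counts[0]: e < -1/4   <=>  (4r + 1)^2 < 16a
 *   counts[2]: e > +1/4   <=>  (4r - 1)^2 > 16a   (r >= 1)
 *   counts[1]: everything else (|e| < 1/4)
 * Assumes bits <= 32 so that 16a and the squares fit in 64 bits. */
static void error_histogram(unsigned bits, uint64_t counts[3])
{
    counts[0] = counts[1] = counts[2] = 0;
    for (uint64_t a = 0; a < (UINT64_C(1) << bits); a++) {
        uint64_t r = sqrt64(a);
        if ((4 * r + 1) * (4 * r + 1) < 16 * a)
            counts[0]++;
        else if (r > 0 && (4 * r - 1) * (4 * r - 1) > 16 * a)
            counts[2]++;
        else
            counts[1]++;
    }
}
```

With bits = 16 the three counts are 16384, 32768, and 16384, the same 1:2:1 split as in the 38-bit table.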

C Implementation

sqrt64q (declared as sqrt64 in the listing below) returns the square root of a 64-bit unsigned integer, rounded to the nearest integer.

/* sqrt64q - calculate the square root of a given unsigned integer
 * Rev.1.0 (2026-01-12) (c) 2026 Takayuki HOSODA
 * SPDX-License-Identifier: BSD-3-Clause
 */
#include <inttypes.h>
uint64_t sqrt64(uint64_t a);
uint64_t sqrt64(uint64_t a) {
    uint64_t x       = 0;        // current approximation
    uint64_t xs      = 0;        // square of the current approximation
    uint64_t k          ;        // trial digit in radix-4 (0..3)
    uint64_t an         ;        // A_n: current prefix of the radicand
    for (int i = 2 * sizeof(a); i > 0; i--) {
        if ((an = a >> (4 * i - 4))) {
            k    = (((xs << 4) +   x * 24  + 9) <= an) ? 3 :
                   (((xs << 4) +   x * 16  + 4) <= an) ? 2 :
                   (((xs << 4) +   x *  8  + 1) <= an) ? 1 : 0;
            xs   =   (xs << 4) + ((x << 3) + k) * k;
            x    =   (x  << 2) + k;
        }                        // skip if upper digits are zero
    }
    if (a - x > xs)              // residual = a - x^2 is in [0, 2x]
        x++;                     // round to the nearest integer sqrt
    return x;
}

Note:
The rounded square root of UINT64_MAX (0xffffffffffffffff) is 0x100000000 (= 4294967296), which exceeds UINT32_MAX (0xffffffff); therefore the return type cannot be uint32_t.

Hardware-Oriented Optimization

`k` Selection Logic

The current implementation

k = (… ? 3 : … ? 2 : … ? 1 : 0);

uses three cascaded comparisons, which map directly to three magnitude comparators in hardware, the minimum required for radix-4. In the comparison

16x² + 8xk + k² ≤ A,

16x² is common to all three comparisons,
8x is common to all three comparisons (only its multiple by k differs),
k² comes from a small LUT.

This form is highly synthesis-friendly.
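
One way to picture the comparator mapping is to evaluate all three comparisons at once instead of cascading them. The sketch below is a software model under that assumption, not the page's code; select_digit_parallel is a hypothetical name. Because the trial squares grow with k, the three results form a thermometer code (c3 implies c2, which implies c1), so the digit is their sum.

```c
#include <stdint.h>

/* Hypothetical model of three parallel magnitude comparators.
 * xs must equal x * x; an is the current radicand prefix. */
static unsigned select_digit_parallel(uint64_t xs, uint64_t x, uint64_t an)
{
    uint64_t base = xs << 4;  /* 16x^2, shared by all comparators */
    uint64_t x8   = x << 3;   /* 8x, shared by all comparators */
    unsigned c1 = (base +     x8 + 1 <= an);  /* (4x + 1)^2 <= A */
    unsigned c2 = (base + 2 * x8 + 4 <= an);  /* (4x + 2)^2 <= A */
    unsigned c3 = (base + 3 * x8 + 9 <= an);  /* (4x + 3)^2 <= A */
    return c1 + c2 + c3;      /* thermometer code to digit */
}
```

For x = 1, xs = 1, and a prefix of 36, the trial squares are 25, 36, and 49, so c1 = c2 = 1, c3 = 0, and the digit is 2.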

Multipliers

The only multiplications required are

x * 3, x * 2, x * 1
k * k (k is 2-bit)

All of these reduce to shifts, adds, and a tiny 2-bit LUT. No general-purpose multiplier is required, making the design ideal for ASIC and FPGA.
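
The reduction to shifts and adds can be sketched as follows. This is an illustrative model only: trial_square and k_squared are hypothetical names, and the conditional adds stand in for what would be gating or multiplexing in hardware.

```c
#include <assert.h>
#include <stdint.h>

/* k^2 for the 2-bit digit k: the "tiny LUT". */
static const uint8_t k_squared[4] = { 0, 1, 4, 9 };

/* Hypothetical helper: (4x + k)^2 from xs == x * x, using only shifts,
 * adds, and the LUT (no multiplier). */
static uint64_t trial_square(uint64_t xs, uint64_t x, unsigned k)
{
    assert(k <= 3);
    uint64_t x8 = x << 3;     /* 8x */
    uint64_t cross = 0;       /* accumulates 8xk */
    if (k & 1)
        cross += x8;          /* + 8x  */
    if (k & 2)
        cross += x8 << 1;     /* + 16x */
    return (xs << 4) + cross + k_squared[k];
}
```

For example, x = 5, xs = 25, k = 3 gives 400 + 120 + 9 = 529 = 23².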

Keeping `xs`

Keeping xs (the current square) instead of recomputing it is the most important speed optimization in this class of algorithms. Compared with radix-2 (one bit per step) subtraction methods, radix-4 with stored squares leads to a comparator-centric architecture that supports higher clock rates.

Summary

This algorithm is not a “numerical computation” but a “comparison problem.” The inequality
16x² + 8xk + k² ≤ A
is evaluated only three times per step, which translates to very low CPU cost.
Newton–Raphson methods typically require a multiplier or divider (and often an FPU), whereas this approach uses only integer ALUs and comparators.
In hardware, each step requires only:

one 64-bit adder,

three 64-bit comparators,

a shifter,

a small 2-bit LUT for k.

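It always finishes in 16 radix-4 steps for 64-bit inputs (the loop runs 2 * sizeof(a) times), each step producing two bits of the 32-bit truncated root.
This is a textbook digit-recurrence structure, in the same shift-and-add family as CORDIC, with high speed, low power, small area, and deterministic behavior.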
