Is there an efficient algorithm to enumerate the factors of a number n, in ascending order, without sorting? By “efficient” I mean:
The algorithm a
[I'm posting this answer just as a formality for completeness. I've already chosen someone else's answer as the accepted answer.]
Overview of algorithm. In searching for the fastest way to generate an in-memory list of factors (64-bit unsigned values in my case), I settled upon a hybrid algorithm that implements a two-dimensional bucket sort, which takes advantage of internal knowledge of the sort keys (i.e., they are just integers and can therefore be computed upon). The specific method is closer to a “ProxMapSort”, but with two levels of keys (major, minor) instead of just one. The major key is simply the floor of the base-2 logarithm of the value. The minor key is taken from the smallest number of most significant bits of the value (after its leading 1-bit) needed to produce a reasonable spread at the second layer of buckets. Factors are first produced, unsorted, into a temporary work array. Next, their distribution is analyzed, and an array of bucket indexes is allocated and populated. Finally, each factor is stored directly into its place in the final sorted array, using insertion sort within its bucket. The vast majority of buckets hold only 1, 2, or 3 values. Examples are given in the source code, which is attached at the bottom of this answer.
Spatial complexity. Memory utilization is approximately 4x that of a Quicksort-based solution, but this is actually rather insignificant, as the maximum memory ever used in the worst case (for 64-bit input) is 5.5 MB, of which 4.0 MB is held for only a small handful of milliseconds.
Runtime complexity. Performance is far better than a hand-coded Quicksort-based solution: for numbers with a nontrivial number of factors, it is uniformly about 2.5x faster. On my 3.4 GHz Intel i7, it produces the 184,320 factors of 18,401,055,938,125,660,800 in sorted order in 0.0052 seconds, or about 96 clock cycles per factor, or about 35 million factors per second.
Graphs. Memory and runtime performance were profiled for the 47,616 canonical representatives of the equivalence classes of prime signatures of numbers up to 2⁶⁴–1. These are the so-called “highly factorable numbers” in the 64-bit search space.
Total runtime is ~2.5x better than a Quicksort-based solution for nontrivial factor counts, shown below on this log–log plot:

The number of sorted factors produced per second is essentially the inverse of the above. Performance on a per-factor basis declines after the sweet spot of approximately 2000 factors, but not by much. Performance is affected by the L1, L2, and L3 cache sizes, as well as by the count of unique prime factors of the number being factored, which goes up roughly with the logarithm of the input value.

Peak memory usage is a straight line in this log–log plot, since it is roughly proportional to the number of factors. Note that peak memory usage lasts only for a very brief period of time; short-lived work arrays are discarded within milliseconds. After the temporary arrays are discarded, what remains is the final list of factors, which is the same minimal usage as seen in the Quicksort-based solution.

Source Code. Attached below is a proof-of-concept program in the C programming language (specifically, C11). It has been tested on x86-64 with Clang/LLVM, although it should work fine with GCC as well.
/*==============================================================================
DESCRIPTION
This is a small proof-of-concept program to test the idea of "sorting"
factors using a form of bucket sort. The method is essentially a 2D version
of ProxMapSort that has been tuned for vast, nonlinear distributions using two
keys (major, minor) rather than one. The major key is simply the floor of
the base-2 logarithm of the value, and the minor key is derived from the most
significant bits of the value.
INPUT
Input is given on the command line, either as a single argument giving the
number to be factored or an even number of arguments giving the 2-tuples that
comprise the prime-power factorization of the desired number. For example,
the number
75600 = 2^4 x 3^3 x 5^2 x 7
can be given by the following list of arguments:
2 4 3 3 5 2 7 1
Note: If a single number is given, it will require factoring to produce its
prime-power factorization. Since this is just a small test program, a very
crude factoring method is used that is extremely fast for small prime factors
but extremely slow for large prime factors. This is actually fine, because
the largest factor lists occur with small prime factors anyway, and it is the
production of large factor lists at which this program aims to be proficient.
It is simply not interesting to be fast at producing the factor list of a
number like 17293823921105882610 = 2 x 3 x 5 x 576460797370196087, because
it has only 32 factors. Numbers with tens or hundreds of thousands of
factors are much more interesting.
OUTPUT
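Results are written to standard output. A list of factors in ascending order
is produced, followed by a summary line giving n, the number of factors, the
average CPU time (in milliseconds) required to generate the list (not
including time to print it), the time per factor (in nanoseconds), and the
peak memory usage.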
STATISTICS
Bucket size statistics for the 47616 canonical representatives of the prime
signature equivalence classes of 64-bit numbers:
==============================================================
Bucket size    Total count of factored    Total count of
     b         numbers needing size b     buckets of size b
--------------------------------------------------------------
     1           47616 (100.0%)           514306458 (76.2%)
     2           47427  (99.6%)           142959971 (21.2%)
     3           43956  (92.3%)            16679329  (2.5%)
     4           27998  (58.8%)              995458  (0.1%)
     5            6536  (13.7%)               33427 (<0.1%)
     6             400   (0.8%)                 729 (<0.1%)
     7              12  (<0.1%)                  18 (<0.1%)
--------------------------------------------------------------
     ~           47616 (100.0%)           674974643 (100.0%)
--------------------------------------------------------------
Thus, no 64-bit number (of the input set) ever requires a bucket holding more
than 7 values, and the larger the bucket size, the less frequent it is. This
is highly desirable. Note that although most numbers need at least one bucket
of size 4, the vast majority of buckets (99.9%) are of size 1, 2, or 3,
meaning that insertions are extremely efficient. Therefore, the use of
insertion sort for the buckets is clearly the right choice and is arguably
optimal for performance.
AUTHOR
Todd Lehman
2015/05/08
*/
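#include <assert.h>
#include <inttypes.h>
#include <math.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>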
typedef unsigned int uint;
typedef uint8_t uint8;
typedef uint16_t uint16;
typedef uint32_t uint32;
typedef uint64_t uint64;
#define ARRAY_CAPACITY(x) (sizeof(x) / sizeof((x)[0]))
//-----------------------------------------------------------------------------
// This structure is sufficient to represent the prime-power factorization of
// all 64-bit values. The field names ω and Ω are derived from the standard
// number theory functions ω(n) and Ω(n), which count the number of unique and
// non-unique prime factors of n, respectively. The field name d is derived
// from the standard number theory function d(n), which counts the number of
// divisors of n, including 1 and n.
//
// The maximum possible value here of ω is 15, which occurs for example at
// n = 7378677391061896920 = 2^3 x 3^2 x 5 x 7 x 11 x 13 x 17 x 19 x 23 x 29
// x 31 x 37 x 41 x 43 x 47, which has 15 unique prime factors.
//
// The maximum possible value of Ω here is 63, which occurs for example at
// n = 2^63 and n = 2^62 x 3, both of which have 63 non-unique prime factors.
//
// The maximum possible value of d here is 184320, which occurs at
// n = 18401055938125660800 = 2^7 x 3^4 x 5^2 x 7^2 x 11 x 13 x 17 x 19 x 23 x
// 29 x 31 x 37 x 41.
//
// Maximum possible exponents when exponents are sorted in decreasing order:
//
// Index  Maximum  Bits  Example of n
// -----  -------  ----  --------------------------------------------
//     0       63     6  (2)^63
//     1       24     5  (2*3)^24
//     2       13     4  (2*3*5)^13
//     3        8     4  (2*3*5*7)^8
//     4        5     3  (2*3*5*7*11)^5
//     5        4     3  (2*3*5*7*11*13)^4
//     6        3     2  (2*3*5*7*11*13*17)^3
//     7        2     2  (2*3*5*7*11*13*17*19)^2
//     8        2     2  (2*3*5*7*11*13*17*19*23)^2
//     9        1     1  (2*3*5*7*11*13*17*19*23*29)^1
//    10        1     1  (2*3*5*7*11*13*17*19*23*29*31)^1
//    11        1     1  (2*3*5*7*11*13*17*19*23*29*31*37)^1
//    12        1     1  (2*3*5*7*11*13*17*19*23*29*31*37*41)^1
//    13        1     1  (2*3*5*7*11*13*17*19*23*29*31*37*41*43)^1
//    14        1     1  (2*3*5*7*11*13*17*19*23*29*31*37*41*43*47)^1
// -----  -------  ----  --------------------------------------------
//    15       63    37
//
#pragma pack(push, 8)
typedef struct
{
uint8 e[16]; // Exponents.
uint64 p[16]; // Primes in increasing order.
uint8 ω; // Count of prime factors without multiplicity.
uint8 Ω; // Count of prime factors with multiplicity.
uint32 d; // Count of factors of n, including 1 and n.
uint64 n; // Value of n on which all other fields of this struct depend.
}
PrimePowerFactorization; // 160 bytes with 8-byte packing
#pragma pack(pop)
#define MAX_ω 15
#define MAX_Ω 63
//-----------------------------------------------------------------------------
// Fatal error: print error message to standard error and exit with failure.
void fatal_error(const char *format, ...)
{
va_list args;
va_start(args, format);
vfprintf(stderr, format, args);
va_end(args);
fputc('\n', stderr);
exit(EXIT_FAILURE);
}
//-----------------------------------------------------------------------------
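// Compute 64-bit 2-adic integer inverse of an odd x, i.e., the y for which
// x * y == 1 (mod 2^64), using Newton's iteration y <- y * (2 - x*y).
uint64 uint64_inv(const uint64 x)
{
assert((x & 1) != 0); // Only odd values are invertible modulo 2^64.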
uint64 y = 1;
for (uint i = 0; i < 6; i++) // 6 = log2(log2(2**64)) = log2(64)
y = y * (2 - (x * y));
return y;
}
//------------------------------------------------------------------------------
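// Compute 2 to arbitrary power. This is just a portable and abstract way to
// write a left-shift operation. Note that the use of the UINT64_C macro here
// is actually required, because the result of 1U<<k is undefined for k >= 32
// when unsigned int is 32 bits wide.
static inline
uint64 uint64_pow2(const uint k)
{
assert(k <= 63);
return UINT64_C(1) << k;
}
//------------------------------------------------------------------------------
// Compute floor of base-2 logarithm of a 64-bit value x, given x > 0. Uses fast
// intrinsic function if available; otherwise resorts to hand-rolled method.
#if !defined(UINT64_CLZ) && (defined(__GNUC__) || defined(__clang__))
#define UINT64_CLZ(x) __builtin_clzll(x)
#endif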
static inline
uint uint64_log2(uint64 x)
{
assert(x > 0);
#if defined(UINT64_CLZ)
return 63 - UINT64_CLZ(x);
#else
#define S(k) if ((x >> k) != 0) { y += k; x >>= k; }
uint y = 0; S(32); S(16); S(8); S(4); S(2); S(1); return y;
#undef S
#endif
}
//------------------------------------------------------------------------------
// Compute major key, given a nonzero number. The major key is simply the
// floor of the base-2 logarithm of the number.
static inline
uint major_key(const uint64 n)
{
assert(n > 0);
uint k1 = uint64_log2(n);
return k1;
}
//------------------------------------------------------------------------------
// Compute minor key, given a nonzero number, its major key, k1, and the
// bit-size b of major bucket k1. The minor key, k2, is computed by first
// removing the most significant 1-bit from the number, because it adds no
// information, and then extracting the desired number of most significant bits
// from the remainder. For example, given the number n=1463 and a major bucket
// size of b=6 bits, the keys are computed as follows:
//
// Step 0: Given number n = 0b10110110111 = 1463
//
// Step 1: Compute major key: k1 = floor(log_2(n)) = 10
//
// Step 2: Remove high-order 1-bit: n' = 0b0110110111 = 439
//
// Step 3: Compute minor key: k2 = n' >> (k1 - b)
// = 0b0110110111 >> (10 - 6)
// = 0b0110110111 >> 4
// = 0b011011
// = 27
static inline
uint minor_key(const uint64 n, const uint k1, const uint b)
{
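assert(n > 0); assert(b > 0);
// Strip the leading 1-bit, then keep the top b bits of what remains. When
// k1 < b (e.g., n = 1), shift left instead to avoid a negative shift count.
const uint64 rest = n ^ uint64_pow2(k1);
const uint k2 = (k1 >= b)? (uint)(rest >> (k1 - b)) : (uint)(rest << (b - k1));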
return k2;
}
//------------------------------------------------------------------------------
// Raw unsorted factor.
#pragma pack(push, 4)
typedef struct
{
uint64 n; // Value of factor.
uint32 k1; // Major key.
uint32 k2; // Minor key.
}
UnsortedFactor;
#pragma pack(pop)
//------------------------------------------------------------------------------
// Compute sorted list of factors, given a prime-power factorization.
static uint64 memory_usage;
uint64 *compute_factors(const PrimePowerFactorization ppf)
{
memory_usage = 0;
if (ppf.n == 0)
return NULL;
uint64 *sorted_factors = calloc(ppf.d, sizeof(*sorted_factors));
if (!sorted_factors)
fatal_error("Failed to allocate array of %"PRIu32" factors.", ppf.d);
memory_usage += ppf.d * sizeof(*sorted_factors);
UnsortedFactor *unsorted_factors = malloc(ppf.d * sizeof(*unsorted_factors));
if (!unsorted_factors)
fatal_error("Failed to allocate array of %"PRIu32" factors.", ppf.d);
memory_usage += ppf.d * sizeof(*unsorted_factors);
// These arrays are indexed by the major key of a number.
uint32 major_counts[64]; // Counts of factors in major buckets.
uint32 major_spans[64]; // Counts rounded up to power of 2.
uint32 major_bits[64]; // Base-2 logarithm of bucket size.
uint32 major_indexes[64]; // Indexes into minor array.
memset(major_counts, 0, sizeof(major_counts));
memset(major_spans, 0, sizeof(major_spans));
memset(major_bits, 0, sizeof(major_bits));
memset(major_indexes, 0, sizeof(major_indexes));
// --- Step 1: Produce unsorted list of factors from prime-power
// factorization. At the same time, count groups of factors by their
// major keys.
{
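// This array is for counting in the multi-radix number system dictated by
// the exponents of the prime-power factorization. An invariant is that
// e[i] <= ppf.e[i] for all i (0 <= i < ppf.ω).
uint8 e[MAX_ω + 1];
memset(e, 0, sizeof(e));
// Precompute the 2-adic inverses of the odd prime powers p[i]^e[i], so that
// dividing out a prime power during a carry is a single multiplication.
// (The prime 2 has no such inverse; it is divided out with a right shift.)
uint64 pe_inv[MAX_ω + 1];
for (uint i = 0; i < ppf.ω; i++)
{
uint64 pe = 1;
for (uint j = 0; j < ppf.e[i]; j++)
pe *= ppf.p[i];
pe_inv[i] = (ppf.p[i] == 2)? 0 : uint64_inv(pe);
}
// Step through all ppf.d combinations of exponents, producing one factor
// per step and tallying it in its major bucket.
uint64 n = 1;
for (uint32 k = 0; k < ppf.d; k++)
{
const uint k1 = major_key(n);
unsorted_factors[k] = (UnsortedFactor){ .n = n, .k1 = k1, .k2 = 0 };
major_counts[k1] += 1;
// Increment the multi-radix counter, updating n to the next factor.
for (uint i = 0; i < ppf.ω; i++)
{
if (e[i] == ppf.e[i]) // Carrying is occurring.
{
if (ppf.p[i] == 2)
n >>= ppf.e[i]; // Divide n by 2 ** ppf.e[i].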
else
n *= pe_inv[i]; // Divide n by ppf.p[i] ** ppf.e[i].
e[i] = 0;
}
else // Carrying is not occurring.
{
n *= ppf.p[i];
e[i] += 1;
break;
}
}
}
assert(n == 1); // n always cycles back to 1, not to ppf.n.
assert(unsorted_factors[ppf.d-1].n == ppf.n);
}
// --- Step 2: Define the major bits array, the major spans array, the major
// index array, and count the total spans.
uint32 total_spans = 0;
{
uint32 k = 0;
for (uint k1 = 0; k1 < ARRAY_CAPACITY(major_counts); k1++)
{
uint32 count = major_counts[k1];
uint32 bits = (count <= 1)? count : uint64_log2(count - 1) + 1;
major_bits[k1] = bits;
major_spans[k1] = (count > 0)? (UINT32_C(1) << bits) : 0;
major_indexes[k1] = k;
k += major_spans[k1];
}
total_spans = k;
}
// --- Step 3: Allocate and populate the minor counts array. Note that it
// must be initialized to zero.
uint32 *minor_counts = calloc(total_spans, sizeof(*minor_counts));
if (!minor_counts)
fatal_error("Failed to allocate array of %"PRIu32" counts.", total_spans);
memory_usage += total_spans * sizeof(*minor_counts);
for (uint k = 0; k < ppf.d; k++)
{
const uint64 n = unsorted_factors[k].n;
const uint k1 = unsorted_factors[k].k1;
const uint k2 = minor_key(n, k1, major_bits[k1]);
assert(k2 < major_spans[k1]);
unsorted_factors[k].k2 = k2;
minor_counts[major_indexes[k1] + k2] += 1;
}
// --- Step 4: Define the minor indexes array.
//
// NOTE: Instead of allocating a separate array, the earlier-allocated array
// of minor counts is simply repurposed here using an alias.
uint32 *minor_indexes = minor_counts; // Alias the array for repurposing.
{
uint32 k = 0;
for (uint i = 0; i < total_spans; i++)
{
uint32 count = minor_counts[i]; // This array is the same array...
minor_indexes[i] = k; // ...as this array.
k += count;
}
}
// --- Step 5: Populate the sorted factors array. Note that the array must
// have been initialized to zero (hence the calloc above), because
// values of zero are used as sentinels marking empty bucket slots.
for (uint32 i = 0; i < ppf.d; i++)
{
uint64 n = unsorted_factors[i].n;
const uint k1 = unsorted_factors[i].k1;
const uint k2 = unsorted_factors[i].k2;
// Insert factor into bucket using insertion sort (which happens to be
// extremely fast because we know the bucket sizes are always very small).
uint32 k;
for (k = minor_indexes[major_indexes[k1] + k2];
sorted_factors[k] != 0;
k++)
{
assert(k < ppf.d);
if (sorted_factors[k] > n)
{ uint64 t = sorted_factors[k]; sorted_factors[k] = n; n = t; }
}
sorted_factors[k] = n;
}
// --- Step 6: Validate array of sorted factors.
{
for (uint32 k = 1; k < ppf.d; k++)
{
if (sorted_factors[k] == 0)
fatal_error("Produced a factor of 0 at index %"PRIu32".", k);
if (ppf.n % sorted_factors[k] != 0)
fatal_error("Produced non-factor %"PRIu64" at index %"PRIu32".",
sorted_factors[k], k);
if (sorted_factors[k-1] == sorted_factors[k])
fatal_error("Duplicate factor %"PRIu64" at index %"PRIu32".",
sorted_factors[k], k);
if (sorted_factors[k-1] > sorted_factors[k])
fatal_error("Out-of-order factors %"PRIu64" and %"PRIu64" "
"at indexes %"PRIu32" and %"PRIu32".",
sorted_factors[k-1], sorted_factors[k], k-1, k);
}
}
free(minor_counts);
free(unsorted_factors);
return sorted_factors;
}
//------------------------------------------------------------------------------
// Compute prime-power factorization of a 64-bit value. Note that this function
// is designed to be fast *only* for numbers with very simple factorizations,
// e.g., those that produce large factor lists. Do not attempt to factor
// large semiprimes with this function. (The author does know how to factor
// large numbers efficiently; however, efficient factorization is beyond the
// scope of this small test program.)
PrimePowerFactorization compute_ppf(const uint64 n)
{
PrimePowerFactorization ppf;
if (n == 0)
{
ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 0, .n = 0 };
}
else if (n == 1)
{
ppf = (PrimePowerFactorization){ .p = { 1 }, .e = { 1 },
.ω = 1, .Ω = 1, .d = 1, .n = 1 };
}
else
{
ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 1, .n = n };
uint64 m = n;
for (uint64 p = 2; p <= m / p; p += 1 + (p > 2)) // p <= m/p avoids p*p overflow.
{
if (m % p == 0)
{
assert(ppf.ω <= MAX_ω);
ppf.p[ppf.ω] = p;
ppf.e[ppf.ω] = 0;
while (m % p == 0)
{ m /= p; ppf.e[ppf.ω] += 1; }
ppf.d *= (1 + ppf.e[ppf.ω]);
ppf.Ω += ppf.e[ppf.ω];
ppf.ω += 1;
}
}
if (m > 1)
{
assert(ppf.ω <= MAX_ω);
ppf.p[ppf.ω] = m;
ppf.e[ppf.ω] = 1;
ppf.d *= 2;
ppf.Ω += 1;
ppf.ω += 1;
}
}
return ppf;
}
//------------------------------------------------------------------------------
// Parse prime-power factorization from a list of ASCII-encoded base-10 strings.
// The values are assumed to be 2-tuples (p,e) of prime p and exponent e.
// Primes must not exceed 2^64 - 1. Exponents must not exceed 2^8 - 1. The
// constructed value must not exceed 2^64 - 1.
PrimePowerFactorization parse_ppf(const uint pairs, const char *const values[])
{
assert(pairs <= MAX_ω);
PrimePowerFactorization ppf;
if (pairs == 0)
{
ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 0, .n = 0 };
}
else
{
ppf = (PrimePowerFactorization){ .ω = 0, .Ω = 0, .d = 1, .n = 1 };
for (uint i = 0; i < pairs; i++)
{
ppf.p[i] = (uint64)strtoumax(values[(i*2)+0], NULL, 10);
ppf.e[i] = (uint8)strtoumax(values[(i*2)+1], NULL, 10);
// Validate prime value.
if (ppf.p[i] < 2) // (Ideally this would actually do a primality test.)
fatal_error("Factor %"PRIu64" is invalid.", ppf.p[i]);
// Accumulate count of unique prime factors.
if (ppf.ω > UINT8_MAX - 1)
fatal_error("Small-omega overflow at factor %"PRIu64"^%"PRIu8".",
ppf.p[i], ppf.e[i]);
ppf.ω += 1;
// Accumulate count of total prime factors.
if (ppf.Ω > UINT8_MAX - ppf.e[i])
fatal_error("Big-omega overflow at factor %"PRIu64"^%"PRIu8".",
ppf.p[i], ppf.e[i]);
ppf.Ω += ppf.e[i];
// Accumulate total divisor count.
if (ppf.d > UINT32_MAX / (1 + ppf.e[i]))
fatal_error("Divisor count overflow at factor %"PRIu64"^%"PRIu8".",
ppf.p[i], ppf.e[i]);
ppf.d *= (1 + ppf.e[i]);
// Accumulate value.
for (uint8 k = 1; k <= ppf.e[i]; k++)
{
if (ppf.n > UINT64_MAX / ppf.p[i])
fatal_error("Value overflow at factor %"PRIu64".", ppf.p[i]);
ppf.n *= ppf.p[i];
}
}
}
return ppf;
}
//------------------------------------------------------------------------------
// Main control. Parse command line and produce list of factors.
int main(const int argc, const char *const argv[])
{
PrimePowerFactorization ppf;
uint values = (uint)argc - 1; // argc is always guaranteed to be at least 1.
if (values == 1)
{
ppf = compute_ppf((uint64)strtoumax(argv[1], NULL, 10));
}
else
{
if (values % 2 != 0)
fatal_error("Odd number of arguments (%u) given.", values);
uint pairs = values / 2;
ppf = parse_ppf(pairs, &argv[1]);
}
// Run for (as close as possible to) a fixed amount of time, tallying the
// elapsed CPU time.
uint64 iterations = 0;
double cpu_time = 0.0;
const double cpu_time_limit = 0.05;
while (cpu_time < cpu_time_limit)
{
clock_t clock_start = clock();
uint64 *factors = compute_factors(ppf);
clock_t clock_end = clock();
cpu_time += (double)(clock_end - clock_start) / (double)CLOCKS_PER_SEC;
if (++iterations == 1)
{
for (uint32 i = 0; i < ppf.d; i++)
printf("%"PRIu64"\n", factors[i]);
}
if (factors) free(factors);
}
// Print the average amount of CPU time required for each iteration.
uint mem_scale = (memory_usage >= 1e9)? 9:
(memory_usage >= 1e6)? 6:
(memory_usage >= 1e3)? 3:
0;
const char *mem_units = (mem_scale == 9)? "GB":
(mem_scale == 6)? "MB":
(mem_scale == 3)? "KB":
"B";
printf("%"PRIu64" %"PRIu32" factors %.6f ms %.3f ns/factor %.3f %s\n",
ppf.n,
ppf.d,
cpu_time/iterations * 1e3,
cpu_time/iterations * 1e9 / (double)(ppf.d? ppf.d : 1),
(double)memory_usage / pow(10, mem_scale),
mem_units);
return 0;
}