I recently read about a faster implementation of Segmented Sieve of Eratosthenes for really big numbers.
Following is an implementation of the same:
This expands on my previous answer by adding what was promised there but did not fit within the 30,000 character per answer limit:
The non-Maximally Wheel Factorized Page Segmented Sieve of Eratosthenes version from Chapter 3 in the previous answer was written as a Prime Generator with the output recursively fed back as an input for the base prime number feed. Although that was very elegant and expandable, in the following work I have taken a step back to a more imperative style of code so that readers can more readily understand the core concepts of Maximal Wheel Factorization. In a future Chapter 4.5b I will combine the concepts developed in the following example back into a Prime Generator style and add some extra refinements. Those refinements won't make it any faster for smaller ranges of a few billion, but they will make the concept usable without much loss of speed for ranges from trillions up to hundreds or thousands of trillions; the Prime Generator format is then more useful in making the program adapt as the range gets larger.
The following example's main extra refinements are: the various Look Up Tables (LUTs) used for efficient addressing of the wheel modulo residues; the generation of the special start addresses LUT, which lets one calculate the culling start address for each modulo residue bit plane given the wheel index and modulo residue bit plane index of the very first cull in the entire structure; and the Sieve Buffer composite number culling function that uses these.
This example is based on a 210 number circumference wheel using the small primes of two, three, five, and seven, as that seems to hit the "sweet spot" of efficiency for the size of the arrays and the number of bit planes. Experiments have shown that about another 5% in performance can be gained by adding the next prime of eleven to the wheel for a circumference of 2310 numbers; this wasn't done because initialization time goes up greatly and because it makes timing hard for smaller ranges of "only" a billion, as there are then only about four segments to reach that point and granularity becomes a problem.
Note that the first sieved number is 23 which is the first prime past the wheel primes and pre-cull primes; by using this, we avoid the problems of dealing with arrays that start from "one" and the problems of restoring the wheel primes that are eliminated and must be added back in by some algorithms.
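As a rough illustration of where the 48 bit planes come from, the following standalone sketch (not part of the answer's code; the names `wheelResidues`, `residues`, and `oddGaps` are made up here) enumerates the numbers from 23 upward that are coprime to 2, 3, 5, and 7 over one wheel turn, and the gaps between them measured in odd numbers. The results should line up with the hard-coded RESIDUES and WHLODDGAPS tables in the snippet.

```javascript
// Sketch: derive the residues and odd-number gaps of the 2/3/5/7 wheel, starting at 23.
const WHEEL_PRIMES = [2, 3, 5, 7];
const FIRST = 23;            // first sieved number
const CIRC = 2 * 3 * 5 * 7;  // wheel circumference of 210

// All n in [first, first + circ) that no wheel prime divides.
function wheelResidues(first, circ, primes) {
  const out = [];
  for (let n = first; n < first + circ; ++n)
    if (primes.every((p) => n % p !== 0)) out.push(n);
  return out;
}

const residues = wheelResidues(FIRST, CIRC, WHEEL_PRIMES);

// Gap from each residue to the next one, counted in odd numbers (difference / 2),
// wrapping around to the first residue of the next wheel turn.
const oddGaps = residues.map((r, i) =>
  ((i + 1 < residues.length ? residues[i + 1] : residues[0] + CIRC) - r) / 2);

console.log(residues.length);                // 48 bit planes
console.log(residues.slice(0, 8).join(" ")); // 23 29 31 37 41 43 47 53
console.log(oddGaps.slice(0, 8).join(" "));  // 3 1 3 2 1 2 3 3

// The larger wheel mentioned above would need ten times as many planes.
console.log(wheelResidues(FIRST, 2310, [2, 3, 5, 7, 11]).length); // 480
```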
Basically, every Page Segment culling sweep begins with a loop that fills a start address array with the wheel index and modulo residue index of the first cull address within the segment for each of the base primes less than the square root of the maximum number represented in the Page Segment. This start address array is then used to sieve each of the 48 modulo residue bit planes completely in turn, with all base primes scanned for each bit plane and the appropriate offset from the segment start address calculated per base prime using the multiplier and offset in the WHLSTARTS LUT: the wheel index of the base prime is multiplied by the looked-up multiplier and the looked-up offset is added to obtain the wheel index start position for the given modulo residue bit plane. From there on, the per bit plane culling is just as in Chapter 3 for the odd number bit plane. This is done 48 times, once for each bit plane, but the effective range per Page Segment for a 16 Kilobyte buffer (per bit plane) is 131072 times the 210 wheel span, or 27,525,120 numbers per Page Segment.
Relative to the effective range of a segment, this sieve reduces memory use by a factor of 48 over 105 as compared to the Chapter 3 odds-only sieve, or to less than half. However, because each segment has all 48 bit planes, the full Sieve Buffer is 16 Kilobytes times the 48 bit planes, or 768 Kilobytes (three quarters of a Megabyte). A Sieve Buffer of this size is only efficient up to about 16 billion; the example in the next chapter will adapt the size of the buffer for huge ranges so that it grows to about a hundred Megabytes for the largest ranges. For multi-threaded languages (not JavaScript), this would be a requirement per thread.
Additional memory is required to store the base primes array of 32-bit values, each representing the wheel index of a base prime and its modulo residue index (necessary for the modulo address calculations explained above). For a range of a billion there are about 3400 base primes which, at four bytes each, take about 14,000 bytes, with the additional start address array the same size. These arrays grow as the square root of the maximum value to be sieved, so they grow to about 5,761,455 values (times four bytes each) for the base primes less than a hundred million necessary to sieve to 10^16 (ten thousand trillion), or about 23 Megabytes each. Although half of this memory is required per thread, it is much smaller than the memory required for the expanded Sieve Buffers themselves.
A further refinement in the following example is the use of a "combo" sieve, where the Sieve Buffer is pre-filled from a larger wheel pattern from which the multiples of the primes eleven, thirteen, seventeen, and nineteen have already been eliminated. Pre-culling more primes than this is impractical: just adding the next prime of 23 would grow the saved pre-culled pattern from only about 64 Kilobytes per modulo residue bit plane to about twenty times that size, or about one and a half Megabytes for each of the 48 bit planes, for a total of about sixty Megabytes. Again, that is quite a large cost in memory and initialization time for only a few percent in performance. Do note that this array can be shared across all threads.
As implemented, the WHLPTRN array is about 64 Kilobytes times 48 modulo bit planes, or about 3 Megabytes, which isn't that large and which is a fixed size that does not change with increasing sieving range; this is quite a workable size as to access speed and initialization time.
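To make the addressing concrete, here is a minimal sketch of the odd-number index used throughout (index 0 is 23, index 1 is 25, and so on) and of a deliberately naive way to find the first cull position of a base prime inside a segment. The helper names (`toIndex`, `squareIndex`, `firstCullNaive`, `wheelPosition`) are hypothetical; the answer's code replaces the search loop with the WHLRNDUPS and WHLSTARTS look-ups, so this is only a reference against which such look-ups could be checked.

```javascript
const FIRST = 23;      // first sieved number
const CIRC = 210;      // wheel circumference
const ODD_CIRC = 105;  // odd numbers per wheel turn

function gcd(a, b) {
  while (b !== 0) { const t = a % b; a = b; b = t; }
  return a;
}

// Odd-number index relative to 23, and its inverse.
const toIndex = (n) => (n - FIRST) / 2;
const fromIndex = (i) => 2 * i + FIRST;

// Index of p*p computed directly from the index i of p = 2i + 23:
// (p*p - 23) / 2 = 2i(i + 23) + 23 * 11.
function squareIndex(i) {
  return 2 * i * (i + FIRST) + FIRST * ((FIRST - 1) / 2);
}

// Naive search: smallest multiple p*m at or above both p*p and the segment start,
// with m coprime to 210 so that the product lands on one of the 48 bit planes.
// Returns the odd-number index relative to the segment start.
function firstCullNaive(p, lowIndex) {
  let m = Math.max(p, Math.ceil(fromIndex(lowIndex) / p));
  while (gcd(m, CIRC) !== 1) ++m;
  return toIndex(p * m) - lowIndex;
}

// Split an odd-number index into a wheel turn and an offset within that turn.
function wheelPosition(i) {
  return { wheelIndex: Math.floor(i / ODD_CIRC), oddOffset: i % ODD_CIRC };
}

console.log(squareIndex(toIndex(29)) === toIndex(29 * 29)); // true
console.log(firstCullNaive(23, 0));    // 253, the index of 23 * 23 = 529
console.log(firstCullNaive(23, 1000)); // 12, the index of 23 * 89 = 2047 less 1000
console.log(wheelPosition(253));       // { wheelIndex: 2, oddOffset: 43 }
```

The `oddOffset` is what a table such as WHLNDXS would then map to one of the 48 bit plane indices.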
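The sizes quoted above follow from simple arithmetic, sketched here with a hypothetical `segmentStats` helper (the name and the returned field names are illustrative only):

```javascript
const WHEEL_SPAN = 210; // numbers covered by one bit position across all planes
const PLANES = 48;      // modulo residue bit planes
const ODD_CIRC = 105;   // odd numbers per wheel turn (the odds-only equivalent)

function segmentStats(bytesPerPlane) {
  const bitsPerPlane = bytesPerPlane * 8;
  return {
    numbersPerSegment: bitsPerPlane * WHEEL_SPAN, // range covered by one segment
    bufferBytes: bytesPerPlane * PLANES,          // memory for all bit planes
    bitsVersusOddsOnly: PLANES / ODD_CIRC         // bits needed relative to odds-only
  };
}

const stats = segmentStats(16384);
console.log(stats.numbersPerSegment);             // 27525120
console.log(stats.bufferBytes / 1024);            // 768 (Kilobytes)
console.log(stats.bitsVersusOddsOnly.toFixed(3)); // 0.457
```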
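The packed 32-bit representation can be sketched as follows: the low six bits hold the modulo residue index (0 to 47) and the remaining bits hold the wheel index, which matches the `& 0x3F` and `>> 6` unpacking in the snippet. The function names `packPrime` and `unpackPrime` are assumptions made for this sketch, and it only handles values of 23 and above (the answer's code also uses a negative wheel index for the pre-cull primes, which this sketch does not cover).

```javascript
const FIRST = 23;
const CIRC = 210;

// The 48 residues of one wheel turn starting at 23.
const RES = [];
for (let n = FIRST; n < FIRST + CIRC; ++n)
  if (n % 2 && n % 3 && n % 5 && n % 7) RES.push(n);

// Pack a number on the wheel as (wheel index << 6) | residue index.
function packPrime(p) {
  const wheelIndex = Math.floor((p - FIRST) / CIRC);
  const residueIndex = RES.indexOf(p - wheelIndex * CIRC);
  if (residueIndex < 0) throw new RangeError(p + " is not on the wheel");
  return ((wheelIndex << 6) | residueIndex) >>> 0;
}

// Recover the number from its packed form.
function unpackPrime(v) {
  return (v >>> 6) * CIRC + RES[v & 0x3F];
}

console.log(packPrime(23));    // 0
console.log(packPrime(233));   // 64 (wheel index 1, residue index 0)
console.log(packPrime(439));   // 111 (wheel index 1, residue index 47)
console.log(unpackPrime(111)); // 439
```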
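The pre-fill idea can be shown on a toy scale. In this sketch a single bit plane is pre-culled by 3 and 5 instead of 11, 13, 17, and 19, so the bit pattern repeats every 15 bits; a pattern of 15 bytes therefore holds exactly eight whole periods, which means any byte offset taken modulo 15 is a valid place to start copying. The names `PERIOD`, `CHUNK`, `makePattern`, and `fillFromPattern` are illustrative, and the extra `CHUNK` bytes play the role of the padding that lets a copy run past the end of one period without wrapping.

```javascript
const PERIOD = 3 * 5; // pattern period in bits, and the pattern length in bytes
const CHUNK = 16;     // bytes copied per step; the pattern is padded by this much

// Build the pre-culled pattern: bit b is set when b is a multiple of 3 or 5.
function makePattern() {
  const pat = new Uint8Array(PERIOD + CHUNK);
  for (let bit = 0; bit < pat.length * 8; ++bit)
    if (bit % 3 === 0 || bit % 5 === 0) pat[bit >> 3] |= 1 << (bit & 7);
  return pat;
}

// Fill buf so that its bit j represents position lowBit + j (lowBit a multiple of 8).
function fillFromPattern(pat, lowBit, buf) {
  const baseByte = lowBit / 8;
  for (let i = 0; i < buf.length; i += CHUNK) {
    const mod = (baseByte + i) % PERIOD;
    const n = Math.min(CHUNK, buf.length - i);
    buf.set(pat.subarray(mod, mod + n), i);
  }
}

const pattern = makePattern();
const segment = new Uint8Array(40);
fillFromPattern(pattern, 8 * 1000, segment);
// Bit 0 of the segment stands for position 8000, a multiple of 5, so it is set.
console.log(segment[0] & 1); // 1
```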
These "Maximal Wheel Factorization" refinements reduce the total number of composite number culling operations for sieving a range of a billion from about a billion for the Chapter 3 odds-only example to about a quarter billion for this "combo" sieve. The goal is to keep the number of CPU clock cycles per culling operation the same, and thus gain a factor of four in speed.
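These operation counts can be estimated without running either sieve. The sketch below is an approximation under stated assumptions: each base prime p is taken to cull from p*p to the top of the range, hitting every second multiple in the odds-only case and 48 of every 210 multiples in the wheel case, with the primes below 23 doing no culling at all in the "combo" sieve. The function names are made up for this sketch.

```javascript
// Simple sieve for the base primes up to a small limit.
function basePrimes(limit) {
  const composite = new Uint8Array(limit + 1);
  const primes = [];
  for (let p = 2; p <= limit; ++p) {
    if (composite[p]) continue;
    primes.push(p);
    for (let c = p * p; c <= limit; c += p) composite[c] = 1;
  }
  return primes;
}

// Approximate culling operation counts for a sieve to `range`.
function estimateCulls(range) {
  const root = Math.floor(Math.sqrt(range));
  let oddsOnly = 0;
  let combo = 0;
  for (const p of basePrimes(root)) {
    if (p === 2) continue;                         // both sieves skip the evens
    const span = range - p * p;                    // culling starts at p * p
    oddsOnly += span / (2 * p);                    // odd multiples only
    if (p >= 23) combo += (span / p) * (48 / 210); // wheel multiples only
  }
  return { oddsOnly, combo };
}

const est = estimateCulls(1e9);
console.log((est.oddsOnly / 1e9).toFixed(2) + " billion odds-only culls");
console.log((est.combo / 1e9).toFixed(2) + " billion combo culls");
console.log("ratio " + (est.oddsOnly / est.combo).toFixed(1));
```

The estimate should come out at roughly a billion against roughly a quarter billion, in line with the figures above.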
EDITED: The following snippet has been adjusted to add a rudimentary HTML Single Web Page Application user interface so that parameters can easily be adjusted for experimentation. For best ease of use, one should use the upper right "Full page" link after clicking the "Run code snippet" button, and can close the full page with a top right link when finished. To run on a smartphone (preferably in Chrome), use the "Desktop site" checkbox in the settings menu (the triple dot menu).
EDIT_CORRECTION: With the range limit now easily changed up to the specified upper bound, the dummy indexed base prime array used to sieve the base primes themselves was no longer large enough to cover the square root of the square root of the upper bound of LIMIT, which is 362 (the array previously only reached 229), so it has been increased to two wheel spans, or 439.
FURTHER_EDIT_CORRECTION: There was a slight error in the fillSieveBuffer function when filling SieveBuffer residual bit plane buffers larger than 16384 bytes, which has been corrected.
SPEED_OMISSION_CORRECTION: From working with other languages, it was realized that this version was about 20% slower than it should be because it did not use "loop unpeeling" effectively, in that it did not calculate an appropriate limit below which the technique should be applied. The "bplmt" limit has been added and applied to correct this. Upon first running the code, one should press the Sieve button several times to allow the JavaScript engine to tune the generated code for optimizations and thus improve speed; it reaches full speed after four or five iterations.
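The fixed-mask technique referred to here can be sketched in isolation. Eight successive culls of step `step` land on bit positions that each recur every `step` bytes, so each of the eight can be handled by a tight inner loop that reuses one constant mask. The two functions below are illustrative stand-ins (their names are not from the snippet); in the snippet the mask form is only used when the base prime is below `bplmt`, one sixteenth of the plane length in bytes, so that the inner loops are long enough to pay off.

```javascript
// Straightforward culling: compute the byte and mask for every cull.
function cullSimple(buf, start, step) {
  const bits = buf.length * 8;
  for (let s = start; s < bits; s += step) buf[s >> 3] |= 1 << (s & 7);
}

// Fixed-mask culling: at most eight outer passes, each with a constant mask
// and an inner loop that steps through the buffer `step` bytes at a time.
function cullByMask(buf, start, step) {
  const len = buf.length;
  const limit = Math.min(len * 8, start + step * 8);
  for (let s = start; s < limit; s += step) {
    const mask = 1 << (s & 7);
    for (let c = s >> 3; c < len; c += step) buf[c] |= mask;
  }
}

const a = new Uint8Array(1024);
const b = new Uint8Array(1024);
cullSimple(a, 5, 23);
cullByMask(b, 5, 23);
console.log(a.every((v, i) => v === b[i])); // true
```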
The JavaScript example as described above is implemented as follows:
"use strict";
const WHLPRMS = new Uint32Array([2,3,5,7,11,13,17,19]);
const FRSTSVPRM = 23;
const WHLODDCRC = 105 | 0;
const WHLHITS = 48 | 0;
const WHLODDGAPS = new Uint8Array([
3, 1, 3, 2, 1, 2, 3, 3, 1, 3, 2, 1, 3, 2, 3, 4,
2, 1, 2, 1, 2, 4, 3, 2, 3, 1, 2, 3, 1, 3, 3, 2,
1, 2, 3, 1, 3, 2, 1, 2, 1, 5, 1, 5, 1, 2, 1, 2 ]);
const RESIDUES = new Uint32Array([
23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71,
73, 79, 83, 89, 97, 101, 103, 107, 109, 113, 121, 127,
131, 137, 139, 143, 149, 151, 157, 163, 167, 169, 173, 179,
181, 187, 191, 193, 197, 199, 209, 211, 221, 223, 227, 229, 233 ]);
const WHLNDXS = new Uint8Array([
0, 0, 0, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6,
7, 7, 7, 8, 9, 9, 9, 10, 10, 11, 12, 12, 12, 13, 13,
14, 14, 14, 15, 15, 15, 15, 16, 16, 17, 18, 18, 19, 20, 20,
21, 21, 21, 21, 22, 22, 22, 23, 23, 24, 24, 24, 25, 26, 26,
27, 27, 27, 28, 29, 29, 29, 30, 30, 30, 31, 31, 32, 33, 33,
34, 34, 34, 35, 36, 36, 36, 37, 37, 38, 39, 39, 40, 41, 41,
41, 41, 41, 42, 43, 43, 43, 43, 43, 44, 45, 45, 46, 47, 47, 48 ]);
const WHLRNDUPS = new Uint8Array( // two rounds to avoid overflow, used in start address calcs...
[ 0, 3, 3, 3, 4, 7, 7, 7, 9, 9, 10, 12, 12, 15, 15,
15, 18, 18, 18, 19, 22, 22, 22, 24, 24, 25, 28, 28, 28, 30,
30, 33, 33, 33, 37, 37, 37, 37, 39, 39, 40, 42, 42, 43, 45,
45, 49, 49, 49, 49, 52, 52, 52, 54, 54, 57, 57, 57, 58, 60,
60, 63, 63, 63, 64, 67, 67, 67, 70, 70, 70, 72, 72, 73, 75,
75, 78, 78, 78, 79, 82, 82, 82, 84, 84, 85, 87, 87, 88, 93,
93, 93, 93, 93, 94, 99, 99, 99, 99, 99, 100, 102, 102, 103, 105,
105, 108, 108, 108, 109, 112, 112, 112, 114, 114, 115, 117, 117, 120, 120,
120, 123, 123, 123, 124, 127, 127, 127, 129, 129, 130, 133, 133, 133, 135,
135, 138, 138, 138, 142, 142, 142, 142, 144, 144, 145, 147, 147, 148, 150,
150, 154, 154, 154, 154, 157, 157, 157, 159, 159, 162, 162, 162, 163, 165,
165, 168, 168, 168, 169, 172, 172, 172, 175, 175, 175, 177, 177, 178, 180,
180, 183, 183, 183, 184, 187, 187, 187, 189, 189, 190, 192, 192, 193, 198,
198, 198, 198, 198, 199, 204, 204, 204, 204, 204, 205, 207, 207, 208, 210, 210 ]);
const WHLSTARTS = function () {
let arr = new Array(WHLHITS);
for (let i = 0; i < WHLHITS; ++i) arr[i] = new Uint16Array(WHLHITS * WHLHITS);
for (let pi = 0; pi < WHLHITS; ++pi) {
let mltsarr = new Uint16Array(WHLHITS);
let p = RESIDUES[pi]; let i = (p - FRSTSVPRM) >> 1;
let s = ((i << 1) * (i + FRSTSVPRM) + (FRSTSVPRM * ((FRSTSVPRM - 1) >> 1))) | 0;
// build array of relative mults and offsets to `s`...
for (let ci = 0; ci < WHLHITS; ++ci) {
let rmlt = (RESIDUES[((pi + ci) % WHLHITS) | 0] - RESIDUES[pi | 0]) >> 1;
rmlt += rmlt < 0 ? WHLODDCRC : 0; let sn = s + p * rmlt;
let snd = (sn / WHLODDCRC) | 0; let snm = (sn - snd * WHLODDCRC) | 0;
mltsarr[WHLNDXS[snm]] = rmlt | 0; // new rmlts 0..209!
}
let ondx = (pi * WHLHITS) | 0;
for (let si = 0; si < WHLHITS; ++si) {
let s0 = (RESIDUES[si] - FRSTSVPRM) >> 1; let sm0 = mltsarr[si];
for (let ci = 0; ci < WHLHITS; ++ci) {
let smr = mltsarr[ci];
let rmlt = smr < sm0 ? smr + WHLODDCRC - sm0 : smr - sm0;
let sn = s0 + p * rmlt; let rofs = (sn / WHLODDCRC) | 0;
// we take the multiplier times 2 so it multiplies by the odd wheel index...
arr[ci][ondx + si] = ((rmlt << 9) | (rofs | 0)) >>> 0;
}
}
}
return arr;
}();
const PTRNLEN = (11 * 13 * 17 * 19) | 0;
const PTRNNDXDPRMS = new Int32Array([ // the wheel index plus the modulo index
(-1 << 6) + 44, (-1 << 6) + 45, (-1 << 6) + 46, (-1 << 6) + 47 ]);
function makeSieveBuffer(szbits) { // round up to 32 bit boundary!
let arr = new Array(WHLHITS); let sz = ((szbits + 31) >> 5) << 2;
for (let ri = 0; ri < WHLHITS; ++ri) arr[ri] = new Uint8Array(sz);
return arr;
}
function cullSieveBuffer(lwi, bps, prmstrts, sb) {
let len = sb[0].length; let szbits = len << 3; let bplmt = len >> 4;
let lowndx = lwi * WHLODDCRC; let nxti = (lwi + szbits) * WHLODDCRC;
// set up prmstrts for use by each modulo residue bit plane...
for (let pi = 0, bpslmt = bps.length; pi < bpslmt; ++pi) {
let ndxdprm = bps[pi] | 0;
let prmndx = ndxdprm & 0x3F; let pd = ndxdprm >> 6;
let rsd = RESIDUES[prmndx] | 0; let bp = (pd * (WHLODDCRC << 1) + rsd) | 0;
let i = (bp - FRSTSVPRM) / 2;
let s = (i + i) * (i + FRSTSVPRM) + (FRSTSVPRM * ((FRSTSVPRM - 1) / 2));
if (s >= nxti) { prmstrts[pi] = 0xFFFFFFFF >>> 0; break; } // enough base primes!
if (s >= lowndx) s = (s - lowndx) | 0;
else {
let wp = (rsd - FRSTSVPRM) >>> 1; let r = ((lowndx - s) % (WHLODDCRC * bp)) >>> 0;
s = r == 0
? 0 | 0
: (bp * (WHLRNDUPS[(wp + ((r + bp - 1) / bp) | 0) | 0] - wp) - r) | 0;
}
let sd = (s / WHLODDCRC) | 0; let sn = WHLNDXS[(s - sd * WHLODDCRC) | 0];
prmstrts[pi | 0] = ((sn << 26) | sd) >>> 0;
}
// if (szbits == 131072) return;
for (let ri = 0; ri < WHLHITS; ++ri) {
let pln = sb[ri]; let plnstrts = WHLSTARTS[ri];
for (let pi = 0, bpslmt = bps.length; pi < bpslmt; ++pi) {
let prmstrt = prmstrts[pi | 0] >>> 0; if (prmstrt == 0xFFFFFFFF) break;
let ndxdprm = bps[pi | 0] | 0;
let prmndx = ndxdprm & 0x3F; let pd = ndxdprm >> 6;
let bp = (((pd * (WHLODDCRC << 1)) | 0) + RESIDUES[prmndx]) | 0;
let sd = prmstrt & 0x3FFFFFF; let sn = prmstrt >>> 26;
let adji = (prmndx * WHLHITS + sn) | 0; let adj = plnstrts[adji];
sd += ((((adj >> 8) * pd) | 0) + (adj & 0xFF)) | 0;
if (bp < bplmt) {
for (let slmt = Math.min(szbits, sd + (bp << 3)) | 0; sd < slmt; sd += bp) {
let msk = (1 << (sd & 7)) >>> 0;
// for (let c = sd >> 3, clmt = len == 16384 ? 0 : len; c < clmt; c += bp) pln[c] |= msk;
for (let c = sd >> 3; c < len; c += bp) pln[c] |= msk;
}
}
// else for (let sdlmt = szbits == 131072 ? 0 : szbits; sd < sdlmt; sd += bp) pln[sd >> 3] |= (1 << (sd & 7)) >>> 0;
else for (; sd < szbits; sd += bp) pln[sd >> 3] |= (1 << (sd & 7)) >>> 0;
}
}
}
const WHLPTRN = function () {
let sb = makeSieveBuffer((PTRNLEN + 16384) << 3); // avoid overflow when filling!
cullSieveBuffer(0, PTRNNDXDPRMS, new Uint32Array(PTRNNDXDPRMS.length), sb);
return sb;
}();
const CLUT = function () {
let arr = new Uint8Array(65536);
for (let i = 0; i < 65536; ++i) {
let nmbts = 0|0; let v = i;
while (v > 0) { ++nmbts; v &= (v - 1)|0; }
arr[i] = nmbts|0;
}
return arr;
}();
function countSieveBuffer(bitlmt, sb) {
let lstwi = (bitlmt / WHLODDCRC) | 0;
let lstri = WHLNDXS[(bitlmt - lstwi * WHLODDCRC) | 0];
let lst = lstwi >> 5; let lstm = lstwi & 31;
let count = (lst * 32 + 32) * WHLHITS;
for (let ri = 0; ri < WHLHITS; ++ri) {
let pln = new Uint32Array(sb[ri].buffer);
for (let i = 0; i < lst; ++i) {
let v = pln[i]; count -= CLUT[v & 0xFFFF]; count -= CLUT[v >>> 16];
}
let msk = 0xFFFFFFFF << lstm; if (ri <= lstri) msk <<= 1;
let v = pln[lst] | msk; count -= CLUT[v & 0xFFFF]; count -= CLUT[v >>> 16];
}
return count;
}
function fillSieveBuffer(lwi, sb) {
let len = sb[0].length; let cpysz = len > 16384 ? 16384 : len;
let mod0 = lwi / 8;
for (let ri = 0; ri < WHLHITS; ++ri) {
for (let i = 0; i < len; i += 16384) {
let mod = ((mod0 + i) % PTRNLEN) | 0;
sb[ri].set(WHLPTRN[ri].subarray(mod, (mod + cpysz) | 0), i);
}
}
}
function doit() {
const LIMIT = Math.floor(parseFloat(document.getElementById('limit').value));
if (!Number.isInteger(LIMIT) || (LIMIT < 0) || (LIMIT > 17179869183)) {
document.getElementById('output').innerText = "Top limit must be an integer between 0 and 17179869183!";
return;
}
const CPUL1CACHE = parseInt(document.getElementById('L1').value, 10);
let startx = +Date.now();
let count = 0;
for (let i = 0; i < WHLPRMS.length; ++i) {
if (WHLPRMS[i] > LIMIT) break;
++count;
}
if (LIMIT >= FRSTSVPRM) {
const cmpsts = makeSieveBuffer(CPUL1CACHE);
const bparr = function () {
let szbits = (((((((Math.sqrt(LIMIT) | 0) - 23) >> 1) + WHLODDCRC - 1) / WHLODDCRC)
+ 31) >> 5) << 5;
let cmpsts = makeSieveBuffer(szbits); fillSieveBuffer(0, cmpsts);
let ndxdrsds = new Int32Array(2 * WHLHITS);
for (let i = 0; i < ndxdrsds.length; ++i)
ndxdrsds[i] = ((i < WHLHITS ? 0 : 64) + (i % WHLHITS)) >>> 0;
cullSieveBuffer(0, ndxdrsds, new Uint32Array(ndxdrsds.length), cmpsts);
let len = countSieveBuffer(szbits * WHLODDCRC - 1, cmpsts);
let ndxdprms = new Uint32Array(len); let j = 0;
for (let i = 0; i < szbits; ++i)
for (let ri = 0; ri < WHLHITS; ++ri)
if ((cmpsts[ri][i >> 3] & (1 << (i & 7))) == 0) {
ndxdprms[j++] = ((i << 6) + ri) >>> 0;
}
return ndxdprms;
}();
let lwilmt = (LIMIT - FRSTSVPRM) / 2 / WHLODDCRC;
let strts = new Uint32Array(bparr.length);
for (let lwi = 0; lwi <= lwilmt; lwi += CPUL1CACHE) {
let nxti = lwi + CPUL1CACHE;
fillSieveBuffer(lwi, cmpsts);
cullSieveBuffer(lwi, bparr, strts, cmpsts);
if (nxti <= lwilmt) count += countSieveBuffer(CPUL1CACHE * WHLODDCRC - 1, cmpsts);
else count += countSieveBuffer((LIMIT - FRSTSVPRM) / 2 - lwi * WHLODDCRC, cmpsts);
}
}
let elpsdx = +Date.now() - startx;
document.getElementById('output').innerText = "Found " + count + " primes up to " + LIMIT + " in " + elpsdx + " milliseconds.";
}
document.getElementById('go').onclick = function () {
document.getElementById('output').innerText = "Sieving to limit...";
setTimeout(doit, 7);
};
html,
body {
justify-content: center;
align-items: center;
text-align: center;
font-weight: bold;
font-size: 120%;
margin-bottom: 10px;
}
.title {
font-size: 200%;
}
.input {
font-size: 100%;
padding:5px 15px 5px 15px;
}
.output {
padding:7px 15px 7px 15px;
}
.doit {
font-weight: bold;
font-size: 110%;
border:3px solid black;
background:#F0E5D1;
padding:7px 15px 7px 15px;
}
Page-Segmented Sieve of Eratosthenes...
Page-Segmented Sieve of Eratosthenes
Top integer limit:
The enforced limit is zero to 17179869183, truncated downward!
CPU L1 Cache size
Refresh the page to reset to default values!
Waiting to start...
The above code only runs about three and a half times faster than the Chapter 3 "odds-only" Page Segmentation code rather than the expected four times faster due to the following reasons:
The speed-up technique using a special simplified culling loop with a fixed mask pattern is no longer as effective, as there are hardly any small culling base primes left; this increases the average number of clock cycles per composite number culling operation by about 20%. This applies more to slower languages such as JavaScript than to the more efficient machine code compiling languages, since those can use further techniques such as extreme loop unrolling to reduce the number of CPU clock cycles per culling operation to as little as about 1.25.
Although the overhead of counting the resulting primes is reduced by about a factor of two because there are fewer modulo bit planes (by about that factor), that is not the required factor of four; this is made worse in JavaScript, which does not have a means of tapping into the CPU POP_COUNT machine instruction, which would be about ten times faster than the Counting LUT (CLUT) technique used here.
While the LUT techniques used here reduce the start address calculation overhead by a factor of five or so from what it would be for the more complex modulo calculations otherwise needed, these calculations are about two and a half to three times more complex than those required for the "odds-only" sieve in Chapter 3, so the saving isn't enough to give us a proportional reduction; we would need a technique that reduces the time by a further factor of two or so in order for these calculations not to contribute to the reduced ratio. Believe me, I tried, but haven't been able to make this any better. That said, these calculations are likely to be more efficient in a more efficient computer language than JavaScript and/or on a better CPU than my very low end Atom processor, but then the composite number culling is likely to be faster as well!
Still, a three and a half times speed-up with only an increase in the number of lines of code of about 50% isn't too bad, is it? This JavaScript code is only three to five times slower (depending on the CPU, and closer in performance on higher end CPUs) when run on newer versions of node/Google Chrome browser (Chrome version 75 is still about 25% faster than Firefox version 68) than Kim Walisch's "primesieve", written in C++ and compiled to x86_64 native code!
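For reference, the counting step can also be written without a table, using the well-known parallel bit-summing trick on 32-bit words. The sketch below compares it with a 16-bit table count in the style of CLUT; the names `popcountLut` and `popcountSwar` are made up here, and no claim is made about which is faster on a given JavaScript engine, since that would need measuring.

```javascript
// 16-bit counting table, built incrementally: bits(i) = bits(i >> 1) + lowest bit.
const CLUT16 = new Uint8Array(65536);
for (let i = 1; i < 65536; ++i) CLUT16[i] = CLUT16[i >> 1] + (i & 1);

// Count set bits of a 32-bit word using two table look-ups.
function popcountLut(v) {
  return CLUT16[v & 0xFFFF] + CLUT16[v >>> 16];
}

// Count set bits of a 32-bit word by summing adjacent bit fields in parallel.
function popcountSwar(v) {
  v = v - ((v >>> 1) & 0x55555555);                // 2-bit sums
  v = (v & 0x33333333) + ((v >>> 2) & 0x33333333); // 4-bit sums
  v = (v + (v >>> 4)) & 0x0F0F0F0F;                // 8-bit sums
  return Math.imul(v, 0x01010101) >>> 24;          // add the four bytes
}

console.log(popcountLut(0xFFFFFFFF), popcountSwar(0xFFFFFFFF)); // 32 32
console.log(popcountLut(0x80000001), popcountSwar(0x80000001)); // 2 2
```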
I've added an Appendix answer that shows that one doesn't really need to write JavaScript to generate JavaScript when the project increases in size to more than a couple of hundred lines, as here. In the future, I expect such transpilers as Fable might emit WebAssembly code rather than JavaScript, and the advantage of writing in Fable now is that one will not then have to make changes (or at least few changes) to the code in order to take advantage of the newer technology, which supports faster code execution and (eventually) the multi-threading that JavaScript does not.
Coming is Chapter 4.5b, which will be about two and a half times more code but will be a Prime Generator capable of sieving to extremely large ranges, limited partly by the fact that JavaScript can only efficiently represent integers up to the 53 bits of the 64-bit floating point mantissa, or about 9e15, and partly by the amount of time one wants to wait: sieving to 1e14 on a better CPU will take on the order of a day, but that isn't much of a problem - just keep a browser tab open!
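The numeric limits involved are easy to see directly. Integers are exact up to `Number.MAX_SAFE_INTEGER`, but the bitwise operators truncate to 32 bits, so index arithmetic for large ranges has to use ordinary division and `Math.floor` rather than shifts; this may be why the snippet uses `/ 2` rather than `>> 1` when computing the larger start addresses. The variable name `big` is just for this sketch.

```javascript
console.log(Number.MAX_SAFE_INTEGER);           // 9007199254740991, about 9e15
console.log(Number.MAX_SAFE_INTEGER === 2 ** 53 - 1); // true

const big = 2 ** 32 + 5;          // just past the 32-bit range
console.log(big | 0);             // 5: bitwise OR keeps only the low 32 bits
console.log(big >> 1);            // 2: the shift truncates first as well
console.log(Math.floor(big / 2)); // 2147483650: floating point division is exact here
```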