C#: How to make Sieve of Atkin incremental

遥遥无期 2020-12-01 17:41

I don't know if this is possible or not, but I just gotta ask. My mathematical and algorithmic skills are kind of failing me here :P

The thing is I now have this cl

4 Answers
  •  予麋鹿
     2020-12-01 18:21

    The following code implements the optimizations discussed at the bottom of my previous answer and includes the following features:

    1. The usable range has been increased to the 64-bit unsigned number range of 18,446,744,073,709,551,615, with the range overflow checks removed since it is unlikely that one would want to run the program for the hundreds of years it would take to process the full range of numbers to that limit. This comes at very little cost in processing time, as the paging can be done using 32-bit page ranges and only the final prime output needs to be computed as a 64-bit number.
    2. It has increased the wheel factorization from a 2,3,5 wheel to a 2,3,5,7 prime factor wheel, with an additional pre-cull of composite numbers using the additional primes of 11, 13, and 17, to greatly reduce the redundant composite number culling (now each composite number is culled only about 1.5 times on average). Due to the (DotNet related) computational overheads of doing this (which also applied to the 2,3,5 wheel in the previous version), the actual time saving in culling isn't all that great, but enumerating the answers is somewhat faster because many of the "simple" composite numbers are skipped in the packed bit representation.
    3. It still uses the Task Parallel Library (TPL) from DotNet 4 and up for multi-threading from the thread pool on a per page basis.
    4. It now uses a base primes representation class that automatically grows its contained array, in a thread-safe way, as more base primes are required, instead of the fixed pre-computed base primes array used previously.
    5. The base primes representation has been reduced to one byte per base prime for a further reduction in memory footprint; thus, the total memory footprint, other than the code, is the array holding this base primes representation for the primes up to the square root of the current range being processed, plus the packed-bit page buffers, which are currently sized just under the 256 Kilobyte L2 cache (the smallest page size of 14,586 bytes times the CHNKSZ of 17 as supplied, worked out just below), one per CPU core plus one extra buffer for the foreground task to process. With this code, about three Megabytes is sufficient to process the prime range up to ten to the fourteenth power. As well as the speed gained from efficient multiprocessing, this reduced memory requirement is the other advantage of using a paged sieve implementation.
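
    The buffer sizes quoted in point 5 follow directly from the constants at the top of the class below; as a worked check (using WCRC = 105 candidate positions and WHTS = 48 wheel hits per rotation of the 2,3,5,7 wheel, as implied by the 48-element WHLPTRN table):

      pre-cull wheel span = 2·3·5·7·11·13·17 = 510,510, i.e. 255,255 odd-number positions per rotation
      BWHLWRDS            = 255,255 / 105 × 48 / 16 = 7,293 sixteen-bit words
      PGSZ                = (MXPGSZ / BWHLWRDS) × BWHLWRDS = (8,192 / 7,293) × 7,293 = 7,293 words = 14,586 bytes
      BFSZ                = CHNKSZ × PGSZ = 17 × 14,586 bytes = 247,962 bytes, just under the 256 Kilobyte L2 cache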

      using System;
      using System.Collections;
      using System.Collections.Generic;
      using System.Linq;
      using System.Threading;
      using System.Threading.Tasks;

      class UltimatePrimesSoE : IEnumerable<ulong> {
        static readonly uint NUMPRCSPCS = (uint)Environment.ProcessorCount + 1; const uint CHNKSZ = 17;
        const int L1CACHEPOW = 14, L1CACHESZ = (1 << L1CACHEPOW), MXPGSZ = L1CACHESZ / 2; //for buffer ushort[]
        //the 2,3,5,7 factorial wheel increment pattern, (sum) 48 elements long, starting at prime 19 position
        static readonly byte[] WHLPTRN = { 2,3,1,3,2,1,2,3,3,1,3,2,1,3,2,3,4,2,1,2,1,2,4,3,
                                           2,3,1,2,3,1,3,3,2,1,2,3,1,3,2,1,2,1,5,1,5,1,2,1 }; const uint FSTCP = 11;
        static readonly byte[] WHLPOS; static readonly byte[] WHLNDX; //to look up wheel indices from position index
        static readonly byte[] WHLRNDUP; //to look up wheel rounded-up index position values, allowing for overflow
        static readonly uint WCRC = WHLPTRN.Aggregate(0u, (acc, n) => acc + n);
        static readonly uint WHTS = (uint)WHLPTRN.Length; static readonly uint WPC = WHTS >> 4;
        static readonly byte[] BWHLPRMS = { 2, 3, 5, 7, 11, 13, 17 }; const uint FSTBP = 19;
        static readonly uint BWHLWRDS = BWHLPRMS.Aggregate(1u, (acc, p) => acc * p) / 2 / WCRC * WHTS / 16;
        static readonly uint PGSZ = MXPGSZ / BWHLWRDS * BWHLWRDS; static readonly uint PGRNG = PGSZ * 16 / WHTS * WCRC;
        static readonly uint BFSZ = CHNKSZ * PGSZ, BFRNG = CHNKSZ * PGRNG; //number of uints even number of caches in chunk
        static readonly ushort[] MCPY; //a Master Copy page used to hold the lower base primes preculled version of the page
        struct Wst { public ushort msk; public byte mlt; public byte xtr; public ushort nxt; }
        static readonly byte[] PRLUT; /*Wheel Index Look Up Table */ static readonly Wst[] WSLUT; //Wheel State Look Up Table
        static readonly byte[] CLUT; // a Counting Look Up Table for very fast counting of primes
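        // count the prime (zero) bits in buf below the limit bitlim (a position in odd-number wheel space)
        // using the CLUT table; when bitlim falls inside the buffer, the tail beyond it is first marked
        // composite so that those positions are not counted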
        static int count(uint bitlim, ushort[] buf) { //very fast counting
          if (bitlim < BFRNG) { var addr = (bitlim - 1) / WCRC; var bit = WHLNDX[bitlim - addr * WCRC] - 1; addr *= WPC;
            for (var i = 0; i < 3; ++i) buf[addr++] |= (ushort)((unchecked((ulong)-2) << bit) >> (i << 4)); }
          var acc = 0; for (uint i = 0, w = 0; i < bitlim; i += WCRC)
            acc += CLUT[buf[w++]] + CLUT[buf[w++]] + CLUT[buf[w++]]; return acc; }
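        // cull the chunk buffer b whose first position is the odd-number index lwi: copy in the MCPY page
        // pre-culled by 11, 13 and 17 for each page, then, for every base prime whose square lies below the
        // end of the chunk, mark its composites by following the WSLUT wheel state machine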
        static void cull(ulong lwi, ushort[] b) { ulong nlwi = lwi;
          for (var i = 0u; i < b.Length; nlwi += PGRNG, i += PGSZ) MCPY.CopyTo(b, i); //copy preculled lower base primes.
          for (uint i = 0, pd = 0; ; ++i) { pd += (uint)baseprms[i] >> 6;
            var wi = baseprms[i] & 0x3Fu; var wp = (uint)WHLPOS[wi]; var p = pd * WCRC + PRLUT[wi];
            var pp = (p - FSTBP) >> 1; var k = (ulong)p * (pp + ((FSTBP - 1) >> 1)) + pp;
            if (k >= nlwi) break; if (k < lwi) { k = (lwi - k) % (WCRC * p);
              if (k != 0) { var nwp = wp + (uint)((k + p - 1) / p); k = (WHLRNDUP[nwp] - wp) * p - k;
                if (nwp >= WCRC) wp = 0; else wp = nwp; } }
            else k -= lwi; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
            for (uint wrd = (uint)kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < b.Length; ) {
              var st = WSLUT[ndx]; b[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nxt; } } }
        static Task cullbf(ulong lwi, ushort[] b, Action<ushort[]> f) {
          return Task.Factory.StartNew(() => { cull(lwi, b); f(b); }); }
        class Bpa {   //very efficient auto-resizing thread-safe read-only indexer class to hold the base primes array
          byte[] sa = new byte[0]; uint lwi = 0, lpd = 0; object lck = new object();
          public uint this[uint i] { get { if (i >= this.sa.Length) lock (this.lck) {
                  var lngth = this.sa.Length; while (i >= lngth) {
                    var bf = (ushort[])MCPY.Clone(); if (lngth == 0) {
                      for (uint bi = 0, wi = 0, w = 0, msk = 0x8000, v = 0; w < bf.Length;
                          bi += WHLPTRN[wi++], wi = (wi >= WHTS) ? 0 : wi) {
                        if (msk >= 0x8000) { msk = 1; v = bf[w++]; } else msk <<= 1;
                        if ((v & msk) == 0) { var p = FSTBP + (bi + bi); var k = (p * p - FSTBP) >> 1;
                          if (k >= PGRNG) break; var pd = p / WCRC; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
                          for (uint wrd = kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < bf.Length; ) {
                            var st = WSLUT[ndx]; bf[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nxt; } } } }
                    else { this.lwi += PGRNG; cull(this.lwi, bf); }
                    var c = count(PGRNG, bf); var na = new byte[lngth + c]; sa.CopyTo(na, 0);
                    for (uint p = FSTBP + (this.lwi << 1), wi = 0, w = 0, msk = 0x8000, v = 0;
                        lngth < na.Length; p += (uint)(WHLPTRN[wi++] << 1), wi = (wi >= WHTS) ? 0 : wi) {
                      if (msk >= 0x8000) { msk = 1; v = bf[w++]; } else msk <<= 1; if ((v & msk) == 0) {
                        var pd = p / WCRC; na[lngth++] = (byte)(((pd - this.lpd) << 6) + wi); this.lpd = pd; }
                    } this.sa = na; } } return this.sa[i]; } } }
        static readonly Bpa baseprms = new Bpa();
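        // the static constructor builds the wheel look-up tables (WHLPOS, WHLNDX, WHLRNDUP, PRLUT, WSLUT),
        // the bit-counting table CLUT, and the master page MCPY pre-culled by the primes 11, 13 and 17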
        static UltimatePrimesSoE() {
          WHLPOS = new byte[WHLPTRN.Length + 1]; //to look up wheel position index from wheel index
          for (byte i = 0, acc = 0; i < WHLPTRN.Length; ++i) { acc += WHLPTRN[i]; WHLPOS[i + 1] = acc; }
          WHLNDX = new byte[WCRC + 1]; for (byte i = 1; i < WHLPOS.Length; ++i) {
            for (byte j = (byte)(WHLPOS[i - 1] + 1); j <= WHLPOS[i]; ++j) WHLNDX[j] = i; }
          WHLRNDUP = new byte[WCRC * 2]; for (byte i = 1; i < WHLRNDUP.Length; ++i) {
            if (i > WCRC) WHLRNDUP[i] = (byte)(WCRC + WHLPOS[WHLNDX[i - WCRC]]); else WHLRNDUP[i] = WHLPOS[WHLNDX[i]]; }
          Func<ushort, int> nmbts = (v) => { var acc = 0; while (v != 0) { acc += (int)v & 1; v >>= 1; } return acc; };
          CLUT = new byte[1 << 16]; for (var i = 0; i < CLUT.Length; ++i) CLUT[i] = (byte)nmbts((ushort)(i ^ -1));
          PRLUT = new byte[WHTS]; for (var i = 0; i < PRLUT.Length; ++i) {
            var t = (uint)(WHLPOS[i] * 2) + FSTBP; if (t >= WCRC) t -= WCRC; if (t >= WCRC) t -= WCRC; PRLUT[i] = (byte)t; }
          WSLUT = new Wst[WHTS * WHTS]; for (var x = 0u; x < WHTS; ++x) {
            var p = FSTBP + 2u * WHLPOS[x]; var pr = p % WCRC;
            for (uint y = 0, pos = (p * p - FSTBP) / 2; y < WHTS; ++y) {
              var m = WHLPTRN[(x + y) % WHTS];
              pos %= WCRC; var posn = WHLNDX[pos]; pos += m * pr; var nposd = pos / WCRC; var nposn = WHLNDX[pos - nposd * WCRC];
              WSLUT[x * WHTS + posn] = new Wst { msk = (ushort)(1 << (int)(posn & 0xF)), mlt = (byte)(m * WPC),
                                                 xtr = (byte)(WPC * nposd + (nposn >> 4) - (posn >> 4)),
                                                 nxt = (ushort)(WHTS * x + nposn) }; } }
          MCPY = new ushort[PGSZ]; foreach (var lp in BWHLPRMS.SkipWhile(p => p < FSTCP)) { var p = (uint)lp;
            var k = (p * p - FSTBP) >> 1; var pd = p / WCRC; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC];
            for (uint w = kd * WPC + (uint)(kn >> 4), ndx = WHLNDX[(2 * WCRC + p - FSTBP) / 2] * WHTS + kn; w < MCPY.Length; ) {
              var st = WSLUT[ndx]; MCPY[w] |= st.msk; w += st.mlt * pd + st.xtr; ndx = st.nxt; } } }
        struct PrcsSpc { public Task tsk; public ushort[] buf; }
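        // the enumerator keeps NUMPRCSPCS chunk buffers in flight: background tasks cull the chunks ahead
        // while the foreground scans the current chunk for prime (zero) bits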
        class nmrtr : IEnumerator<ulong>, IEnumerator, IDisposable {
          PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS]; ushort[] buf;
          public nmrtr() { for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s] = new PrcsSpc { buf = new ushort[BFSZ] };
            for (var s = 1u; s < NUMPRCSPCS; ++s) {
              ps[s].tsk = cullbf((s - 1u) * BFRNG, ps[s].buf, (bfr) => { }); } buf = ps[0].buf; }
          ulong _curr, i = (ulong)-WHLPTRN[WHTS - 1]; int b = -BWHLPRMS.Length - 1; uint wi = WHTS - 1; ushort v, msk = 0;
          public ulong Current { get { return this._curr; } } object IEnumerator.Current { get { return this._curr; } }
          public bool MoveNext() {
            if (b < 0) { if (b == -1) b += buf.Length; //no yield!!! so automatically comes around again
              else { this._curr = (ulong)BWHLPRMS[BWHLPRMS.Length + (++b)]; return true; } }
            do {
              i += WHLPTRN[wi++]; if (wi >= WHTS) wi = 0; if ((this.msk <<= 1) == 0) {
                if (++b >= BFSZ) { b = 0; for (var prc = 0; prc < NUMPRCSPCS - 1; ++prc) ps[prc] = ps[prc + 1];
                  ps[NUMPRCSPCS - 1u].buf = buf;
                  ps[NUMPRCSPCS - 1u].tsk = cullbf(i + (NUMPRCSPCS - 1u) * BFRNG, buf, (bfr) => { });
                  ps[0].tsk.Wait(); buf = ps[0].buf; } v = buf[b]; this.msk = 1; } }
            while ((v & msk) != 0u); _curr = FSTBP + i + i; return true; }
          public void Reset() { throw new Exception("Primes enumeration reset not implemented!!!"); }
          public void Dispose() { } }
        public IEnumerator<ulong> GetEnumerator() { return new nmrtr(); }
        IEnumerator IEnumerable.GetEnumerator() { return new nmrtr(); }
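        // pipeline helper used by CountTo and SumTo: culls successive chunks on background tasks and passes
        // each completed chunk to actn(first index, bit limit, buffer)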
        static void IterateTo(ulong top_number, Action<ulong, uint, ushort[]> actn) {
          PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS]; for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s] = new PrcsSpc {
            buf = new ushort[BFSZ], tsk = Task.Factory.StartNew(() => { }) };
          var topndx = (top_number - FSTBP) >> 1; for (ulong ndx = 0; ndx <= topndx; ) {
            ps[0].tsk.Wait(); var buf = ps[0].buf; for (var s = 0u; s < NUMPRCSPCS - 1; ++s) ps[s] = ps[s + 1];
            var lowi = ndx; var nxtndx = ndx + BFRNG; var lim = topndx < nxtndx ? (uint)(topndx - ndx + 1) : BFRNG;
            ps[NUMPRCSPCS - 1] = new PrcsSpc { buf = buf, tsk = cullbf(ndx, buf, (b) => actn(lowi, lim, b)) };
            ndx = nxtndx; } for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s].tsk.Wait(); }
        public static long CountTo(ulong top_number) {
          if (top_number < FSTBP) return BWHLPRMS.TakeWhile(p => p <= top_number).Count();
          var cnt = (long)BWHLPRMS.Length;
          IterateTo(top_number, (lowi, lim, b) => { Interlocked.Add(ref cnt, count(lim, b)); }); return cnt; }
        public static ulong SumTo(uint top_number) {
          if (top_number < FSTBP) return (ulong)BWHLPRMS.TakeWhile(p => p <= top_number).Aggregate(0u, (acc, p) => acc += p);
          var sum = (long)BWHLPRMS.Aggregate(0u, (acc, p) => acc += p);
          Func<ulong, uint, ushort[], long> sumbf = (lowi, bitlim, buf) => {
            var acc = 0L; for (uint i = 0, wi = 0, msk = 0x8000, w = 0, v = 0; i < bitlim;
                i += WHLPTRN[wi++], wi = wi >= WHTS ? 0 : wi) {
              if (msk >= 0x8000) { msk = 1; v = buf[w++]; } else msk <<= 1;
              if ((v & msk) == 0) acc += (long)(FSTBP + ((lowi + i) << 1)); } return acc; };
          IterateTo(top_number, (pos, lim, b) => { Interlocked.Add(ref sum, sumbf(pos, lim, b)); }); return (ulong)sum; }
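        // like IterateTo, but runs until prdct(first index, buffer) returns true; used by ElementAt to
        // advance chunk by chunk until the chunk containing the n'th prime is found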
        static void IterateUntil(Func<ulong, ushort[], bool> prdct) {
          PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS];
          for (var s = 0u; s < NUMPRCSPCS; ++s) { var buf = new ushort[BFSZ];
            ps[s] = new PrcsSpc { buf = buf, tsk = cullbf(s * BFRNG, buf, (bfr) => { }) }; }
          for (var ndx = 0UL; ; ndx += BFRNG) {
            ps[0].tsk.Wait(); var buf = ps[0].buf; var lowi = ndx; if (prdct(lowi, buf)) break;
            for (var s = 0u; s < NUMPRCSPCS - 1; ++s) ps[s] = ps[s + 1];
            ps[NUMPRCSPCS - 1] = new PrcsSpc { buf = buf,
                                               tsk = cullbf(ndx + NUMPRCSPCS * BFRNG, buf, (bfr) => { }) }; } }
        public static ulong ElementAt(long n) {
          if (n < BWHLPRMS.Length) return (ulong)BWHLPRMS.ElementAt((int)n);
          long cnt = BWHLPRMS.Length; var ndx = 0UL; var cycl = 0u; var bit = 0u; IterateUntil((lwi, bfr) => {
            var c = count(BFRNG, bfr); if ((cnt += c) < n) return false; ndx = lwi; cnt -= c; c = 0;
            do { var w = cycl++ * WPC; c = CLUT[bfr[w++]] + CLUT[bfr[w++]] + CLUT[bfr[w]]; cnt += c; } while (cnt < n);
            cnt -= c; var y = (--cycl) * WPC; ulong v = ((ulong)bfr[y + 2] << 32) + ((ulong)bfr[y + 1] << 16) + bfr[y];
            do { if ((v & (1UL << ((int)bit++))) == 0) ++cnt; } while (cnt <= n); --bit; return true;
          }); return FSTBP + ((ndx + cycl * WCRC + WHLPOS[bit]) << 1); } }
      

    The above code takes about 59 milliseconds to find the primes to two million (slightly slower than some of the other simpler codes due to initialization overhead), but calculates the primes to one billion and to the full number range in 1.55 and 5.95 seconds, respectively. This isn't much faster than the last version because of the extra DotNet overhead of an additional array bounds check in the enumeration of found primes: culling composite numbers takes less than a third of the time spent enumerating, so the saving in culling composites is cancelled out by the extra time (one extra array bounds check per prime candidate) in the enumeration. However, for many tasks involving primes, one does not need to enumerate all primes but can just compute the answers without enumeration.

    For the above reasons, this class provides the example static methods "CountTo", "SumTo", and "ElementAt" to count or sum the primes to a given upper limit or to output the zero-based nth prime, respectively. The "CountTo" method will produce the number of primes to one billion and in the 32-bit number range in about 0.32 and 1.29 seconds, respectively; the "ElementAt" method will produce the last element in those ranges in about 0.32 and 1.25 seconds, respectively; and the "SumTo" method produces the sum of all the primes in those ranges in about 0.49 and 1.98 seconds, respectively. This program calculates the sum of all the prime numbers to four billion plus in less time than many naive implementations can sum all the prime numbers to two million, as in Euler Problem 10, for over 2000 times the practical range!
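
    As a usage sketch (the Demo class, Main method, and argument values below are only illustrative and not part of the class itself), the static methods and the enumerator can be called like this:

      // illustrative only: assumes the UltimatePrimesSoE class above is compiled in the same project
      using System;
      using System.Linq;
      static class Demo {
        static void Main() {
          Console.WriteLine(UltimatePrimesSoE.CountTo(1000000000)); // count of primes up to one billion
          Console.WriteLine(UltimatePrimesSoE.SumTo(2000000));      // Euler Problem 10 sum
          Console.WriteLine(UltimatePrimesSoE.ElementAt(999999));   // the millionth prime (zero-based index)
          foreach (var p in new UltimatePrimesSoE().Take(10))       // direct enumeration is also available
            Console.Write(p + " ");
        }
      }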

    This code is only about four times slower than very highly optimized C code used by primesieve, and the reasons it is slower are mostly due to DotNet, as follows (discussing the case of a 256 Kilobyte buffer, which is the size of the L2 cache):

    1. Most of the execution time is spent in the main composite culling loop, which is the last "for loop" in the private static "cull" method and contains only four statements per loop plus the range check.
    2. In DotNet, this compiles to take about 21.83 CPU clock cycles per loop, including about 5 clock cycles for the two array bounds checks per loop.
    3. The very efficient C compiler converts this loop into only about 8.33 clock cycles for an advantage of about 2.67 times.
    4. Primesieve also uses extreme manual "loop unrolling" optimizations to reduce the average time to perform the work per loop to about 4.17 clock cycles per composite cull loop, for an additional gain of two times and a total gain of about 5.3 times.
    5. Now, the highly optimized C code doesn't benefit from Hyper Threading (HT) as much as the less efficient Just In Time (JIT) compiled code does, and as well the OpenMP multi-threading used by primesieve doesn't appear to be as well adapted to this problem as the use of the Thread Pool threads here, so the final multi-threaded gain is only about four times.
    6. One might consider the use of "unsafe" pointers to eliminate the array bounds checks, and it was tried, but the JIT compiler does not optimize pointers as well as normal array-based code, so the gain from not having array bounds checks is cancelled by the less efficient code (every pointer access (re)loads the pointer address from memory rather than using a register already pointing to that address as in the optimized array case); a sketch of this kind of pointer loop follows this list.
    7. Primesieve is even faster when using smaller buffer sizes, matching the size of the available L1 cache (16 Kilobytes when multi-threading for the i3/i5/i7 CPUs), as its more efficient code gains more of an advantage from reducing the average memory access time from about four clock cycles to one clock cycle; that advantage makes much less of a difference to the DotNet code, which benefits more from the reduced number of pages to process. Thus, primesieve is about five times faster when each uses its most efficient buffer size.
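
    To make point 6 concrete, here is a sketch of the kind of unsafe-pointer inner loop that was tried (the method name is only illustrative; it would have to live inside the class so it can see the Wst struct, and the project must be compiled with /unsafe). As noted above, every pointer access ends up reloading the address rather than keeping it in a register, so this form was no faster than the bounds-checked array version:

      // illustrative only: an unsafe-pointer form of the innermost cull loop from the "cull" method
      static unsafe void CullInnerUnsafe(ushort[] b, Wst[] wslut, uint wrd, uint ndx, uint pd) {
        fixed (ushort* pb = b) fixed (Wst* pw = wslut) {
          var lim = (uint)b.Length;
          while (wrd < lim) {
            var st = pw[ndx];             // current wheel state for this base prime
            pb[wrd] |= st.msk;            // mark the composite bit in this word
            wrd += st.mlt * pd + st.xtr;  // advance to the word holding the next hit
            ndx = st.nxt;                 // follow the state chain
          }
        }
      }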

    This DotNet code will count (CountTo) the number of primes to ten to the power thirteen (ten trillion) in about an hour and a half (tested) and the number of primes to a hundred trillion (ten to the fourteenth) in just over a half day (estimated) as compared to twenty minutes and under four hours for primesieve, respectively. This is interesting historically as until 1985 only the count of primes in the range to ten to the thirteenth was known since it would take too much execution time on the (expensive) supercomputers of that day to find the range ten times larger; now we can easily compute the number of primes in those ranges on a common desktop computer (in this case, an Intel i7-2700K - 3.5 GHz)!

    Using this code, it is easy to understand why Professor Atkin and Bernstein thought that the Sieve of Atkin (SoA) was faster than the Sieve of Eratosthenes (SoE) - a myth that persists to this day - with the reasoning as follows:

    1. It is easy to have any SoA implementation count the number of state toggles and square-free composite number culls (the latter of which can be optimized using the same 2,3,5 wheel optimization as is used inherently by the SoA algorithm) to determine that the total number of both of these operations is about 1.4 billion for the 32-bit number range.
    2. Bernstein's "equivalent" SoE implementation to his SoA implementation (neither of which are as optimized as this code), which uses the same 2,3,5 wheel optimization as the SoA, will have a total of about 1.82 billion cull operations with the same cull loop computational complexity.
    3. Therefore, Bernstein's result of about 30% better performance as compared to his implementation of the SoE is about right, just based on the number of equivalent operations. However, his implementation of the SoE did not take wheel factorization "to the max", since the SoA does not respond much to further degrees of wheel factorization as the 2,3,5 wheel is "baked in" to the basic algorithm.
    4. The wheel factorization optimizations used here reduce the number of composite cull operations to about 1.2 billion for the 32-bit number range; therefore, this algorithm using this degree of wheel factorization will run about 16.7% faster than an equivalent version of the SoA, since the culling loop(s) can be implemented about the same for each algorithm (the arithmetic is spelled out after this list).
    5. The SoE with this level of optimization is easier to write than an equivalent SoA, as only one state table look-up array is needed to cull across the base primes instead of an additional state look-up array for each of the four quadratic equation solution cases that produce valid state toggles.
    6. When written as an equivalent implementation to this code, SoA code will respond to C compiler optimizations just as well as the SoE used in primesieve does.
    7. The SoA implementation will also respond to the extreme manual "loop unrolling" optimization used by primesieve just as effectively as for the SoE implementation.
    8. Therefore, if I went through all the work to implement the SoA algorithm using the techniques as for the above SoE code, the resulting SoA would be only slightly slower when the output primes were enumerated, but would be about 16.7% slower when using the static direct methods.
    9. The memory footprint of the two algorithms is no different, as both require the representation of the base primes and the same number of segment page buffers.
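
    The percentages in points 3 and 4 are just the ratios of the operation counts given above:

      1.82 billion / 1.40 billion ≈ 1.30  →  Bernstein's 2,3,5-wheel SoE does about 30% more culling work than his SoA
      1.40 billion / 1.20 billion ≈ 1.17  →  this 2,3,5,7-wheel SoE does about 16.7% less culling work than the SoA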

    EDIT_ADD: Interestingly, this code runs 30% faster in x86 32-bit mode than in x64 64-bit mode, likely due to avoiding the slight extra overhead of extending the uint32 numbers to ulongs. All of the above timings are for 64-bit mode. END_EDIT_ADD

    In summary: The segment paged Sieve of Atkin is actually slower than a maximally optimized segment paged Sieve of Eratosthenes with no saving in memory requirements!!!

    I say again: "Why use the Sieve of Atkin?".
