Efficiently computing the first 20-digit substring to repeat in the decimal expansion of Pi

Asked by 天涯浪人, 2020-12-15 23:52

Problem

Pi = 3.14159 26 5358979323846 26 433... so the first 2-digit substring to repeat is 26.

What is an efficient way of finding the first 20-digit substring to repeat?
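
For the toy 2-digit case, the repeat can be found by brute force; a minimal Python sketch (the `first_repeat` helper and the inline digit string are illustrative, not part of the question):

```python
def first_repeat(digits, k):
    """Return the first k-digit substring of `digits` that occurs twice."""
    seen = set()
    for i in range(len(digits) - k + 1):
        window = digits[i:i + k]
        if window in seen:
            return window
        seen.add(window)
    return None

# Digits of Pi after the decimal point, enough to show the 2-digit case.
print(first_repeat("141592653589793238462643383279", 2))  # prints 26
```

The same loop with k = 20 is hopeless at the scale of this question: the set of seen windows would dwarf available RAM, which is what motivates the divide-and-conquer answer below.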

4 Answers
  •  -上瘾入骨i
    2020-12-16 00:12

    Your data set is pretty big, so some sort of "divide and conquer" will be necessary. I would suggest that as a first step, you subdivide the problem into some number of pieces (e.g. 100). Start by seeing whether the file has any duplicated 20-digit sequences starting with 00, then whether it has any starting with 01, and so on up to 99. Start each of these "main passes" by writing out to a file all of the 20-digit sequences that start with the chosen prefix. Since the first two digits are constant, you only need to write out the last 18; an 18-digit decimal number fits in an 8-byte 'long', so the output file will probably hold about 5,000,000,000 numbers, taking up 40GB of disk space. Note that it may be worthwhile to produce more than one output file at a time, so as to avoid having to read every byte of the source file 100 times, but disk performance may be better if you are simply reading one file and writing one file.
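
    A sketch of one such main pass, assuming the digits are stored as one long run of ASCII characters in a file (the file name and chunk size are made up for illustration):

    ```python
    import struct

    DIGITS_FILE = "pi_digits.txt"   # assumed layout: a single run of ASCII digits
    WINDOW = 20
    CHUNK = 1 << 20                 # read 1 MiB of digits at a time

    def main_pass(prefix):
        """One "main pass": write the trailing 18 digits of every 20-digit
        window starting with `prefix` ("00".."99") to a bucket file, each
        packed as an 8-byte unsigned int (18 decimal digits fit in 64 bits)."""
        with open(DIGITS_FILE, "r") as src, open(f"bucket_{prefix}.bin", "wb") as out:
            carry = ""                           # overlap so windows can span chunk edges
            while chunk := src.read(CHUNK):
                buf = carry + chunk
                for i in range(len(buf) - WINDOW + 1):
                    if buf[i] == prefix[0] and buf[i + 1] == prefix[1]:
                        out.write(struct.pack(">Q", int(buf[i + 2:i + WINDOW])))
                carry = buf[-(WINDOW - 1):]      # keep the last 19 digits for the next chunk
    ```

    As the paragraph above notes, handling several prefixes per scan of the source would amortize the 100 reads; the single-prefix version keeps the sketch short.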

    Once one has generated the data file for a particular "main pass", one must then determine whether there are any duplicates in it. Subdividing it into some number of smaller sections based upon the bits in the numbers stored therein may be a good next step. If one subdivides it into 256 smaller sections, each section will have somewhere around 16-32 million numbers; the five gigabytes of RAM one has could be used to buffer a million numbers for each of the 256 buckets. Writing out each chunk of a million numbers would require a random disk seek, but the number of such writes would be pretty reasonable (probably about 10,000 disk seeks).
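
    One way the 256-way split might look, assuming the bucket files from the previous step (the low byte of each value is used as the split key here, since the high bits of an 18-digit number are not uniformly distributed):

    ```python
    import struct

    BUF_LIMIT = 1_000_000   # ~1M numbers buffered per section: 256 * 8 MB = ~2 GB of RAM

    def subdivide(prefix):
        """Split bucket_<prefix>.bin into 256 section files keyed on the
        low 8 bits of each number; equal values always land in the same
        section, so duplicates can later be found section by section."""
        outs = [open(f"section_{prefix}_{s:02x}.bin", "wb") for s in range(256)]
        bufs = [bytearray() for _ in range(256)]
        with open(f"bucket_{prefix}.bin", "rb") as src:
            while block := src.read(8 * 65536):
                for (n,) in struct.iter_unpack(">Q", block):
                    s = n & 0xFF
                    bufs[s] += struct.pack(">Q", n)
                    if len(bufs[s]) >= 8 * BUF_LIMIT:
                        outs[s].write(bufs[s])      # one large buffered write per seek
                        bufs[s].clear()
        for s in range(256):                        # flush remainders and close
            outs[s].write(bufs[s])
            outs[s].close()
    ```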

    Once the data has been subdivided into files of 16-32 million numbers each, simply read each such file into memory and look for duplicates.
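
    A sketch of the final in-memory scan over one section file (sorting makes equal values adjacent; the file layout matches the sketches above):

    ```python
    import sys
    from array import array

    def find_duplicates(path):
        """Load one section file (roughly 16-32 million numbers) and return
        the 18-digit tails that occur more than once within it."""
        nums = array("Q")
        with open(path, "rb") as f:
            nums.frombytes(f.read())
        if sys.byteorder == "little":
            nums.byteswap()              # the section files were written big-endian
        nums = sorted(nums)              # duplicates become adjacent after sorting
        return sorted({nums[i] for i in range(1, len(nums)) if nums[i] == nums[i - 1]})
    ```

    Note that a duplicated tail t in a section of bucket p means the 20-digit string formed by p followed by t (zero-padded to 18 digits) occurs at least twice; recovering which occurrence comes first would additionally require carrying positions through the pipeline, which the answer (and this sketch) leaves out.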

    The algorithm as described probably isn't optimal, but it should be reasonably close. Of greatest interest is the trade-off: halving the number of main passes halves the number of times the source file must be read, but more than doubles the time required to process each pass once its data has been copied. I would guess that 100 passes through the source file isn't the optimal split factor, but the overall time with that split would be pretty close to the time with the optimal one.
