Efficiently computing the first 20-digit substring to repeat in the decimal expansion of Pi


Problem

Pi = 3.14159 26 5358979323846 26 433... so the first 2-digit substring to repeat is 26.

What is an efficient way of finding the first 20-digit substring to repeat in the decimal expansion of Pi?

4 Answers
  • 2020-12-16 00:10

    Trie

    RBarryYoung has pointed out that this will exceed the memory limits.

    A trie data structure might be appropriate. In a single pass you can build up a trie with every prefix you've seen up to length n (e.g., n = 20). As you continue to process, if you ever reach a node at level n that already exists, you've just found a duplicate substring.
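
    For concreteness, here is a minimal sketch of such a digit trie, assuming the digits are available as an in-memory ASCII string (the names and the fixed depth of 20 are mine; as noted above, memory is the limiting factor at this scale):

    #include <stdlib.h>

    #define DEPTH 20

    /* One trie node per prefix seen so far; one child per decimal digit. */
    typedef struct node { struct node *child[10]; } node_t;

    /* Insert the DEPTH-digit window starting at p. Returns 1 if that exact
       window was already present (a repeat), 0 otherwise. */
    static int insert_window(node_t *root, const char *p) {
      node_t *cur = root;
      int existed = 1;  /* stays 1 only if every node on the path already existed */
      for (int i = 0; i < DEPTH; ++i) {
        int d = p[i] - '0';
        if (cur->child[d] == NULL) {
          cur->child[d] = calloc(1, sizeof(node_t));
          existed = 0;
        }
        cur = cur->child[d];
      }
      return existed;
    }

    /* Scan the windows in order; the first one already present is the first repeat. */
    static long first_repeat(const char *digits, long ndigits) {
      node_t *root = calloc(1, sizeof(node_t));
      for (long i = 0; i + DEPTH <= ndigits; ++i)
        if (insert_window(root, digits + i))
          return i;  /* the second occurrence of some 20-digit window starts here */
      return -1;     /* no repeat within this many digits */
    }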

    Suffix Matching

    Another approach treats the expansion as a character string. This approach finds common suffixes, but you want common prefixes, so start by reversing the string. Create an array of pointers, one per digit position, each pointing to the suffix starting at that digit. Then sort the pointers lexicographically by the strings they point to. In C this would be something like the call below (note that qsort hands the comparator pointers to the array elements, i.e. char **, so strcmp cannot be passed directly; a sketch of the comparator and of building the array follows further down):

    qsort(array, number_of_digits, sizeof(array[0]), compare_suffixes);
    

    When the qsort finishes, similar substrings will be adjacent in the pointer array. So for every pointer, you can do a limited string comparison with that string and the one pointed to by the next pointer. Again, in C:

    for (int i = 1; i < number_of_digits; ++i) {
      if (strncmp(array[i - 1], array[i], 20) == 0) {
        // found two substrings that match for at least 20 digits
        // the pointers point to the last digits in the common substrings
      }
    }
    

    The sort is (typically) O(n log_2 n), and the search afterwards is O(n).
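
    For completeness, here is a hedged sketch of the setup the snippets above assume: the comparator wrapper and the pointer array itself (the name digits is made up; array, number_of_digits, and compare_suffixes match the snippets above):

    #include <stdlib.h>
    #include <string.h>

    /* qsort hands the comparator pointers to the array elements (char **),
       so strcmp cannot be passed to it directly: */
    static int compare_suffixes(const void *a, const void *b) {
      return strcmp(*(const char **)a, *(const char **)b);
    }

    /* Build one pointer per digit position: array[i] points at the suffix
       starting at digit i of the (reversed) expansion held in digits. */
    static const char **build_suffix_pointers(const char *digits, size_t number_of_digits) {
      const char **array = malloc(number_of_digits * sizeof *array);
      for (size_t i = 0; i < number_of_digits; ++i)
        array[i] = digits + i;
      return array;
    }

    The array returned here is what the qsort call and the strncmp loop above operate on.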

    This approach was inspired by this article.

  • 2020-12-16 00:12

    Your data set is pretty big, so some sort of "divide and conquer" will be necessary. I would suggest that as a first step, you subdivide the problem into some number of pieces (e.g. 100). Start by seeing if the file has any duplicated 20-digit sequences that start with 00, then see if it has any starting with 01, and so on up to 99. Start each of these "main passes" by writing out to a file all of the 20-digit sequences that start with the correct prefix. If the first two digits are constant, you only need to write out the last 18 digits; since an 18-digit decimal number fits in an 8-byte 'long', the output file will probably hold about 5,000,000,000 numbers, taking up 40 GB of disk space. Note that it may be worthwhile to produce more than one output file at a time, so as to avoid having to read every byte of the source file 100 times, but disk performance may be better if you are simply reading one file and writing one file.
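
    As a small illustration of that packing step (a hedged sketch; the function name is made up and the 18 digits are assumed to arrive as ASCII characters):

    #include <stdint.h>

    /* Pack 18 ASCII decimal digits into one 64-bit integer.
       The largest value, 10^18 - 1, comfortably fits in 64 bits. */
    static uint64_t pack18(const char *p) {
      uint64_t v = 0;
      for (int i = 0; i < 18; ++i)
        v = v * 10 + (uint64_t)(p[i] - '0');
      return v;
    }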

    Once one has generated the data file for a particular "main pass", one must then determine whether there are any duplicates in it. Subdividing it into some number of smaller sections based upon the bits in the numbers stored therein may be a good next step. If one subdivides it into 256 smaller sections, each section will have somewhere around 16-32 million numbers; the five gigabytes of RAM one has could be used to buffer a million numbers for each of the 256 buckets. Writing out each chunk of a million numbers would require a random disk seek, but the number of such writes would be pretty reasonable (probably about 10,000 disk seeks).

    Once the data has been subdivided into files of 16-32 million numbers each, simply read each such file into memory and look for duplicates.
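
    A minimal sketch of that last step, assuming one such file has been loaded into an array of packed 64-bit values (names are made up):

    #include <stdint.h>
    #include <stdlib.h>

    static int cmp_u64(const void *a, const void *b) {
      uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
      return (x > y) - (x < y);
    }

    /* Returns 1 if any packed value occurs at least twice in vals[0..n-1],
       i.e. the bucket contains a duplicated 20-digit sequence. */
    static int has_duplicate(uint64_t *vals, size_t n) {
      qsort(vals, n, sizeof *vals, cmp_u64);
      for (size_t i = 1; i < n; ++i)
        if (vals[i] == vals[i - 1])
          return 1;
      return 0;
    }

    In practice you would probably carry the position of each sequence alongside the packed value, so that a duplicate can be reported as a position in Pi rather than just a yes/no.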

    The algorithm as described probably isn't optimal, but it should be reasonably close. Of greatest interest is the fact that cutting in half the number of main passes would cut in half the number of times one had to read through the source data file, but would more than double the time required to process each pass once its data had been copied. I would guess that using 100 passes through the source file probably isn't optimal, but the time required for the overall process using that split factor would be pretty close to the time using the optimal split factor.

  • 2020-12-16 00:21

    This is an interesting problem.

    First let's do some back-of-the-envelope numbers. Any particular sequence of 20 digits will match one time in 10^20. If we go out to the n'th digit, we have roughly n^2/2 pairs of 20-digit sequences. So to have good odds of finding a match, we're probably going to need n to be a bit above 10^10. Assuming that we're taking 40 bytes per record, we're going to need something on the order of 400 GB of data. (We actually need more data than this, so we should be prepared for something over a terabyte of data.)
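
    Spelled out, this is a birthday-problem estimate:

    \Pr[\text{some 20-digit window repeats}] \;\approx\; \frac{n^2/2}{10^{20}},
    \qquad
    \frac{n^2}{2} \approx 10^{20} \;\Rightarrow\; n \approx \sqrt{2}\cdot 10^{10} \approx 1.4\times 10^{10}.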

    That gives us an idea of the needed data volume. Tens of billions of digits. Hundreds of GB of data.

    Now here is the problem. If we use any data structure that requires random access, random access time is set by the disk speed. Suppose that your disk spins at 6000 rpm. That's 100 revolutions per second. On average the data you want is halfway around the disk, so you get about 200 random accesses per second. (This can vary by hardware.) Accessing the data 10 billion times is going to take 50 million seconds, which is over a year. If you read, then write, and wind up needing 20 billion data points, you're exceeding the projected lifetime of your hard drive.

    The alternative is to process a batch of data in a way where you do not access it randomly. The classic approach is a good external sort, such as a merge sort. Suppose that we have 1 terabyte of data, which we read 30 times and write 30 times during sorting. (Both estimates are higher than needed, but I'm painting a worst case here.) Suppose our hard drive has a sustained throughput of 100 MB/s. Then each pass takes 10,000 seconds, for 600,000 seconds in total, which is slightly under a week. This is very doable! (In practice it should be faster than this.)

    So here is the algorithm (a rough code sketch follows the list):

    1. Start with a long list of digits, 3141...
    2. Turn this into a much bigger file where each line is 20 digits, followed by the location where this appears in pi.
    3. Sort this bigger file.
    4. Search the sorted file for any duplications.
      1. If any are found, return the first.
      2. If none are found, repeat steps 1-3 with another big chunk of digits.
      3. Merge this into the previous sorted file.
      4. Repeat this search.
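
    Here is a rough in-memory sketch of steps 2-4, with all names my own, assuming a chunk of digits small enough to sort in RAM; real use would swap the qsort call for the external merge sort discussed above and keep the records in files:

    #include <stdlib.h>
    #include <string.h>

    /* Step 2: one record per position, the 20-digit window plus where it starts. */
    typedef struct { char key[21]; long pos; } rec_t;

    static int cmp_rec(const void *a, const void *b) {
      return strcmp(((const rec_t *)a)->key, ((const rec_t *)b)->key);
    }

    /* Returns the position at which the earliest repeat of a 20-digit
       substring completes, or -1 if this chunk contains no repeat. */
    static long find_repeat_in_chunk(const char *digits, long ndigits) {
      if (ndigits < 20) return -1;
      long nrec = ndigits - 19;
      rec_t *recs = malloc((size_t)nrec * sizeof *recs);
      for (long i = 0; i < nrec; ++i) {                  /* step 2: build records */
        memcpy(recs[i].key, digits + i, 20);
        recs[i].key[20] = '\0';
        recs[i].pos = i;
      }
      qsort(recs, (size_t)nrec, sizeof *recs, cmp_rec);  /* step 3: sort */

      long best = -1;                                    /* step 4: scan neighbours */
      for (long i = 1; i < nrec; ++i) {
        if (strcmp(recs[i - 1].key, recs[i].key) == 0) {
          long later = recs[i - 1].pos > recs[i].pos ? recs[i - 1].pos : recs[i].pos;
          if (best < 0 || later < best)
            best = later;                                /* earliest second occurrence */
        }
      }
      free(recs);
      return best;
    }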

    Now this is great, but what if we don't want to take a week? What if we want to throw multiple machines at it? This turns out to be surprisingly easy. There are well-known distributed sorting algorithms. If we split the initial file into chunks, we can parallelize both steps 1 and 4. And if after step 4 we don't find a match, then we can just repeat from the start with a bigger input chunk.

    In fact this pattern is very common. All that really varies is how you turn the initial data into things to be sorted, and how you then look at the matching groups. This is the MapReduce pattern (http://en.wikipedia.org/wiki/MapReduce), and it will work just fine for this problem.

  • 2020-12-16 00:26

    Perhaps something like this will work:

    1. Search for repeated substrings of length 2 (or some small base case); record the starting indices S = {s_i}.

    2. For n = 3..N, look for substrings of length n starting at the indices in S.

    3. On each iteration, update S to hold the starting indices of the repeated substrings of length n.

    4. At n = 20, the first two indices will be your answer.

    You might want to adjust the initial size and the step size (it might not be necessary to step by 1 each time). A rough in-memory sketch of this refinement loop is given below.
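
    The sketch below is my own rendering of that idea (all names are made up); it filters the candidate set by sorting it at each length, which is just one way of carrying out step 2. The two globals exist only because qsort's comparator takes no context argument.

    #include <stdlib.h>
    #include <string.h>

    static const char *g_digits;   /* the digit string, shared with the comparator */
    static size_t g_len;           /* current substring length being compared */

    static int cmp_by_prefix(const void *a, const void *b) {
      size_t i = *(const size_t *)a, j = *(const size_t *)b;
      return strncmp(g_digits + i, g_digits + j, g_len);
    }

    /* Returns a start index of some repeated 20-digit substring, or (size_t)-1. */
    static size_t refine_to_repeat(const char *digits, size_t ndigits) {
      g_digits = digits;
      size_t *S = malloc(ndigits * sizeof *S);
      size_t count = 0;
      for (size_t i = 0; i + 20 <= ndigits; ++i)  /* initially every start index */
        S[count++] = i;

      for (g_len = 2; g_len <= 20; ++g_len) {     /* grow the matched length */
        qsort(S, count, sizeof *S, cmp_by_prefix);
        size_t kept = 0;
        for (size_t i = 0; i < count; ++i) {      /* keep indices whose g_len-digit
                                                     substring appears at least twice */
          int dup = (i > 0 &&
                     strncmp(digits + S[i], digits + S[i - 1], g_len) == 0) ||
                    (i + 1 < count &&
                     strncmp(digits + S[i], digits + S[i + 1], g_len) == 0);
          if (dup)
            S[kept++] = S[i];
        }
        count = kept;
        if (count < 2) { free(S); return (size_t)-1; }
      }
      size_t hit = S[0];                          /* any survivor starts a repeated
                                                     20-digit substring */
      free(S);
      return hit;
    }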
