Given a string of a million numbers, return all repeating 3 digit numbers

误落风尘 2020-12-22 15:41

I had an interview with a hedge fund company in New York a few months ago and unfortunately, I did not get the internship offer as a data/software engineer. (They also asked

13 Answers
  •  没有蜡笔的小新
    2020-12-22 16:11

    Here is a NumPy implementation of the "consensus" O(n) algorithm: walk through all triplets and bin as you go. Binning means that on encountering, say, "385", we add one to bin[3, 8, 5], which is an O(1) operation. The bins are arranged in a 10x10x10 cube. Because the binning is fully vectorized, there is no explicit loop in the code.

    def setup_data(n):
        import random
        digits = "0123456789"
        return dict(text = ''.join(random.choice(digits) for i in range(n)))
    
    def f_np(text):
        # Get the data into NumPy
        import numpy as np
        a = np.frombuffer(bytes(text, 'utf8'), dtype=np.uint8) - ord('0')
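        # Rolling triplets: row k of a3 is the digit sequence shifted by k,
        # so column i holds the digits at positions i, i+1, i+2.
        # 2*a.strides repeats the 1-byte stride tuple, giving strides (1, 1).
        a3 = np.lib.stride_tricks.as_strided(a, (3, a.size-2), 2*a.strides)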
    
        bins = np.zeros((10, 10, 10), dtype=int)
        # Next line performs O(n) binning
        np.add.at(bins, tuple(a3), 1)
        # Filtering is left as an exercise
        return bins.ravel()
    
    def f_py(text):
        counts = [0] * 1000
        for idx in range(len(text)-2):
            counts[int(text[idx:idx+3])] += 1
        return counts
    
    import numpy as np
    import types
    from timeit import timeit
    for n in (10, 1000, 1000000):
        data = setup_data(n)
        ref = f_np(**data)
        print(f'n = {n}')
        for name, func in list(globals().items()):
            if not name.startswith('f_') or not isinstance(func, types.FunctionType):
                continue
            try:
                assert np.all(ref == func(**data))
                # timeit returns total seconds for 10 runs; *100 gives ms per run
                print("{:16s}{:16.8f} ms".format(name[2:], timeit(
                    'f(**data)', globals={'f':func, 'data':data}, number=10)*100))
            except Exception:
                print("{:16s} apparently crashed".format(name[2:]))
    

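    Unsurprisingly, NumPy is faster than @Daniel's pure Python solution on large data sets: roughly 3x at n = 1000 and over 4x at n = 1000000 in the run below. At n = 10 it is slower, presumably because fixed setup overhead dominates. Sample output: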

    # n = 10
    # np                    0.03481400 ms
    # py                    0.00669330 ms
    # n = 1000
    # np                    0.11215360 ms
    # py                    0.34836530 ms
    # n = 1000000
    # np                   82.46765980 ms
    # py                  360.51235450 ms
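
The filtering step that f_np leaves as an exercise could be sketched as follows. This is only a minimal illustration, not part of the original answer: the helper name repeated_triplets is hypothetical, and it assumes the input is the flat, length-1000 count array returned by f_np or f_py, where index i stands for the zero-padded triplet f"{i:03d}".

```python
import numpy as np

def repeated_triplets(counts):
    """Map each 3-digit string seen more than once to its count.

    Hypothetical helper: expects the flat length-1000 array of counts
    produced by f_np (bins.ravel()) or f_py.
    """
    counts = np.asarray(counts)
    # The 10x10x10 cube is raveled in C order, so flat index i
    # corresponds to the zero-padded triplet f"{i:03d}".
    repeated = np.flatnonzero(counts > 1)
    return {f"{i:03d}": int(counts[i]) for i in repeated}

# Small usage example, counting the same way f_py does
text = "123412345123"
counts = [0] * 1000
for idx in range(len(text) - 2):
    counts[int(text[idx:idx + 3])] += 1
print(repeated_triplets(counts))  # {'123': 3, '234': 2}
```

Since there are only 1000 bins, this filtering pass costs constant time regardless of n, so the overall algorithm stays O(n).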
    
