numpy float: 10x slower than builtin in arithmetic operations?

别跟我提以往 2020-12-01 05:00

I am getting really weird timings for the following code:

import numpy as np
s = 0
for i in range(10000000):
    s += np.float64(1) # replace with np.float32         
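
A minimal timeit sketch of the same comparison (absolute numbers are machine-dependent) would look something like this:

from timeit import timeit

# builtin float accumulation
py_time = timeit("s += 1.0", setup="s = 0.0", number=10000000)

# the same loop body with an np.float64 operand, as in the snippet above
np_time = timeit("s += np.float64(1)", setup="import numpy as np; s = 0", number=10000000)

print("builtin float: %.2fs   np.float64: %.2fs" % (py_time, np_time))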


        
8 Answers
  •  天涯浪人
    2020-12-01 05:50

    I can confirm the results as well. I tried to see what it would look like with all of the operands as numpy types, and the difference persists. My tests were:

    # Python 2 (xrange, as in the original timings); on Python 3 use range() instead
    from datetime import datetime
    import numpy as np

    def testStandard(length=100000):
        # every operand is a plain Python float
        s = 1.0
        addend = 8.0
        modulo = 2399232.0
        startTime = datetime.now()
        for i in xrange(length):
            s = (s + addend) * s % modulo
        return datetime.now() - startTime

    def testNumpy(length=100000):
        # identical arithmetic, but every operand is an np.float64 scalar
        s = np.float64(1.0)
        addend = np.float64(8.0)
        modulo = np.float64(2399232.0)
        startTime = datetime.now()
        for i in xrange(length):
            s = (s + addend) * s % modulo
        return datetime.now() - startTime
    

    So at this point, the numpy types are all interacting only with each other, but the 10x difference persists (about 2 sec for the numpy version vs 0.2 sec for the plain-float version).

    If I had to guess, I would say there are two possible reasons why the default float types are much faster. The first possibility is that python performs significant optimizations under the hood for certain numeric operations or for looping in general (e.g. loop unrolling). The second possibility is that the numpy types involve an extra layer of abstraction (i.e. having to read from an address). To look into the effects of each, I did a few extra checks.

    One difference could be the result of python having to take extra steps to resolve the float64 type. Unlike compiled languages that generate efficient lookup tables, python 2.6 (and maybe 3) has a significant cost for resolving things that you'd generally think of as free. Even a simple X.a lookup has to resolve the dot operator every single time it is evaluated. (This is why, if you have a loop that calls instance.function(), you're better off binding "function = instance.function" outside the loop.)
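
    A tiny sketch of that hoisting pattern (hypothetical names, just to show the shape of it):

    data = []
    append = data.append          # resolve the '.' once, outside the loop
    for i in range(1000000):
        append(i)                 # plain call each iteration, no repeated attribute lookup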

    From my understanding, when you use python's standard operators, these are fairly similar to using the ones from "import operator". If you substitute add, mul, and mod in for your +, *, and %, you see a flat performance hit of about 0.5 sec versus the standard operators (in both cases). This means that just wrapping the operators in function calls makes the standard python float case roughly 3x slower. If you go one step further and use operator.add and those variants, that adds roughly another 0.7 sec (over 1m trials, starting from 2 sec and 0.2 sec respectively). That's verging on 5x slower. So if each of these costs is paid twice, you're basically at the 10x-slower point.
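
    To see just the function-call wrapping cost in isolation, one can compare the infix operators against operator.add/mul/mod directly (a sketch; absolute numbers will vary by machine):

    from timeit import timeit

    setup = "import operator; s = 1.0; addend = 8.0; modulo = 2399232.0"

    # the expression with normal infix operators
    plain = timeit("(s + addend) * s % modulo", setup=setup, number=1000000)

    # the same expression routed through operator.* function calls
    wrapped = timeit("operator.mod(operator.mul(operator.add(s, addend), s), modulo)",
                     setup=setup, number=1000000)

    print("infix: %.2fs   operator module: %.2fs" % (plain, wrapped))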

    So let's assume we're the python interpreter for a moment. Case 1: we do an operation on native types, say a + b. Under the hood, we can check the types of a and b and dispatch our addition to python's optimized code. Case 2: we have an operation on two other types (also a + b). Under the hood, we check whether they're native types (they're not), so we fall through to the 'else' case. The else case sends us to something like a.__add__(b), which can then dispatch to numpy's optimized code. So at this point we have paid for an extra branch, one '.' attribute/slot lookup, and a function call, and we've only just entered the addition. We then have to use the result to create a new float64 (or alter an existing one). Meanwhile, the python native code probably cheats by treating its types specially to avoid this sort of overhead.
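
    A rough illustration of that fallback path, with the special-method dispatch written out explicitly:

    import numpy as np

    a = np.float64(1.0)
    b = np.float64(2.0)

    # roughly what 'a + b' funnels through when the operands aren't plain floats
    result = a.__add__(b)
    print(type(result))           # numpy.float64 -- a freshly boxed scalar each time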

    Based on the above examination of the costliness of python function calls and scoping overhead, it would be pretty easy for numpy to incur a 9x penalty just getting to and from its C math functions. I can entirely imagine this process taking many times longer than the simple math operation itself. For each operation, the numpy library has to wade through layers of python to get to its C implementation.

    So in my opinion, the reason for this is probably captured in this effect:

    # Python 2 syntax (xrange and print statements), matching the timings above
    from datetime import datetime

    length = 10000000

    class A():
        X = 10

    # "long way": resolve A.X inside the loop on every iteration
    startTime = datetime.now()
    for i in xrange(length):
        x = A.X
    print "Long Way", datetime.now() - startTime

    # "short way": resolve A.X once, outside the loop
    startTime = datetime.now()
    y = A.X
    for i in xrange(length):
        x = y
    print "Short Way", datetime.now() - startTime
    

    This simple case shows a difference of 0.2 sec vs 0.14 sec (short way faster, obviously). I think what you're seeing is mainly just a bunch of those issues adding up.

    To avoid this, I can think of a couple of possible solutions that mainly echo what has been said. The first is to keep your evaluations inside NumPy as much as possible, as Selinap said. A large part of the loss is probably due to the interfacing. I would look into ways to dispatch your work to numpy or some other numeric library optimized in C (gmpy has been mentioned). The goal should be to push as much into C at once as possible and then get the result(s) back. You want to hand it big jobs, not lots of small jobs.
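
    As a concrete sketch of that "one big job" idea applied to the original example (assuming the goal is just the repeated addition):

    import numpy as np

    # one call into numpy's C loop instead of 10 million interpreted scalar additions
    values = np.ones(10000000, dtype=np.float64)
    s = values.sum()              # equivalent to adding np.float64(1) ten million times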

    The second solution, of course, is to do more of your intermediate and small operations in plain python if you can. Clearly, using the native objects is going to be faster: they're the first option in all the branch statements and will always have the shortest path to C code. Unless you have a specific need for fixed-precision calculation or other issues with the default operators, I don't see why one wouldn't use the straight python functions for many things.
