Storing Python objects in a Python list vs. a fixed-length Numpy array

爱一瞬间的悲伤 2020-12-11 16:30

In doing some bioinformatics work, I've been pondering the ramifications of storing object instances in a Numpy array rather than a Python list, but in all the testing I'v

1 answer
  • 2020-12-11 17:02

    Don't use object arrays in numpy for things like this.

    They defeat the basic purpose of a numpy array, and while they're useful in a tiny handful of situations, they're almost always a poor choice.

    Yes, accessing an individual element of a numpy array in python or iterating through a numpy array in python is slower than the equivalent operation with a list. (Which is why you should never do something like y = [item * 2 for item in x] when x is a numpy array.)
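
    As a minimal sketch of that difference (the array `x` and the variable names here are just for illustration), the list comprehension visits each element from Python, while the vectorized form hands the whole loop to numpy:

```python
import numpy as np

x = np.arange(5)

# Python-level loop: each element is pulled out as a numpy scalar,
# multiplied, and appended to a list, one at a time.
doubled_loop = [item * 2 for item in x]

# Vectorized: a single expression, with the loop running inside numpy.
doubled_vec = x * 2

print(doubled_vec)  # [0 2 4 6 8]
```

    Both give the same values; only the second one keeps the work inside numpy and returns an array.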

    Numpy object arrays will have a slightly lower memory overhead than a list, but if you're storing that many individual python objects, you're going to run into other memory problems first.
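
    A rough way to see where that overhead comes from, assuming CPython (the `Record` class is a hypothetical stand-in for whatever objects you're storing): an object array holds one pointer per element, just as a list does, and the objects themselves live elsewhere either way.

```python
import struct
import sys

import numpy as np


class Record:
    """Hypothetical stand-in for an arbitrary Python object."""

    def __init__(self, value):
        self.value = value


n = 1000
records = [Record(i) for i in range(n)]

# Fill an object array with references to the same instances.
obj_arr = np.empty(n, dtype=object)
obj_arr[:] = records

pointer_size = struct.calcsize("P")   # size of a C pointer on this platform
arr_bytes = obj_arr.nbytes            # pointer buffer only
list_bytes = sys.getsizeof(records)   # list header plus its pointer buffer

print("bytes per element in the object array:", obj_arr.itemsize)
print("object array buffer:", arr_bytes, "bytes; list:", list_bytes, "bytes")
```

    Neither number includes the `Record` instances themselves, which is where most of the memory goes.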

    Numpy is first and foremost a memory-efficient, multidimensional array container for uniform numerical data. If you want to hold arbitrary objects in a numpy array, you probably want a list, instead.


    My point is that if you want to use numpy effectively, you may need to re-think how you're structuring things.

    Instead of storing each object instance in a numpy array, store your numerical data in a numpy array, and if you need separate objects for each row/column/whatever, store an index into that array in each instance.

    This way you can operate on the numerical arrays quickly (i.e. using numpy instead of list comprehensions).

    To show what I mean, here's a trivial example that doesn't use numpy:

    from random import random
    
    class PointSet:
        def __init__(self, numpoints):
            self.points = [Point(random(), random()) for _ in range(numpoints)]
    
        def update(self):
            """Update along a random walk."""
            # Python-level loop over every Point instance
            for point in self.points:
                point.x += random() - 0.5
                point.y += random() - 0.5
    
    class Point:
        def __init__(self, x, y):
            self.x = x
            self.y = y
    
    points = PointSet(100000)
    point = points.points[10]
    
    for _ in range(1000):
        points.update()
        print('Position of one point out of 100000:', point.x, point.y)
    

    And a similar example using numpy arrays:

    import numpy as np
    
    class PointSet:
        def __init__(self, numpoints):
            # All coordinates live in one (numpoints, 2) float array
            self.coords = np.random.random((numpoints, 2))
            self.points = [Point(i, self.coords) for i in range(numpoints)]
    
        def update(self):
            """Update along a random walk."""
            # The "+=" is crucial here: every Point holds a reference to this
            # same array, so "coords" has to be updated in-place rather than
            # rebound to a new array.
            self.coords += np.random.random(self.coords.shape) - 0.5
    
    class Point:
        def __init__(self, i, coords):
            self.i = i            # row of this point in the shared array
            self.coords = coords  # reference to the shared array, not a copy
    
        @property
        def x(self):
            return self.coords[self.i, 0]
    
        @property
        def y(self):
            return self.coords[self.i, 1]
    
    
    points = PointSet(100000)
    point = points.points[10]
    
    for _ in range(1000):
        points.update()
        print('Position of one point out of 100000:', point.x, point.y)
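
    To see why the in-place update matters, here's a small standalone sketch (the names `coords` and `alias` are only for illustration; `alias` plays the role of the reference each point holds):

```python
import numpy as np

coords = np.zeros((3, 2))
alias = coords            # a second name for the same array

coords += 1.0             # in-place: the shared buffer changes
print(alias[0, 0])        # 1.0

coords = coords + 1.0     # rebinding: builds a new array, "alias" is left behind
print(alias[0, 0])        # 1.0
print(coords[0, 0])       # 2.0
```

    After the second statement, `alias` still points at the old array, which is exactly what would happen to every point if `update` rebound `self.coords` instead of modifying it.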
    

    There are other ways to do this (you may want to avoid storing a reference to a specific numpy array in each point, for example), but I hope it's a useful example.
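
    One such alternative, as a rough sketch only (the `PointView` class and the `point` method are names made up here, not part of the example above): each point keeps a reference to its parent set and looks the array up through it, and points are created on demand instead of being stored up front.

```python
import numpy as np


class PointSet:
    def __init__(self, numpoints):
        self.coords = np.random.random((numpoints, 2))

    def point(self, i):
        # Build a lightweight view on demand instead of storing one per row.
        return PointView(self, i)

    def update(self):
        """Update along a random walk."""
        # Rebinding is fine in this design: views reach "coords" through the
        # parent every time, so they always see the current array.
        self.coords = self.coords + np.random.random(self.coords.shape) - 0.5


class PointView:
    def __init__(self, parent, i):
        self.parent = parent  # the owning PointSet, not the array itself
        self.i = i            # row of this point in parent.coords

    @property
    def x(self):
        return self.parent.coords[self.i, 0]

    @property
    def y(self):
        return self.parent.coords[self.i, 1]


points = PointSet(1000)
point = points.point(10)

for _ in range(5):
    points.update()

print('Position of one point out of 1000:', point.x, point.y)
```

    The trade-off is an extra attribute lookup on every access, in exchange for not depending on the array being modified in-place.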

    Note the difference in speed at which they run. On my machine, it's a difference of 5 seconds for the numpy version vs 60 seconds for the pure-python version.
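
    If you want to measure the gap on your own machine, something along these lines should do as a starting point. It's a simplified one-dimensional version of the update step with a smaller `n` so it finishes quickly, and the function names are made up for this sketch; the numbers you get will depend on your hardware and versions.

```python
import timeit
from random import random

import numpy as np

n = 10_000  # smaller than the 100000 points above, to keep the run short

xs = [random() for _ in range(n)]
arr = np.random.random(n)


def update_list(values):
    # Pure-Python random-walk step, one element at a time.
    for i in range(len(values)):
        values[i] += random() - 0.5


def update_array(values):
    # Same step, done in-place on the whole array at once.
    values += np.random.random(values.shape) - 0.5


t_list = timeit.timeit(lambda: update_list(xs), number=20)
t_array = timeit.timeit(lambda: update_array(arr), number=20)

print(f"list: {t_list:.4f} s, numpy: {t_array:.4f} s")
```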
