Why can itertools.groupby group the NaNs in lists but not in numpy arrays


Question


I'm having a difficult time debugging a problem in which a float nan in a list and a nan in a numpy.array are handled differently when they are passed to itertools.groupby:

Given the following list and array:

from itertools import groupby
import numpy as np

lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
arr = np.array(lst)

When I iterate over the list, the contiguous nans are grouped:

>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan, nan, nan] <class 'float'>
nan [nan] <class 'float'>

However, if I use the array, successive nans are put in different groups:

>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>

Even if I convert the array back to a list, the successive nans still land in separate groups:

>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>

I'm using:

numpy 1.11.3
python 3.5

I know that generally nan != nan, so why do these operations give different results? And how is it possible that groupby can group nans at all?


Answer 1:


Python lists are just arrays of pointers to objects in memory. In particular, the nan entries of lst are all pointers to one and the same object, np.nan:

>>> [id(x) for x in lst]
[139832272211880, # nan
 139832272211880, # nan
 139832272211880, # nan
 139832133974296,
 139832270325408,
 139832133974296,
 139832133974464,
 139832133974320,
 139832133974296,
 139832133974440,
 139832272211880, # nan
 139832133974296]

(np.nan is at 139832272211880 on my computer.)

On the other hand, NumPy arrays are just contiguous regions of memory: raw bits and bytes that NumPy interprets as a sequence of values (floats, ints, etc.).

The trouble is that when you ask Python to iterate over a NumPy array holding float values (in a for loop, or inside groupby), Python has to box those bytes into proper Python objects. It creates a brand-new Python object in memory for each value in the array as it iterates.

For example, you can see that distinct objects for each nan value are created when .tolist() is called:

>>> [id(x) for x in arr.tolist()]
[4355054616, # nan
 4355054640, # nan
 4355054664, # nan
 4355054688,
 4355054712,
 4355054736,
 4355054760,
 4355054784,
 4355054808,
 4355054832,
 4355054856, # nan
 4355054880]
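
This boxing happens on every element access, which can be seen with a minimal check (a quick demonstration using the arr defined above):

>>> arr[0] is arr[0]   # each indexing operation boxes a fresh numpy.float64
False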

itertools.groupby is able to group on np.nan for the Python list because it checks for identity first when it compares Python objects. Because these pointers to nan all point at the same np.nan object, grouping is possible.

However, iteration over the NumPy array does not allow this initial identity check to succeed, so Python falls back to checking for equality and nan != nan as you say.
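
To see the mechanism in isolation, here is a rough pure-Python sketch of the comparison groupby effectively performs on successive elements: identity first, then equality. This is a simplification for illustration, not the actual C implementation:

import numpy as np

def effectively_equal(a, b):
    # Identity short-circuits; only then is an ordinary == attempted.
    return a is b or a == b

lst = [np.nan, np.nan]   # the same float object twice
arr = np.array(lst)      # raw bytes; element access boxes fresh objects

print(effectively_equal(lst[0], lst[1]))  # True  -> one group
print(effectively_equal(arr[0], arr[1]))  # False -> separate groups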




Answer 2:


The answers of tobias_k and ajcr are correct: the nans in the list all share the same id, while they get different ids when they are "iterated over" in the NumPy array.

This answer is meant as a supplement to those answers.

>>> from itertools import groupby
>>> import numpy as np

>>> lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
>>> arr = np.array(lst)

>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274500321192 [1274500321192, 1274500321192, 1274500321192]
nan 1274500321192 [1274500321192]

>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130480 [1274537130480]
nan 1274537130504 [1274537130504]
nan 1274537130480 [1274537130480]
nan 1274537130480 [1274537130480]  # same id as before but these are not consecutive

>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130336 [1274537130336]
nan 1274537130408 [1274537130408]
nan 1274500320904 [1274500320904]
nan 1274537130168 [1274537130168]

The problem is that Python uses the PyObject_RichCompare operation when comparing values, which falls back to an object-identity test only if == is not implemented for the operands. itertools.groupby, on the other hand, uses PyObject_RichCompareBool (see source: 1, 2), which tests for object identity first, before == is tried.

This can be verified with a small cython snippet:

%load_ext cython
%%cython

from cpython.object cimport PyObject_RichCompareBool, PyObject_RichCompare, Py_EQ

def compare(a, b):
    return PyObject_RichCompare(a, b, Py_EQ), PyObject_RichCompareBool(a, b, Py_EQ)

>>> compare(np.nan, np.nan)
(False, True)
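
For readers without Cython, the identity shortcut can also be spelled out by hand in plain Python. This is only an approximation of what PyObject_RichCompareBool does internally, but it shows the same asymmetry:

>>> np.nan == np.nan                        # what PyObject_RichCompare reports
False
>>> np.nan is np.nan or np.nan == np.nan    # identity checked first
True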

The source code for PyObject_RichCompareBool reads like this:

/* Perform a rich comparison with integer result.  This wraps
   PyObject_RichCompare(), returning -1 for error, 0 for false, 1 for true. */
int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    /**********************That's the difference!****************/
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    res = PyObject_RichCompare(v, w, op);
    if (res == NULL)
        return -1;
    if (PyBool_Check(res))
        ok = (res == Py_True);
    else
        ok = PyObject_IsTrue(res);
    Py_DECREF(res);
    return ok;
}

The object identity test (if (v == w)) is indeed performed before the normal Python comparison PyObject_RichCompare(v, w, op) is invoked, and this behavior is mentioned in the documentation:

Note:

If o1 and o2 are the same object, PyObject_RichCompareBool() will always return 1 for Py_EQ and 0 for Py_NE.
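
A well-known consequence of this guarantee is that container comparisons, which compare their elements with PyObject_RichCompareBool, can report equality even when a direct comparison of the elements is False:

>>> x = float('nan')
>>> x == x          # direct comparison: nan != nan
False
>>> [x] == [x]      # list comparison hits the identity shortcut per element
True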




Answer 3:


I am not sure whether this is the reason, but I just noticed this about the nans in lst and arr:

>>> lst[0] == lst[1], arr[0] == arr[1]
(False, False)
>>> lst[0] is lst[1], arr[0] is arr[1]
(True, False)

I.e., while all nans are unequal, the plain np.nan values (of type float) are all the same instance, whereas the nans in arr are different instances (of type numpy.float64). So my guess would be that if no key function is given, groupby tests for identity before doing the more expensive equality check.

This is also consistent with the observation that it does not group the nans in arr.tolist() either, because even though those nans are float again, they are no longer the same instance.

>>> atl = arr.tolist()
>>> atl[0] is atl[1]
False
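
If the goal is simply to make groupby treat consecutive nans from a NumPy array as one group, a key function that maps every nan to one shared sentinel object restores the identity shortcut. This is just one possible workaround, sketched under the assumption that mapping nans to the singleton None is acceptable for your data:

>>> from itertools import groupby
>>> import numpy as np
>>> arr = np.array([np.nan, np.nan, np.nan, 0.16, 1, np.nan])
>>> # None is a singleton, so consecutive nan keys compare equal by identity
>>> for key, group in groupby(arr, key=lambda x: None if np.isnan(x) else x):
...     print(key, list(group))
None [nan, nan, nan]
0.16 [0.16]
1.0 [1.0]
None [nan]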


Source: https://stackoverflow.com/questions/41723419/why-can-itertools-groupby-group-the-nans-in-lists-but-not-in-numpy-arrays
