Unexpected Behavior of itertools.groupby

血红的双手。 提交于 2019-12-03 15:25:10

itertools.groupby documentation tells that

itertools.groupby(iterable, key=None)

[...]

The operation of groupby() is similar to the uniq filter in Unix. It generates a break or new group every time the value of the key function changes (which is why it is usually necessary to have sorted the data using the same key function). That behavior differs from SQL’s GROUP BY which aggregates common elements regardless of their input order.

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the `groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list [--]

So the assumption from the last paragraph is that that the generated list would be the empty list [], since the iterator advanced already, and met StopIteration; but instead in CPython the result is surprising [9].


This is because the _grouper iterator lags one item behind the original iterator, which is because groupby needs to peek one item ahead to see if it belongs to the current or the next group, yet it must be able to later yield this item as the first item of the new group.

However the currkey and currvalue attributes of the groupby are not reset when the original iterator is exhausted, so currvalue still points to the last item from the iterator.

The CPython documentation actually contains this equivalent code, that also has the exact same behaviour as the C version code:

class groupby:
    # [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
    # [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
    def __init__(self, iterable, key=None):
        if key is None:
            key = lambda x: x
        self.keyfunc = key
        self.it = iter(iterable)
        self.tgtkey = self.currkey = self.currvalue = object()
    def __iter__(self):
        return self
    def __next__(self):
        while self.currkey == self.tgtkey:
            self.currvalue = next(self.it)    # Exit on StopIteration
            self.currkey = self.keyfunc(self.currvalue)
        self.tgtkey = self.currkey
        return (self.currkey, self._grouper(self.tgtkey))
    def _grouper(self, tgtkey):
        while self.currkey == tgtkey:
            yield self.currvalue
            try:
                self.currvalue = next(self.it)
            except StopIteration:
                return
            self.currkey = self.keyfunc(self.currvalue)

Notably the __next__ finds the first item of the next group, and stores it its key into self.currkey and its value to self.currvalue. But the key is the line

self.currvalue = next(self.it)    # Exit on StopIteration

When next throws StopItertion the self.currvalue still contains the last key of the previous group. Now, when y[1] is made into a list, it first yields the value of self.currvalue, and only then runs next() on the underlying iterator (and meets StopIteration again).


Even though there is Python equivalent in the documentation, that behaves exactly like the authoritative C code implementation in CPython, IronPython, Jython and PyPy give different results.

The problem is that you group all of them into one group so after the first next call everything is already grouped:

import itertools
x = itertools.groupby(range(10), lambda x: True)
key, elements = next(x)

but the elements are a generator, so you need to pass it immediatly into some structure taking an iterable to "print" or "save" it, i.e. a list:

print('Key: "{}" with value "{}"'.format(key, list(elements)))

and then your range(10) is empty and the groupy-generator is finished:

Key: True with value [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!