Why is checking isinstance(something, Mapping) so slow?

问题

I recently compared the performance of collections.Counter to sorted for comparison checks (if some iterable contains the same elements with the same amount) and while the big-iterable performance of Counter is generally better than sorted it's much slower for short iterables.

Using line_profiler the bottleneck seems to be the isinstance(iterable, collections.Mapping)-check in Counter.update:

%load_ext line_profiler  # IPython
lst = list(range(1000))
%lprun -f Counter.update Counter(lst)

gives me:

Timer unit: 5.58547e-07 s

Total time: 0.000244643 s
File: ...\lib\collections\__init__.py
Function: update at line 581

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   581                                               def update(*args, **kwds):
   601         1            8      8.0      1.8          if not args:
   602                                                       raise TypeError("descriptor 'update' of 'Counter' object "
   603                                                                       "needs an argument")
   604         1           12     12.0      2.7          self, *args = args
   605         1            6      6.0      1.4          if len(args) > 1:
   606                                                       raise TypeError('expected at most 1 arguments, got %d' % len(args))
   607         1            5      5.0      1.1          iterable = args[0] if args else None
   608         1            4      4.0      0.9          if iterable is not None:
   609         1           72     72.0     16.4              if isinstance(iterable, Mapping):
   610                                                           if self:
   611                                                               self_get = self.get
   612                                                               for elem, count in iterable.items():
   613                                                                   self[elem] = count + self_get(elem, 0)
   614                                                           else:
   615                                                               super(Counter, self).update(iterable) # fast path when counter is empty
   616                                                       else:
   617         1          326    326.0     74.4                  _count_elements(self, iterable)
   618         1            5      5.0      1.1          if kwds:
   619                                                       self.update(kwds)

So even for length 1000 iterables it takes more than 15% of the time. For even shorter iterables (for example 20 items it increases to 60%).

I first thought it has something to do with how collections.Mapping uses __subclasshook__ but that method isn't called after the first isinstance-check anymore. So why is checking isinstance(iterable, Mapping) so slow?

回答1:

The performance is really just tied to a collection of checks in ABCMeta's __instancecheck__, which is called by isinstance.

The bottom line is that the poor performance witnessed here isn't a result of some missing optimization, but rather just a result of isinstance with abstract base classes being a Python-level operation, as mentioned by Jim. Positive and negative results are cached, but even with cached results you're looking at a few microseconds per loop simply to traverse the conditionals in the __instancecheck__ method of the ABCMeta class.

An example

Consider some different empty structures.

>>> d = dict; l = list(); s = pd.Series()

>>> %timeit isinstance(d, collections.abc.Mapping)
100000 loops, best of 3: 1.99 µs per loop

>>> %timeit isinstance(l, collections.abc.Mapping)
100000 loops, best of 3: 3.16 µs per loop # caching happening

>>> %timeit isinstance(s, collections.abc.Mapping)
100000 loops, best of 3: 3.26 µs per loop # caching happening

We can see the performance discrepancy - what accounts for it?

For a dict

>>> %lprun -f abc.ABCMeta.__instancecheck__ isinstance(dict(), collections.abc.Mapping)
Timer unit: 6.84247e-07 s
Total time: 1.71062e-05 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   178                                               def __instancecheck__(cls, instance):
   179                                                   """Override for isinstance(instance, cls)."""
   180                                                   # Inline the cache checking
   181         1            7      7.0     28.0          subclass = instance.__class__
   182         1           16     16.0     64.0          if subclass in cls._abc_cache:
   183         1            2      2.0      8.0              return True
   184                                                   subtype = type(instance)
   185                                                   if subtype is subclass:
   186                                                       if (cls._abc_negative_cache_version ==
   187                                                           ABCMeta._abc_invalidation_counter and
   188                                                           subclass in cls._abc_negative_cache):
   189                                                           return False
   190                                                       # Fall back to the subclass check.
   191                                                       return cls.__subclasscheck__(subclass)
   192                                                   return any(cls.__subclasscheck__(c) for c in {subclass, subtype})

For a list

>>> %lprun -f abc.ABCMeta.__instancecheck__ isinstance(list(), collections.abc.Mapping)
Timer unit: 6.84247e-07 s
Total time: 3.07911e-05 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   178                                               def __instancecheck__(cls, instance):
   179                                                   """Override for isinstance(instance, cls)."""
   180                                                   # Inline the cache checking
   181         1            7      7.0     15.6          subclass = instance.__class__
   182         1           17     17.0     37.8          if subclass in cls._abc_cache:
   183                                                       return True
   184         1            2      2.0      4.4          subtype = type(instance)
   185         1            2      2.0      4.4          if subtype is subclass:
   186         1            3      3.0      6.7              if (cls._abc_negative_cache_version ==
   187         1            2      2.0      4.4                  ABCMeta._abc_invalidation_counter and
   188         1           10     10.0     22.2                  subclass in cls._abc_negative_cache):
   189         1            2      2.0      4.4                  return False
   190                                                       # Fall back to the subclass check.
   191                                                       return cls.__subclasscheck__(subclass)
   192                                                   return any(cls.__subclasscheck__(c) for c in {subclass, subtype})

We can see that for a dict, the Mapping abstract classes' _abc_cache

>>> list(collections.abc.Mapping._abc_cache)
[dict]

includes our dict, and so the check short-circuits early. For a list evidently the positive cache won't be hit, however the Mapping's _abc_negative_cache contains the list type

>>> list(collections.abc.Mapping._abc_negative_cache)
[type,
 list,
 generator,
 pandas.core.series.Series,
 itertools.chain,
 int,
 map]

as well as now the pd.Series type, as a result of calling isinstance more than once with %timeit. In the case that we don't hit the negative cache (like the first iteration for a Series), Python resorts to the regular subclass check with

cls.__subclasscheck__(subclass)

which can be far slower, resorting to the subclass hook and recursive subclass checks seen here, then caches the result for subsequent speedups.

来源：https://stackoverflow.com/questions/42378726/why-is-checking-isinstancesomething-mapping-so-slow

标签

python

python-internals