I was trying to implement a Miller-Rabin primality test, and was puzzled why it was taking so long (> 20 seconds) even for midsize numbers (~7 digits). I eventually found the following…
BrenBarn answered your main question. For your aside:
why is it almost twice as fast when run with Python 2 or 3 as with PyPy, when usually PyPy is much faster?
If you read PyPy's performance page, this is exactly the kind of thing PyPy is not good at—in fact, the very first example they give:
Bad examples include doing computations with large longs – which is performed by unoptimizable support code.
Theoretically, turning a huge exponentiation followed by a mod into a modular exponentiation (at least after the first pass) is a transformation a JIT might be able to make… but not PyPy's JIT.
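To make the difference concrete, here's a minimal sketch comparing the two approaches. It assumes the slow code computed the full power and then reduced it, as in `pow(a, d) % n`; the specific number is just an illustrative ~7-digit value:

```python
import time

n = 10000019            # a midsize (~7-digit) odd number, like in the question
a, d = 2, n - 1

# Huge exponentiation followed by a mod: builds a multi-million-digit
# intermediate integer before reducing it.
start = time.perf_counter()
r1 = pow(a, d) % n
t1 = time.perf_counter() - start

# Modular exponentiation: three-argument pow reduces mod n after every
# multiplication, so intermediate values stay small.
start = time.perf_counter()
r2 = pow(a, d, n)
t2 = time.perf_counter() - start

assert r1 == r2
print(f"pow(a, d) % n:  {t1:.3f}s")
print(f"pow(a, d, n):   {t2:.6f}s")
```

The two-step version has to construct an integer millions of digits long before the `%` ever runs; the three-argument form should finish orders of magnitude faster on any Python implementation.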
As a side note, if you need to do calculations with huge integers, you may want to look at third-party modules like gmpy. It can be much faster than CPython's native int implementation for work outside the mainstream uses, and it has a lot of additional functionality that you'd otherwise have to write yourself, at the cost of being less convenient.
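For example, here's a short sketch using gmpy2 (the maintained successor to gmpy, so installing it is an assumption here: `pip install gmpy2`). `powmod` and `is_prime` are gmpy2's built-in modular exponentiation and probabilistic primality test:

```python
import gmpy2
from gmpy2 import mpz

n = mpz(10000019)       # same illustrative midsize number as above
a, d = mpz(2), n - 1

# gmpy2's modular exponentiation; same semantics as pow(a, d, n),
# but backed by the GMP library's optimized big-integer code.
print(gmpy2.powmod(a, d, n))

# An example of the "additional functionality": a built-in probabilistic
# primality test, so you don't have to hand-roll Miller-Rabin at all.
print(gmpy2.is_prime(n))
```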