Originally I believed the overhead of a context switch was the TLB being flushed. However, I just saw on Wikipedia:
http://en.wikipedia.org/wiki/Translation_lookaside_buffer
If we account for cache invalidation (which we usually should, and which is the largest contributor to context-switch costs in the real world), the performance penalty due to a context switch can be HUGE:
https://www.usenix.org/legacy/events/expcs07/papers/2-li.pdf (admittedly a bit outdated, but the best I was able to find) puts it in the range of 100K-1M CPU cycles. Theoretically, in the worst possible case for a multi-socket server box with 32M per-socket L3 caches consisting of 64-byte cache lines, completely random access, and typical access times of 40 cycles for L3 / 100 cycles for main RAM, the penalty can reach as much as 30M+ CPU cycles(!).
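To see where the 30M+ figure comes from, here is a back-of-envelope sketch using only the illustrative numbers above (they are assumptions, not measurements): if the whole 32M L3 gets evicted, every one of its 64-byte lines must later be re-fetched from main RAM instead of hitting L3, costing roughly 100 - 40 = 60 extra cycles per line.

```c
#include <stdio.h>

int main(void) {
    /* Illustrative worst-case figures from the text above. */
    const long cache_bytes = 32L * 1024 * 1024; /* 32M per-socket L3   */
    const long line_bytes  = 64;                /* cache line size     */
    const long l3_cycles   = 40;                /* L3 hit latency      */
    const long ram_cycles  = 100;               /* main-RAM latency    */

    long lines = cache_bytes / line_bytes;      /* 524,288 cache lines */
    /* Each evicted line costs a RAM access instead of an L3 hit when
       it is touched again: (100 - 40) = 60 extra cycles per line.     */
    long penalty = lines * (ram_cycles - l3_cycles);

    printf("%ld lines * %ld extra cycles = %ld cycles (~%.0fM)\n",
           lines, ram_cycles - l3_cycles, penalty, penalty / 1e6);
    return 0;
}
```

Running it gives ~31M cycles, which is where the 30M+ estimate comes from.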
From personal experience, I'd say it is usually in the range of tens of thousands of CPU cycles, though depending on the specifics it can differ by an order of magnitude.