I realize that reduction is only usable for POD types in C++. What would you do to implement a reduction for a complex type accumulator?
complex
The typical workaround in absence of user-defined reductions in OpenMP is even uglier than what you suggested. Usually, prior to the parallel region people create an array of (at least) as many elements as there will be threads in the region, accumulate partial results separately for each thread using omp_get_thread_num() as an index to the array, and do final reduction of the accumulated results in a loop after the parallel region.
As far as I know, OpenMP language committee works on adding user-defined reductions to the specification, so maybe it will be finally resolved in a few years.