Background of the Problem: I'm trying to write a puzzle solution algorithm that takes advantage of multi-core processors and parallel processing. However, the ideal/easiest s
This type of problem reminds me of genetic algorithms (GAs). You already have a fitness function (the cost), and the layout of the problem seems suited to crossover and mutation. You could use one of the available GA engines and run multiple pools/generations in parallel. GAs tend to find good solutions quite fast, although finding the absolute best solution is not guaranteed. On the other hand, I believe the puzzle you describe does not necessarily have a single optimal solution anyway. GAs are often used for scheduling (for example, to create a roster of teachers, classrooms and classes). The solutions found are usually 'robust' in the sense that, when the constraints change, a reasonable solution accommodating the change can often be found with a minimal number of modifications.
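To make the "multiple pools in parallel" idea concrete, here is a minimal sketch, not a real GA engine. Since I don't know the details of your puzzle, it uses n-Queens as a stand-in, with a permutation as the chromosome and the number of diagonal conflicts as the cost. All names (cost, crossover, mutate, evolve, run_pools) and all parameter values are my own assumptions for illustration. Each pool gets its own seeded random generator so the pools evolve independently.

```python
import random
from concurrent.futures import ThreadPoolExecutor

N = 8  # board size of the stand-in puzzle (assumption, not from the question)

def cost(perm):
    """Number of queen pairs sharing a diagonal; 0 means solved."""
    return sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if abs(perm[i] - perm[j]) == j - i
    )

def crossover(a, b, rng):
    """Keep a slice of parent a, fill the remaining slots in parent b's order."""
    lo, hi = sorted(rng.sample(range(len(a) + 1), 2))
    kept = a[lo:hi]
    rest = [g for g in b if g not in kept]
    return rest[:lo] + kept + rest[lo:]

def mutate(perm, rng, rate=0.2):
    """With probability `rate`, swap two positions of a copy of perm."""
    perm = perm[:]
    if rng.random() < rate:
        i, j = rng.sample(range(len(perm)), 2)
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def evolve(seed, pop_size=40, generations=200):
    """Run one independent pool; return (best_cost, best_perm)."""
    rng = random.Random(seed)
    pop = [rng.sample(range(N), N) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        if cost(pop[0]) == 0:
            break
        parents = pop[: pop_size // 2]  # truncation selection; the best survive
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    best = min(pop, key=cost)
    return cost(best), best

def run_pools(seeds):
    """Evolve one pool per seed concurrently and return the overall best."""
    with ThreadPoolExecutor() as pool:
        return min(pool.map(evolve, seeds))
```

Calling run_pools([1, 2, 3, 4]) evolves four pools and returns the lowest-cost result. Note that in standard CPython, threads do not execute Python bytecode in parallel, so to actually use several cores you would swap in ProcessPoolExecutor (same interface) or use a language with real threads.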
As for parallelizing the given recursive algorithm: I tried this recently (using Terracotta) for the n-Queens problem and did something similar to what you describe. The first-row queen is placed in each possible column to create n subproblems. There is a pool of worker threads. A job scheduler checks whether there is an idle worker thread available in the pool and assigns it a subproblem. The worker thread works through the subproblem, outputs all solutions it finds, and returns to the idle state. Because there are typically far fewer worker threads than subproblems, it is not a big issue if the subproblems don't take equal amounts of time to solve: a worker that finishes early simply picks up the next one.
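The scheme looks roughly like the following sketch. This is not my Terracotta code; it is a small Python illustration in which the standard library's ThreadPoolExecutor plays the role of both the worker pool and the job scheduler. The function names solve_from and solve_parallel and the default of 4 workers are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_from(n, first_col):
    """Return all n-Queens solutions whose row-0 queen sits in first_col."""
    solutions = []

    def place(cols):
        # cols[r] is the column of the queen already placed in row r.
        row = len(cols)
        if row == n:
            solutions.append(tuple(cols))
            return
        for c in range(n):
            # Safe if no earlier queen shares the column or a diagonal.
            if all(c != pc and abs(c - pc) != row - pr
                   for pr, pc in enumerate(cols)):
                cols.append(c)
                place(cols)
                cols.pop()

    place([first_col])
    return solutions

def solve_parallel(n, workers=4):
    """Split on the first-row column and hand the n subproblems to a pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The executor assigns the next subproblem to whichever worker is idle.
        parts = pool.map(lambda c: solve_from(n, c), range(n))
        return [s for part in parts for s in part]
```

For example, len(solve_parallel(8)) gives 92. In standard CPython, threads will not run this CPU-bound search on several cores at once; ProcessPoolExecutor would be the drop-in replacement for that (with a module-level function instead of the lambda). If n subproblems are too few compared to the number of workers, fixing the queens in the first two rows instead of one gives a finer split.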
I'm curious to hear other ideas.