I have a piece of mature geospatial software that has recently had areas rewritten to take better advantage of the multiple processors available in modern PCs. Specifically
Firstly, many thanks for the responses. For the responses posted across different forumes see;
http://www.sqaforums.com/showflat.php?Cat=0&Number=617621&an=0&page=0#Post617621
Testing approach for multi-threaded software
http://www.softwaretestingclub.com/forum/topics/testing-approach-for?xg_source=activity
and the following mailing list; software-testing@yahoogroups.com
The testing took significantly longer than expected, hence this late reply, leading me to the conclusion that adding multi-threading to existing apps is liable to be very expensive in terms of testing, even if the coding is quite straightforward. This could prove interesting for the SQA community, as there is increasingly more multi-threaded development going on out there.
As per Joe Strazzere's advice, I found the most effective way of hitting bugs was via automation with varied input. I ended up doing this on three PCs, which have ran a bank of tests repeatedly with varied input over about six weeks. Initially, I was seeing crashes one or two times per PC per day. As I tracked these down, it ended up with one or two per week between the three PCs, and we haven't had any further problems for the last two weeks. For the last two weeks we have also had a version with users beta testing, and are using the software in-house.
In addition to varying the input under automation, I also got good results from the following;
Adding a test option that allowed mutex time-outs to be read from a configuration file, which in turn could be controlled by my automation.
Extending mutex time-outs beyond the typical time expected to execute a section of thread code, and firing a debug exception on time-out.
Running the automation in conjunction with a debugger (VS2008) such that when a problem occurred there was a better chance of tracking it down.
Running without a debugger to ensure that the debugger was not hiding other timing related bugs.
Running the automation against normal release, debug, and fully optimised build. FWIW, the optimised build threw up errors not reproducible in the other builds.
The type of bugs uncovered tended to be serious in nature, e.g. dereferencing invalid pointers, and even under the debugger took quite a bit of tracking down. As has been discussed elsewhere, the SuspendThread and ResumeThread functions ended up being major culprits, and all use of these functions were replaced by mutexes. Similarly all critical sections were removed due to lack of time-outs. Closing documents and exiting the program were also a bug source, where in one instance a document was destroyed with a worker thread still active. To overcome this a single mutex was added per thread to control the life of the thread, and aquired by the document destructor to ensure the thread had terminated as expected.
Once again, many thanks for the all the detailed and varied responses. Next time I take on this type of activity, I'll be better prepared.