From a logical point of view, an application may need dozens or hundreds of threads, some of which will be sleeping most of the time, but a very few will always be running co
I found that when writing data parsers that handle large data sets over a network, it is best to create a thread for each letter of the alphabet (one per alphabetical partition of the data) so that the program becomes more CPU- and memory-bound. The I/O-boundedness inherent in network and disk operations is a major bottleneck, so you may as well "get started" on the other data files instead of doing the work sequentially.
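A minimal sketch of that thread-per-letter idea, in Python. The names `fetch_and_parse` and `parse_by_letter` are made up for illustration, `time.sleep` stands in for the real network or disk read, and upper-casing stands in for the actual parsing, so treat this as an outline rather than the parser described above.

```python
import threading
import time
from collections import defaultdict


def fetch_and_parse(letter, names, results, lock, delay):
    # Assumption: time.sleep stands in for a blocking network or disk read.
    time.sleep(delay)
    # Assumption: upper-casing stands in for the real parsing work.
    parsed = [name.upper() for name in names]
    with lock:
        results[letter] = parsed


def parse_by_letter(names, delay=0.05):
    # Partition the input by first letter, one bucket per letter.
    buckets = defaultdict(list)
    for name in names:
        buckets[name[0].lower()].append(name)

    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(
            target=fetch_and_parse,
            args=(letter, group, results, lock, delay),
        )
        for letter, group in buckets.items()
    ]
    # Start every thread before joining any, so the waits overlap.
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(parse_by_letter(["alice", "adam", "bob", "Carol"]))
```

Because all the threads are started before any is joined, the simulated reads overlap instead of running one after another. Note that in CPython the global interpreter lock keeps the parsing step itself from running in parallel, so the gain here comes from overlapping the waits.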
On a quad core, it would certainly make sense to start more than four threads. It is unlikely that just four threads would keep all four cores busy the whole time, especially at today's processor speeds.
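One way to see why more threads than cores can pay off is to time the same blocking workload with two pool sizes. This is only a sketch: `blocking_task` and `run_with_workers` are hypothetical names, and `time.sleep` is an assumed stand-in for a thread waiting on I/O.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_task(i, delay):
    # Assumption: the sleep models time spent blocked on I/O, not on the CPU.
    time.sleep(delay)
    return i * i


def run_with_workers(n_tasks, n_workers, delay=0.05):
    # Run n_tasks blocking tasks on a pool of n_workers threads
    # and report the results together with the elapsed wall time.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(lambda i: blocking_task(i, delay), range(n_tasks)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    for workers in (4, 16):
        _, elapsed = run_with_workers(16, workers)
        print(f"{workers} workers: {elapsed:.2f}s")
```

With 16 tasks, four workers have to go through four rounds of waiting, while 16 workers wait roughly once. The right multiple of the core count depends on how much of each task is spent blocked, so it is worth measuring rather than guessing.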