What that micro-benchmark shows is that thread creation and joining takes tens or hundreds of microseconds (no fork results had been posted when I wrote this). That assumes your system has CLOCKS_PER_SEC=1000000, which it probably does, since XSI requires it.
Since you said that fork() costs about 3 times as much as creating a thread, we are still talking about tenths of a millisecond at worst. If that is noticeable in an application, you could use a pool of processes or threads, as Apache 1.3 did with its pre-forked worker processes. In any case, I'd say that startup time is a moot point.
The important difference between threads and processes (on Linux and most Unix-likes) is that with processes you choose explicitly what to share, using IPC, shared memory (SYSV or mmap-style), pipes, sockets (you can send file descriptors over AF_UNIX sockets, meaning you get to choose which fds to share), and so on. With threads, on the other hand, almost everything is shared by default, whether there's a need to share it or not. In fact, that is the reason Plan 9 had rfork() and Linux has clone() (and, more recently, unshare()): they let you choose what to share.