I have to write a not-so-large program in C++, using boost::thread.
The problem at hand is to process a large (maybe thousands or tens of thousands. Hundreds and millon
How expensive the simplest thread is depends on the OS (you may also need to tune some OS parameters to get past a certain number of threads). At minimum each has its own CPU state (registers/flags incl. floating point) and stack as well as any thread-specific heap storage.
If each individual thread doesn't need too much distinct state, then you can probably get them pretty cheap by using a small stack size.
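To make the small-stack idea concrete, here is a minimal sketch using plain POSIX threads, so it only applies on POSIX systems. The names `small_worker` and `run_small_stack_threads` and the 64 KiB figure are made up for illustration; the real lower limit is platform-specific. As far as I know, boost::thread exposes the same knob through `boost::thread::attributes::set_stack_size`.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical worker: touches one int, so it needs almost no stack.
static void* small_worker(void* arg) {
    int* slot = static_cast<int*>(arg);
    *slot += 1;
    return nullptr;
}

// Starts up to `count` threads with the requested stack size, joins them,
// and returns how many actually ran. Returns 0 if the OS rejects the size.
std::size_t run_small_stack_threads(std::size_t count, std::size_t stack_bytes) {
    assert(stack_bytes > 0);

    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0) return 0;
    if (pthread_attr_setstacksize(&attr, stack_bytes) != 0) {
        // Typically means the size is below the platform minimum.
        pthread_attr_destroy(&attr);
        return 0;
    }

    std::vector<int> slots(count, 0);   // one result slot per thread
    std::vector<pthread_t> threads;
    threads.reserve(count);

    for (std::size_t i = 0; i < count; ++i) {
        pthread_t t;
        if (pthread_create(&t, &attr, small_worker, &slots[i]) != 0) {
            break;  // hit an OS limit; stop creating more
        }
        threads.push_back(t);
    }
    for (pthread_t t : threads) pthread_join(t, nullptr);
    pthread_attr_destroy(&attr);

    std::size_t ran = 0;
    for (int s : slots) ran += static_cast<std::size_t>(s);
    return ran;
}
```

Calling `run_small_stack_threads(1000, 64 * 1024)` and comparing the result to 1000 is a quick way to see where your OS starts refusing new threads.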
In the limit, you may end up needing to use a non-OS cooperative threading mechanism, or even multiplex events yourself using tiny "execution context" objects.
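A rough sketch of what multiplexing tiny "execution context" objects yourself could look like, with no threads at all. `CountdownTask` and `run_all` are hypothetical names, and a real task would carry whatever state your processing needs instead of a counter.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical "execution context": all the state one logical task needs,
// held in a couple of ints instead of a whole thread stack.
struct CountdownTask {
    int id;
    int remaining;  // slices of work left before the task is done

    // Runs one slice of work; returns true while more work remains.
    bool step() {
        --remaining;
        return remaining > 0;
    }
};

// Round-robin multiplexer: gives each task one step at a time on the
// calling thread and returns the ids in the order the tasks finished.
std::vector<int> run_all(std::deque<CountdownTask> ready) {
    std::vector<int> finished;
    while (!ready.empty()) {
        CountdownTask task = ready.front();
        ready.pop_front();
        if (task.step()) {
            ready.push_back(task);        // not done: back of the queue
        } else {
            finished.push_back(task.id);  // done: record completion order
        }
    }
    return finished;
}
```

Each context here costs a few bytes, so millions of them are feasible where millions of OS threads would not be. The price is that every task has to be written as small steps that return control voluntarily.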
Just start with threads and worry about it later :)