    for (int x = 0; x < blockCountX; x++)
    {
        for (int y = 0; y < blockCountY; y++)
        { /* ... a new thread is started for each 256x256 block ... */ }
    }
Why do you believe that breaking the large image up into smaller chunks will be more efficient? Is the large image too large to fit into system memory? 4 million pixels x 8 bpp (1 byte per pixel) = 4 megabytes. That was a lot of memory 20 years ago; today it's chump change.
Creating multiple 256x256 sub-images will require copying the pixel data into new images in memory, plus the image header/descriptor overhead for each new image, plus alignment padding per scanline. You will more than double your memory use, which can itself cause performance problems (virtual memory swapping).
You are also spinning up a new thread for each image block. Creating a thread is expensive and may take more time than the work you want the thread to do. Consider at least using ThreadPool.QueueUserWorkItem to make use of the already available system worker threads. Using .NET 4.0's Task class would be even better, IMO.
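For example, in .NET 4.0 the per-block work can be queued to the thread pool via Task.Factory.StartNew instead of a dedicated Thread per block. This is only a sketch: ProcessBlock(x, y) is a hypothetical stand-in for whatever your per-block code currently does.

    using System.Collections.Generic;
    using System.Threading.Tasks;

    // Queue one Task per block; the thread pool reuses its worker threads
    // instead of paying thread-creation cost for every block.
    var tasks = new List<Task>();
    for (int x = 0; x < blockCountX; x++)
    {
        for (int y = 0; y < blockCountY; y++)
        {
            int bx = x, by = y;   // copy loop variables before capturing them in the lambda
            tasks.Add(Task.Factory.StartNew(() => ProcessBlock(bx, by)));  // hypothetical per-block work
        }
    }
    Task.WaitAll(tasks.ToArray());   // wait for all queued blocks to finish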
Forget .GetPixel(). It's a thousand times slower than accessing the pixel memory directly (Bitmap.LockBits in .NET).
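A rough sketch of what direct access looks like, assuming a System.Drawing.Bitmap named bmp in 8bpp indexed format; the inversion in the inner loop is just a placeholder for your real per-pixel work:

    using System.Drawing;
    using System.Drawing.Imaging;
    using System.Runtime.InteropServices;

    Rectangle rect = new Rectangle(0, 0, bmp.Width, bmp.Height);
    BitmapData data = bmp.LockBits(rect, ImageLockMode.ReadWrite, PixelFormat.Format8bppIndexed);
    try
    {
        int byteCount = data.Stride * data.Height;      // Stride includes per-scanline padding
        byte[] pixels = new byte[byteCount];
        Marshal.Copy(data.Scan0, pixels, 0, byteCount); // copy pixel memory into a managed buffer

        for (int y = 0; y < data.Height; y++)
            for (int x = 0; x < data.Width; x++)
                pixels[y * data.Stride + x] = (byte)(255 - pixels[y * data.Stride + x]); // placeholder op

        Marshal.Copy(pixels, 0, data.Scan0, byteCount); // write the results back
    }
    finally
    {
        bmp.UnlockBits(data);
    }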
If you want to distribute the pixel processing across multiple CPU cores, consider assigning each scanline, or group of scanlines, to a different task or worker thread, as sketched below.
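Combined with the LockBits buffer above, this could look roughly like the following, using .NET 4.0's Parallel.For. Scanlines are independent, so the rows can be spread across cores; pixels, data, and the placeholder inversion are reused from the previous snippet.

    using System.Threading.Tasks;

    // Each scanline is processed on whatever pool thread Parallel.For schedules it on.
    Parallel.For(0, data.Height, y =>
    {
        int rowStart = y * data.Stride;   // start of this scanline in the buffer
        for (int x = 0; x < data.Width; x++)
            pixels[rowStart + x] = (byte)(255 - pixels[rowStart + x]); // placeholder per-pixel work
    });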