I want to encrypt/decrypt lots of small (2-10kB) pieces of data. The performance is ok for now: On a Core2Duo, I get about 90 MBytes/s AES256 (when using 2 threads). But I m
A simple Google search will turn up some JCE providers that claim hardware acceleration, such as the Solaris Crypto Framework. I have heard the break-even point is 4 kB: below that, it is faster to do the work with the in-JVM Java providers.
I might look at using the NSS implementation; it may have compiler optimizations for your platform (and you can certainly build it from source with them enabled), though I have not used it myself. The big benefit of a hardware provider is probably that the keys can be stored in hardware in a way that supports using them without exposing them to the OS.
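Before swapping providers, it can help to see which ones your JVM already has installed and which one it picks for AES. Here is a minimal sketch using only the standard `java.security` and `javax.crypto` APIs; the class name `AesProviderCheck` and the helper `providerFor` are made up for illustration, and the output depends entirely on your JVM and platform.

```java
import java.security.Provider;
import java.security.Security;
import javax.crypto.Cipher;

// Hypothetical helper class, not part of any library.
class AesProviderCheck {

    /** Returns the name of the provider the JVM selects for a transformation. */
    static String providerFor(String transformation) throws Exception {
        return Cipher.getInstance(transformation).getProvider().getName();
    }

    public static void main(String[] args) throws Exception {
        // Installed providers, in preference order.
        for (Provider p : Security.getProviders()) {
            System.out.println(p.getName() + ": " + p.getInfo());
        }
        // The provider that actually backs a plain getInstance() call.
        String t = "AES/CBC/PKCS5Padding";
        System.out.println(t + " -> " + providerFor(t));
    }
}
```

If a hardware-backed or NSS-backed provider is installed but ranked lower, you can request it explicitly with `Cipher.getInstance(transformation, providerName)` and benchmark both against your 2-10 kB chunks.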
Update: I should probably mention that the Keyczar source had some helpful insight (somewhere in the source or surrounding docs) about reducing the overhead of initializing the Cipher. It also does exactly what you want (see Encrypter) and seems to implement asynchronous encryption using a thread pool.
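To make the idea concrete, here is a rough sketch of one common way to cut the per-call overhead: keep one `Cipher` per worker thread (a `Cipher` is not thread-safe) and only call `init` for each chunk, with the work handed to a fixed thread pool. This is not Keyczar's code. The class `PooledAesEncrypter`, its method names, the CBC/PKCS5 mode, and the "IV followed by ciphertext" layout are all my own assumptions for illustration, and whether it actually helps should be measured on your hardware.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical class for illustration only.
class PooledAesEncrypter {
    private static final String TRANSFORMATION = "AES/CBC/PKCS5Padding";
    private static final int IV_LENGTH = 16; // AES block size in bytes

    // One Cipher per thread, so getInstance() is paid once per thread.
    private static final ThreadLocal<Cipher> CIPHER = ThreadLocal.withInitial(() -> {
        try {
            return Cipher.getInstance(TRANSFORMATION);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    });
    private static final SecureRandom RANDOM = new SecureRandom();

    private final SecretKey key;
    private final ExecutorService pool;

    PooledAesEncrypter(byte[] rawKey, int threads) {
        this.key = new SecretKeySpec(rawKey, "AES"); // 32 bytes for AES-256
        this.pool = Executors.newFixedThreadPool(threads);
    }

    /** Encrypts one chunk; the result is a fresh random IV followed by the ciphertext. */
    byte[] encrypt(byte[] plain) throws GeneralSecurityException {
        byte[] iv = new byte[IV_LENGTH];
        RANDOM.nextBytes(iv);
        Cipher c = CIPHER.get();
        c.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        byte[] ct = c.doFinal(plain);
        byte[] out = Arrays.copyOf(iv, IV_LENGTH + ct.length);
        System.arraycopy(ct, 0, out, IV_LENGTH, ct.length);
        return out;
    }

    /** Reverses encrypt(): reads the IV from the first 16 bytes. */
    byte[] decrypt(byte[] message) throws GeneralSecurityException {
        Cipher c = CIPHER.get();
        c.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(message, 0, IV_LENGTH));
        return c.doFinal(message, IV_LENGTH, message.length - IV_LENGTH);
    }

    /** Queues a chunk for encryption on the pool. */
    Future<byte[]> encryptAsync(byte[] plain) {
        Callable<byte[]> task = () -> encrypt(plain);
        return pool.submit(task);
    }

    void shutdown() {
        pool.shutdown();
    }

    public static void main(String[] args) throws Exception {
        byte[] rawKey = new byte[32];
        RANDOM.nextBytes(rawKey);
        PooledAesEncrypter enc = new PooledAesEncrypter(rawKey, 2);
        byte[] chunk = new byte[4096];
        byte[] roundTrip = enc.decrypt(enc.encryptAsync(chunk).get());
        System.out.println("round trip ok: " + Arrays.equals(chunk, roundTrip));
        enc.shutdown();
    }
}
```

Note that CBC without a MAC gives no integrity protection; an authenticated mode would be the safer choice for real data, at some extra cost per chunk.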