I am evaluating Cassandra. I am using the DataStax driver and CQL.
I would like to store some data with the following internal structure, where the names are differ
You have a mistake in your code that I think explains a lot of the performance problems you're seeing: for each batch you prepare the statement again. Preparing a statement isn't super expensive, but doing it as you do adds a lot of latency. The time you spend waiting for that statement to be prepared is time you don't build the batch, and time Cassandra doesn't spend processing that batch. A prepared statement only needs to be prepared once and should be re-used.
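To make that concrete, here is a minimal sketch of the "prepare once, reuse for every batch" shape, written in Python for brevity (the idea is the same in any driver). It is not your code: the table and column names in `INSERT_CQL` are made up, and `new_batch` is a stand-in for whatever creates an empty batch in your driver (in the DataStax Python driver I'd expect that to be `cassandra.query.BatchStatement`). The only point is that `prepare` sits outside the loop.

```python
from itertools import islice

# Hypothetical table and columns, for illustration only.
INSERT_CQL = "INSERT INTO events (id, name, value) VALUES (?, ?, ?)"


def chunks(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def write_batches(session, new_batch, rows, batch_size):
    """Prepare the statement once, then reuse it for every batch.

    `session` needs `prepare(cql)` and `execute(batch)`;
    `new_batch` is a zero-argument factory returning an object
    with `add(statement, parameters)`. Both are assumptions
    about the driver's shape.
    """
    insert = session.prepare(INSERT_CQL)  # one round trip, outside the loop
    sent = 0
    for chunk in chunks(rows, batch_size):
        batch = new_batch()
        for row in chunk:
            batch.add(insert, row)  # reuse the same prepared statement
        session.execute(batch)
        sent += 1
    return sent
```

Note that this still waits for each batch before building the next one, so it only removes the repeated prepare, not the rest of the latency.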
I think much of the bad performance can be explained by latency problems. The bottleneck is most likely your application code, not Cassandra. Even if you only prepare that statement once, you still spend most of the time either being CPU bound in the application (building a big batch) or not doing anything (waiting for the network and Cassandra).
There are two things you can do: first of all, use the async API of the CQL driver and build the next batch while the network and Cassandra are busy with the one you just sent; and secondly, try running multiple threads doing the same thing. You'll have to experiment to find the right number of threads; it will depend on the number of cores you have and whether you're running one or three nodes on the same machine.
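A rough sketch of the async part, again in Python and under some assumptions: `execute_async` is any callable that sends a statement and immediately returns a future-like object with `add_callbacks(callback, errback)` (that is how I understand `Session.execute_async` in the DataStax Python driver to behave), and `max_in_flight` is a number I picked for illustration that you would need to tune. The semaphore keeps a bounded number of requests in flight so the sender only blocks when the pipeline is full.

```python
import threading


def pipelined_write(execute_async, statements, max_in_flight=64):
    """Send statements without waiting for each response.

    At most `max_in_flight` requests are pending at any time.
    Returns (number of statements sent, list of errors).
    """
    slots = threading.Semaphore(max_in_flight)
    errors = []
    errors_lock = threading.Lock()

    def on_success(_result):
        slots.release()  # a response arrived, free one slot

    def on_error(exc):
        with errors_lock:
            errors.append(exc)  # record the failure instead of ignoring it
        slots.release()

    sent = 0
    for stmt in statements:
        slots.acquire()  # blocks only when the pipeline is full
        future = execute_async(stmt)
        future.add_callbacks(on_success, on_error)
        sent += 1

    # Drain: once every slot can be taken, all responses are in.
    for _ in range(max_in_flight):
        slots.acquire()
    return sent, errors
```

While the sender is blocked on a full pipeline or the responses are in transit, the loop feeding `statements` (for example a generator that builds batches) gets to do its work in between, which is where the overlap comes from. Running several of these loops in separate threads is the second suggestion.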
Running a three node cluster on the same machine makes the cluster slower than running a single node, while running on different machines makes it faster. Also running the application on the same machine doesn't exactly help. If you want to test performance, either run only one node or run a real cluster on separate machines.
Batches can give you extra performance, but not always. They can lead to the kind of problem you're seeing in your test code: buffer bloat. Once batches get too big your application spends too much time building them, then too much time pushing them out on the network, and too much time waiting for Cassandra to process them. You need to experiment with batch sizes and see what works best (but do that with a real cluster, otherwise you won't see the effects of the network, which will be a big factor when your batches get bigger).
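If it helps, experimenting with batch sizes can be as simple as timing the same write function at a few sizes. This is only a sketch: `write_fn(rows, batch_size)` is a hypothetical function of yours that writes all the rows using the given batch size, and the numbers only mean something when measured against a real cluster.

```python
import time


def time_batch_sizes(write_fn, rows, sizes):
    """Return {batch_size: rows written per second} for each size tried."""
    results = {}
    for size in sizes:
        start = time.perf_counter()
        write_fn(rows, size)  # your own write routine (hypothetical)
        elapsed = time.perf_counter() - start
        # Guard against a zero reading on very fast fake runs.
        results[size] = len(rows) / elapsed if elapsed > 0 else float("inf")
    return results
```

Run each size more than once and against fresh tables if you can, since the first runs tend to include warm-up effects.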
And if you use batches, use compression. Compression makes no difference in most request loads (responses are another matter), but when you send huge batches it can make a big difference.
There's nothing special about wide row writes in Cassandra. With some exceptions the schema doesn't change the time it takes to process a write. I run applications that do tens of thousands of non-batched mixed wide-row and non-wide-row writes per second. The clusters aren't big, just three or four m1.xlarge EC2 nodes each. The trick is never to wait for a request to return before sending the next (that doesn't mean fire and forget, just handle the responses in the same asynchronous manner). Latency is a performance killer.