I\'m benchmarking ElasticSearch for very high indexing throughput purposes.
My current goal is to be able to index 3 billion (3,000,000,000) documents in a matter of
Long story short, I ended up with 5 virtual linux machines, 8 cpu, 16 GB, using puppet to deploy elasticsearch. My documents got a little bigger, but so did the throuhgput rate (slightly). I was able to reach 150K index requests / second on average, indexing 1 billion documents in 2 hours. Throughput is not constant, and i observed similar diminishing throughput behavior as before, but to a lesser extent. Since I will be using daily indices for same amount of data, I would expect these performance metrics to be roughly similar every day.
The transition from windows machines to linux was primarily due to convenience and compliance with IT conventions. Though i don't know for sure, I suspect the same results could be achieved on windows as well.
In several of my trials I attempted indexing without specifying document ids as Christian Dahlqvist suggested. The results were astonishing. I observed a significant throughput increase, reaching 300k and higher in some cases. The conclusion of this is obvious: Do not specify document ids, unless you absolutely have to.
Also, i'm using less shards per machine, which also contributed to throughput increase.