Question
I am working on a four-node multi-node Hadoop cluster. I have run a series of experiments with different block sizes and measured the run times as follows.
All of them were performed on a 20GB input file: 64MB - 32 min, 128MB - 19 min, 256MB - 15 min, 1GB - 12.5 min.
Should I go further and try a 2GB block size? Also, kindly explain what an optimal block size would be if similar operations are performed on a 90GB file. Thanks!
Answer 1:
You should test with 2GB and compare the results.
Just keep the following in mind: a larger block size reduces the overhead of creating map tasks (for example, a 20GB input yields roughly 320 splits at 64MB but only about 20 at 1GB), but for non-local tasks Hadoop must transfer the entire block to the remote node, where network bandwidth becomes the limit, so a smaller block size performs better in that situation.
In your case, with 4 nodes (which I assume are connected by a local switch or router on a LAN), 2GB isn't a problem. But the same answer doesn't hold in other environments with a higher failure rate, since a failed task has to re-transfer and reprocess an entire, larger block.
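If you want to run the 2GB test without changing the cluster-wide default, the block size can be overridden per file at creation time from the client side. Here is a minimal sketch, assuming Hadoop 2.x client libraries on the classpath and hypothetical local/HDFS paths, that uploads the test input with dfs.blocksize set to 2GB:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadWithBlockSize {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        // dfs.blocksize is read on the client when a file is created,
        // so this override only affects files written by this program.
        conf.setLong("dfs.blocksize", 2L * 1024 * 1024 * 1024); // 2 GB

        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths -- adjust to your own input file and target directory.
        Path localSrc = new Path("file:///data/input-20g.txt");
        Path hdfsDst  = new Path("/experiments/input-20g-blk2g");

        fs.copyFromLocalFile(localSrc, hdfsDst);
        fs.close();
    }
}
```

A one-off upload like this should also be possible from the shell with the generic -D option, e.g. `hdfs dfs -D dfs.blocksize=2147483648 -put ...`, and you can verify how the file was actually split with `hdfs fsck <path> -files -blocks`.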
Source: https://stackoverflow.com/questions/28145178/optimal-block-size-for-a-hadoop-cluster