AWS EMR performance: HDFS vs S3

六月ゝ 毕业季﹏ submitted on 2019-12-04 08:54:11

That's problematic on a different level.

S3 offers only eventual consistency. Data your code has just written (e.g. after a close() or flush()) is not guaranteed to be immediately visible or readable, because the write process is delayed. I think this might be due to the allocation of free resources for the data you write. So it is not a problem of performance, but of the consistency you really want or need.
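
One common way to cope with the delayed visibility described above is to poll until the object shows up before a downstream step reads it. The following is a minimal Python sketch of that idea, not a specific AWS API: `wait_until_visible` and the `exists` callable are hypothetical names, and in practice `exists` would wrap whatever existence check your S3 client provides.

```python
import time
from typing import Callable


def wait_until_visible(
    exists: Callable[[], bool],
    attempts: int = 5,
    initial_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `exists` with exponential backoff; return True once it succeeds.

    `exists` is a caller-supplied check (hypothetical here), e.g. a function
    that asks the object store whether a given key is readable yet.
    """
    delay = initial_delay
    for attempt in range(attempts):
        if exists():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2
    return False
```

Polling only narrows the window; it does not turn an eventually consistent store into a strongly consistent one, which is why the answer treats this as a consistency question rather than a performance one.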

What do I do on EMR? I start up a Hadoop cluster and put everything the job(s) need into HDFS. Reads from S3 take much longer, and the eventual consistency makes it basically useless for buffering items between jobs.
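
The staging step described above is often done with `s3-dist-cp`, the copy tool available on EMR clusters. The sketch below only builds the command line and runs it on request; the bucket name and paths are placeholders, and `build_s3distcp_command` and `stage` are hypothetical helpers, not part of any AWS library.

```python
import subprocess
from typing import List


def build_s3distcp_command(src: str, dest: str) -> List[str]:
    """Return the argv for an s3-dist-cp copy from `src` to `dest`."""
    return ["s3-dist-cp", f"--src={src}", f"--dest={dest}"]


def stage(src: str, dest: str) -> None:
    """Run the copy on a cluster node; raises CalledProcessError on failure."""
    subprocess.run(build_s3distcp_command(src, dest), check=True)


# Placeholder locations (assumptions, not taken from the original post).
STAGE_IN = build_s3distcp_command("s3://my-bucket/input/", "hdfs:///data/input/")
# The same tool can copy results from HDFS back to S3.
STAGE_OUT = build_s3distcp_command("hdfs:///data/output/", "s3://my-bucket/output/")
```

Calling `stage("s3://my-bucket/input/", "hdfs:///data/input/")` on the master node would perform the copy before the jobs start, so that intermediate reads and writes stay on HDFS.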

However, S3 is great for backing up files from your HDFS or making them available to other instances or services (e.g. CloudFront).

In terms of performance, HDFS is better than S3.

HDFS is better if your requirement is long term, needs high performance, and involves running iterative machine learning algorithms.

S3 is better if your load is variable and you need high durability and persistence at lower cost.

For more information, see http://www.nithinkanil.com/2015/05/hdfs-vs-s3.html

You must use S3 if you want to keep your data after terminating the EMR cluster, because HDFS data is deleted once the cluster is terminated.
