Too many open files in EMR

Submitted by 柔情痞子 on 2019-11-29 11:23:05

OK, so it seems that the ulimit set by default in Amazon EMR's setup, 32768, is already more than enough, and if any job needs more than that, its logic should be revisited. Hence, instead of writing every file directly to S3, I wrote them locally and moved them to S3 in batches of 1024 files. This solved the too many open files issue (see the sketch below).

Perhaps the file descriptors opened for writing to S3 weren't getting released/closed as promptly as they are when writing to local files. Any better explanation of this is welcome.
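
A minimal sketch of that batching approach, assuming the AWS SDK for Java (v1); the bucket name, key prefix, local staging directory, and batch size of 1024 are placeholders illustrating the description above, not taken from any actual job:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BatchedS3Writer {
    private static final int BATCH_SIZE = 1024;           // files per upload batch
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final List<File> pending = new ArrayList<>();
    private final String bucket;                           // placeholder bucket name
    private final String prefix;                           // placeholder key prefix

    public BatchedS3Writer(String bucket, String prefix) {
        this.bucket = bucket;
        this.prefix = prefix;
    }

    /** Write one record to a local file, closing its descriptor immediately. */
    public void write(String name, byte[] data) throws IOException {
        File f = new File("/mnt/tmp", name);                // local staging dir (assumption)
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(data);                                // try-with-resources closes the fd here
        }
        pending.add(f);
        if (pending.size() >= BATCH_SIZE) {
            flush();
        }
    }

    /** Upload the staged batch to S3 and delete the local copies. */
    public void flush() {
        for (File f : pending) {
            s3.putObject(bucket, prefix + "/" + f.getName(), f);
            f.delete();
        }
        pending.clear();
    }
}
```

Because each local file is closed as soon as it is written, the number of descriptors held open at any moment stays small, which matches the behaviour described above.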

There may be a way to do this via bootstrap actions, specifically one of the predefined ones, and if the predefined ones don't work, custom scripts can do anything you would normally be able to do on any Linux cluster. But first I would ask why you're outputting so many files. HDFS/Hadoop is definitely more optimized for fewer, larger files. If you're hoping to do some sort of indexing, writing out raw files with different names is probably not the best approach.
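
If a custom bootstrap action turns out to be the route, a rough sketch of wiring one in with the AWS SDK for Java (v1) might look like the following. The S3 script path, release label, and instance settings are all placeholders, and required details such as IAM roles are omitted; the referenced script is assumed to raise the open-file limit on each node before the daemons start.

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;

public class LaunchWithUlimitBootstrap {
    public static void main(String[] args) {
        // Custom bootstrap action: a shell script stored in S3 (path is a placeholder)
        // that raises the nofile limit, e.g. by editing /etc/security/limits.conf.
        BootstrapActionConfig raiseUlimit = new BootstrapActionConfig()
                .withName("raise-open-file-limit")
                .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
                        .withPath("s3://my-bucket/bootstrap/raise-ulimit.sh"));

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("cluster-with-higher-ulimit")
                .withReleaseLabel("emr-5.30.0")                  // placeholder release
                .withBootstrapActions(raiseUlimit)
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(3)
                        .withMasterInstanceType("m5.xlarge")
                        .withSlaveInstanceType("m5.xlarge")
                        .withKeepJobFlowAliveWhenNoSteps(true));

        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
        System.out.println("Cluster id: " + emr.runJobFlow(request).getJobFlowId());
    }
}
```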

I had this issue, but it is a Linux setting.

Solve it by going here and following the steps:

http://www.cyberciti.biz/faq/linux-unix-nginx-too-many-open-files/

I think the correct solution here is to have a single sequence file, the contents of which are each of your binary files, keyed by filename. It's fine to split out records into files, but those files can be stored as blobs, keyed by filename, in one big sequence file.
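
A rough sketch of that idea using Hadoop's SequenceFile API, with Text keys (the filenames) and BytesWritable values (the raw file contents); the output path here is a placeholder and could just as well point at HDFS instead of S3.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PackIntoSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path output = new Path("s3://my-bucket/packed/files.seq");  // placeholder destination

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            // Each small binary file becomes one record: key = filename, value = its bytes.
            for (String name : args) {
                byte[] data = Files.readAllBytes(Paths.get(name));
                writer.append(new Text(name), new BytesWritable(data));
            }
        } finally {
            IOUtils.closeStream(writer);   // one descriptor for the whole output, closed once
        }
    }
}
```

This keeps a single output stream open regardless of how many logical files are packed into it, which sidesteps the descriptor limit entirely.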
