Too many open files in EMR

Submitted by 柔情痞子 on 2019-11-29 11:23:05

OK, so it seems that the ulimit set by default in Amazon EMR's setup, 32768, is already more than enough, and if any job needs more than that, its logic should be revisited. Hence, instead of writing every file directly to S3, I wrote them locally and moved them to S3 in batches of 1024 files. This solved the too many open files issue (see the sketch below).

Perhaps the file descriptors opened for writing to S3 weren't getting released/closed as promptly as they are when writing to local files. Any better explanation of this is welcome.
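
A minimal sketch of that batching approach, assuming the AWS SDK for Java (v1); the bucket name, key prefix, local staging directory, and batch size of 1024 are placeholders illustrating the description above, not taken from any actual job:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BatchedS3Writer {
    private static final int BATCH_SIZE = 1024;           // files per upload batch
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final List<File> pending = new ArrayList<>();
    private final String bucket;                           // placeholder bucket name
    private final String prefix;                           // placeholder key prefix

    public BatchedS3Writer(String bucket, String prefix) {
        this.bucket = bucket;
        this.prefix = prefix;
    }

    /** Write one record to a local file, closing its descriptor immediately. */
    public void write(String name, byte[] data) throws IOException {
        File f = new File("/mnt/tmp", name);                // local staging dir (assumption)
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(data);                                // try-with-resources closes the fd here
        }
        pending.add(f);
        if (pending.size() >= BATCH_SIZE) {
            flush();
        }
    }

    /** Upload the staged batch to S3 and delete the local copies. */
    public void flush() {
        for (File f : pending) {
            s3.putObject(bucket, prefix + "/" + f.getName(), f);
            f.delete();
        }
        pending.clear();
    }
}
```

Because each local file is closed as soon as it is written, the number of descriptors held open at any moment stays small, which matches the behaviour described above.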

There may be a way to do this via bootstrap actions, specifically one of the predefined ones, and if the predefined ones don't work, custom scripts can do anything you would normally be able to do on any Linux cluster. But first I would ask why you're outputting so many files. HDFS/Hadoop is definitely more optimized for fewer, larger files. If you're hoping to do some sort of indexing, writing out raw files with different names is probably not the best approach.
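
If a custom bootstrap action turns out to be the route, a rough sketch of wiring one in with the AWS SDK for Java (v1) might look like the following. The S3 script path, release label, and instance settings are all placeholders, and required details such as IAM roles are omitted; the referenced script is assumed to raise the open-file limit on each node before the daemons start.

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.BootstrapActionConfig;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.ScriptBootstrapActionConfig;

public class LaunchWithUlimitBootstrap {
    public static void main(String[] args) {
        // Custom bootstrap action: a shell script stored in S3 (path is a placeholder)
        // that raises the nofile limit, e.g. by editing /etc/security/limits.conf.
        BootstrapActionConfig raiseUlimit = new BootstrapActionConfig()
                .withName("raise-open-file-limit")
                .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
                        .withPath("s3://my-bucket/bootstrap/raise-ulimit.sh"));

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("cluster-with-higher-ulimit")
                .withReleaseLabel("emr-5.30.0")                  // placeholder release
                .withBootstrapActions(raiseUlimit)
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(3)
                        .withMasterInstanceType("m5.xlarge")
                        .withSlaveInstanceType("m5.xlarge")
                        .withKeepJobFlowAliveWhenNoSteps(true));

        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
        System.out.println("Cluster id: " + emr.runJobFlow(request).getJobFlowId());
    }
}
```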

I had this issue, but it is a Linux setting.

Solve it by going here and following the steps:

http://www.cyberciti.biz/faq/linux-unix-nginx-too-many-open-files/

I think the correct solution here is to have a single sequence file, the contents of which are each of your binary files, keyed by filename. It's fine to split out records into files, but those files can be stored as blobs, keyed by filename, in one big sequence file.
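
A rough sketch of that idea using Hadoop's SequenceFile API, with Text keys (the filenames) and BytesWritable values (the raw file contents); the output path here is a placeholder and could just as well point at HDFS instead of S3.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class PackIntoSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path output = new Path("s3://my-bucket/packed/files.seq");  // placeholder destination

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            // Each small binary file becomes one record: key = filename, value = its bytes.
            for (String name : args) {
                byte[] data = Files.readAllBytes(Paths.get(name));
                writer.append(new Text(name), new BytesWritable(data));
            }
        } finally {
            IOUtils.closeStream(writer);   // one descriptor for the whole output, closed once
        }
    }
}
```

This keeps a single output stream open regardless of how many logical files are packed into it, which sidesteps the descriptor limit entirely.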
