Question
I am reading each file of a directory using wholeTextFiles. After that I call a function on each element of the RDD using map. The whole program uses only the first 50 lines of each file. The code is as follows:
import sys
import csv
import StringIO  # Python 2; on Python 3 use io.StringIO

from pyspark import SparkContext

def processFiles(fileNameContentsPair):
    fileName = fileNameContentsPair[0]
    result = "\n\n" + fileName
    resultEr = "\n\n" + fileName
    input = StringIO.StringIO(fileNameContentsPair[1])
    reader = csv.reader(input, strict=True)
    try:
        i = 0
        for row in reader:
            if i == 50:
                break
            # do some processing and build the result string
            i = i + 1
    except csv.Error as e:
        resultEr = resultEr + "error occurred\n\n"
        return resultEr
    return result

if __name__ == "__main__":
    inputFile = sys.argv[1]
    outputFile = sys.argv[2]
    sc = SparkContext(appName="SomeApp")
    resultRDD = sc.wholeTextFiles(inputFile).map(processFiles)
    resultRDD.saveAsTextFile(outputFile)
Each file in the directory can be very large in my case, and for that reason the wholeTextFiles API will be inefficient here. Is there a more efficient way to do this? I can think of iterating over the files of the directory one by one, but that also seems inefficient. I am new to Spark. Please let me know if there is an efficient way to do this.
Answer 1:
What I would suggest is to split your files into smaller chunks first; a few GBs is too large to read in one go, and that is the main cause of your delay. If your data is on HDFS, something like 64 MB per file works well. Otherwise, experiment with the file size, since the right value depends on the number of executors you have: with more, smaller chunks you get more parallelism. Likewise, you can increase the number of partitions to tune this, since your processFiles function does not seem to be CPU intensive (a sketch of both knobs follows below). The only downside of many executors is increased I/O, but if the files are small that shouldn't be much of a problem.
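For illustration only, here is a minimal sketch of the two tuning knobs mentioned above: the minPartitions hint of wholeTextFiles and an explicit repartition. The directory path and the partition count of 200 are placeholders and would need to be tuned for your actual cluster:

from pyspark import SparkContext

sc = SparkContext(appName="SomeApp")

# Ask for more input splits up front; minPartitions is only a hint to Spark.
pairs = sc.wholeTextFiles("/data/input-chunks/", minPartitions=200)

# Or redistribute after reading so the map over processFiles runs as more,
# smaller tasks spread across the executors.
pairs = pairs.repartition(200)

results = pairs.map(processFiles)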
By the way, there is no need for a temp directory: wholeTextFiles supports wildcards like *. Also note that if you use S3 as the filesystem, having too many small files can itself become a bottleneck, since opening many small objects can take longer than reading a single large file. So this trade-off is not trivial.
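As a small sketch (the path below is hypothetical), the pre-split chunks can be read directly with a glob pattern instead of staging them in a temp directory:

# wholeTextFiles accepts glob patterns, so all chunk files can be matched at once.
pairs = sc.wholeTextFiles("hdfs:///data/input/chunk-*.csv")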
Hope this helps!
Source: https://stackoverflow.com/questions/43845220/apache-spark-read-large-size-files-from-a-directory