spark error loading files from S3 wildcard


Question


I'm using the pyspark shell and trying to read data from S3 using Spark's file-wildcard feature, but I'm getting the following error:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Python version 2.7.6 (default, Jul 24 2015 16:07:07)
SparkContext available as sc.
>>> sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", 'AWS_ACCESS_KEY_ID')
>>> sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", 'AWS_SECRET_ACCESS_KEY')
>>> sc.textFile("s3n://myBucket/path/files-*", use_unicode=False).count()
16/01/07 18:03:02 INFO MemoryStore: ensureFreeSpace(37645) called with curMem=83944, maxMem=278019440
16/01/07 18:03:02 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 36.8 KB, free 265.0 MB)
16/01/07 18:03:02 INFO MemoryStore: ensureFreeSpace(5524) called with curMem=121589, maxMem=278019440
16/01/07 18:03:02 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 5.4 KB, free 265.0 MB)
16/01/07 18:03:02 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on salve1:48235 (size: 5.4 KB, free: 265.1 MB)
16/01/07 18:03:02 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
16/01/07 18:03:02 INFO SparkContext: Created broadcast 2 from textFile at NativeMethodAccessorImpl.java:-2
16/01/07 18:03:03 WARN RestS3Service: Response '/path' - Unexpected response code 404, expected 200
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/spark-1.2.0-bin-1.0.4/python/pyspark/rdd.py", line 819, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/spark-1.2.0-bin-1.0.4/python/pyspark/rdd.py", line 810, in sum
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/spark-1.2.0-bin-1.0.4/python/pyspark/rdd.py", line 715, in reduce
    vals = self.mapPartitions(func).collect()
  File "/spark-1.2.0-bin-1.0.4/python/pyspark/rdd.py", line 676, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/spark-1.2.0-bin-1.0.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/spark-1.2.0-bin-1.0.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o65.collect.
: org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: Failed to sanitize XML document destined for handler class org.jets3t.service.impl.rest.XmlResponsesSaxParser$ListBucketHandler
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:197)
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:166)
        at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.fs.s3native.$Proxy7.list(Unknown Source)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:375)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:842)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:902)
        at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1032)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:987)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:177)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:57)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1352)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
        at org.apache.spark.api.java.JavaRDDLike$class.collect(JavaRDDLike.scala:309)
        at org.apache.spark.api.java.JavaRDD.collect(JavaRDD.scala:32)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.jets3t.service.S3ServiceException: Failed to sanitize XML document destined for handler class org.jets3t.service.impl.rest.XmlResponsesSaxParser$ListBucketHandler
        at org.jets3t.service.impl.rest.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:179)
        at org.jets3t.service.impl.rest.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:198)
        at org.jets3t.service.impl.rest.httpclient.RestS3Service.listObjectsInternal(RestS3Service.java:1090)
        at org.jets3t.service.impl.rest.httpclient.RestS3Service.listObjectsChunkedImpl(RestS3Service.java:1056)
        at org.jets3t.service.S3Service.listObjectsChunked(S3Service.java:1328)
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:181)
        ... 44 more
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
        at java.lang.StringBuffer.append(StringBuffer.java:369)
        at org.jets3t.service.impl.rest.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:160)
        at org.jets3t.service.impl.rest.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:198)
        at org.jets3t.service.impl.rest.httpclient.RestS3Service.listObjectsInternal(RestS3Service.java:1090)
        at org.jets3t.service.impl.rest.httpclient.RestS3Service.listObjectsChunkedImpl(RestS3Service.java:1056)
        at org.jets3t.service.S3Service.listObjectsChunked(S3Service.java:1328)
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:181)
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.list(Jets3tNativeFileSystemStore.java:166)
        at sun.reflect.GeneratedMethodAccessor40.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.fs.s3native.$Proxy7.list(Unknown Source)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:375)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:842)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:902)
        at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1032)
        at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:987)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:177)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)

When I try to load a single file (without the wildcard), the code works. Since I need to read about 100k files, I'm wondering what the best way is to load all of them into an RDD.


Update

It appears that the problem is that the S3 "directory" (key prefix) containing all of my files holds over 300k objects. My files have the date as a suffix:

s3://myBucket/path/files-2016-01-01-02-00
s3://myBucket/path/files-2016-01-01-02-01
s3://myBucket/path/files-2016-01-01-03-00
s3://myBucket/path/files-2016-01-01-03-01

I was trying to use the wildcard to select only some files by date, e.g. s3n://myBucket/path/files-2016-01-01-03-*. When I turned on debug logging I saw that Spark was listing all of the files in the S3 "directory" (s3://myBucket/path/) rather than only the files with the key prefix I specified (s3://myBucket/path/files-2016-01-01-03-). So even though I was only trying to read 2 files, all 300k keys were being listed, and that is probably what caused the out-of-memory error.
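
One way to sidestep the full listing (a minimal sketch of my own, assuming the boto 2 library is available and reusing the bucket/prefix names above) is to ask S3 for only the keys under the exact prefix and pass the explicit paths to Spark:

import boto

# A prefix-restricted listing returns only the matching objects,
# instead of every key under s3://myBucket/path/.
conn = boto.connect_s3('AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY')
bucket = conn.get_bucket('myBucket')
paths = ['s3n://myBucket/' + key.name
         for key in bucket.list(prefix='path/files-2016-01-01-03-')]

# textFile accepts a comma-separated list of paths, so no globbing is needed.
rdd = sc.textFile(','.join(paths), use_unicode=False)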


Answer 1:


I've listed my files from S3 directly and then built an RDD from the exact file names; it's working so far for me.

import subprocess

# List the matching keys with the AWS CLI (note the single quotes around the credential values)
raw_file_list = subprocess.Popen("env AWS_ACCESS_KEY_ID='myId' AWS_SECRET_ACCESS_KEY='myKey' aws s3 ls s3://myBucket/path/files-2016-01-01-02", shell=True, stdout=subprocess.PIPE).stdout.read().strip().split('\n')
# Column 4 of `aws s3 ls` output is the key name; build the full s3n:// paths
s3_file_list = sc.parallelize(raw_file_list).map(lambda line: "s3n://myBucket/path/%s" % line.split()[3]).collect()
rdd = sc.textFile(','.join(s3_file_list), use_unicode=False)
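
A couple of notes on this approach (my own observations, not part of the original answer): sc.textFile accepts a comma-separated list of paths, so handing it the exact keys avoids the glob listing of the 300k-object prefix entirely; also, passing credentials on the command line makes them visible to other users in the process list, so a credentials file or IAM role may be preferable on a shared machine.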



Answer 2:


It's throwing an out-of-memory error, so first try limiting the pattern to fewer files and see if that resolves the issue.




Answer 3:


Spark has a stupid problem loading lots of small files together, because it broadcasts some data for each one. This might be fixed in 1.6.0, which was released a few days ago. At the moment I expect your code is loading each file and unioning the RDDs together under the hood.

The solution I've used is to move all of the files I want to load into a single directory on S3 and then pass that as a glob, e.g. s3n://myBucket/path/input-files/*. That way you're only loading a single path as far as Spark is concerned, and it won't create a broadcast variable for each file in that path.
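
For illustration (a minimal sketch, not from the original answer; the input-files/ prefix is hypothetical), once the files sit under their own prefix the whole load is a single call:

# All of the files for this job live under one dedicated prefix, so one glob covers them.
rdd = sc.textFile("s3n://myBucket/path/input-files/*", use_unicode=False)
print(rdd.count())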



Source: https://stackoverflow.com/questions/34662953/spark-error-loading-files-from-s3-wildcard
