Question
I'm using Spark-SQL-2.3.1, Kafka, and Java 8 in my project, and would like to use AWS S3 as the storage.
I am writing/storing the data consumed from a Kafka topic into an S3 bucket as below:
ds.writeStream()
  .format("parquet")
  .option("path", parquetFileName)
  .option("mergeSchema", true)
  .outputMode("append")
  .partitionBy("company_id")
  .option("checkpointLocation", checkPtLocation)
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start();
But while writing I am getting a FileNotFoundException:
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://company_id=216231245/part-00055-f4f87dc9-a620-41bd-9380-de4ba7e70efb.c000.snappy.parquet
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:1931)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:1822)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:1763)
I wonder why I'm getting a FileNotFoundException when writing. I am not reading from S3, right?
So what is happening here and how do I fix it?
Answer 1:
This is because S3 is not a file system but an object store, so it does not support the rename semantics that HDFS provides. Spark first writes the output files to a temporary folder and then renames them, and there is no atomic way of doing this in S3. That's why you will see these errors at times.
Now, to fix this, if your environment allows, you could use HDFS as intermediate storage and then move the files to S3 for later processing, as sketched below.
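A minimal sketch of that staging approach, assuming an HDFS cluster is reachable from the Spark job; the paths, method name, and the suggestion to copy with distcp are illustrative assumptions, not part of the original answer:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class HdfsStagingWriter {

    // Hypothetical helper: write the streaming output to HDFS instead of S3.
    // The staged files can later be moved to S3 in bulk (e.g. with "hadoop distcp"),
    // so the streaming commit no longer depends on S3 rename behaviour.
    public static StreamingQuery writeToHdfs(Dataset<Row> ds,
                                             String hdfsStagingPath,     // e.g. "hdfs:///staging/events" (placeholder)
                                             String hdfsCheckpointPath)  // e.g. "hdfs:///checkpoints/events" (placeholder)
            throws Exception {
        return ds.writeStream()
                .format("parquet")
                .option("path", hdfsStagingPath)
                .option("checkpointLocation", hdfsCheckpointPath)
                .outputMode("append")
                .partitionBy("company_id")
                .trigger(Trigger.ProcessingTime("25 seconds"))
                .start();
    }
}

The copy to S3 would then happen out of band, for example with something like "hadoop distcp hdfs:///staging/events s3a://your-bucket/events" (bucket and paths are placeholders).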
If you are on Hadoop 3.1, you could use the S3A committers shipped with it. Details on how to configure them can be found in the Hadoop S3A committer documentation; a rough sketch follows.
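For illustration, a hedged sketch of the kind of configuration the Hadoop/Spark cloud-integration documentation describes for the S3A committers. It assumes Hadoop 3.1+ and the spark-hadoop-cloud module on the classpath; the exact property names and committer choice should be verified against the docs for your versions:

import org.apache.spark.sql.SparkSession;

public class S3ACommitterConfig {

    // Illustrative settings only, based on the S3A committer / Spark cloud-integration docs.
    public static SparkSession build() {
        return SparkSession.builder()
                .appName("s3a-committer-example")
                // Pick one of the S3A committers: "directory", "partitioned" or "magic".
                .config("spark.hadoop.fs.s3a.committer.name", "directory")
                // Route Spark's commit protocol through Hadoop's PathOutputCommitter machinery
                // (classes provided by the spark-hadoop-cloud module).
                .config("spark.sql.sources.commitProtocolClass",
                        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
                .config("spark.sql.parquet.output.committer.class",
                        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
                .getOrCreate();
    }
}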
If you are on an older version of Hadoop, you could use an S3 output committer for Spark, which essentially uses S3's multipart upload to mimic the rename. One such committer I am aware of is this one, though it does not look like it has been updated recently. There may be other options too.
Source: https://stackoverflow.com/questions/60201672/while-writing-to-s3-why-i-get-filenotfoundexception