Multiple spark jobs appending parquet data to same base path with partitioning

粉色の甜心 2020-12-08 01:07

I have multiple jobs that I want to execute in parallel that append daily data into the same path using partitioning.

e.g.

dataFrame.write()
    .partitionBy("eventDate", "category")
    .mode(Append)
    .parquet("s3://bucket/save/path");
4 Answers
  • 2020-12-08 01:26

    Instead of using partitionBy:

    dataFrame.write()
             .partitionBy("eventDate", "category")
             .mode(Append)
             .parquet("s3://bucket/save/path");
    

    you can write each partition directory explicitly (a parametrized sketch follows the list below).

    In job-1, specify the parquet file path as:

    dataFrame.write().mode(Append)            
    .parquet("s3://bucket/save/path/eventDate=20160101/channel=billing_events")
    

    and in job-2, specify the parquet file path as:

    dataFrame.write().mode(Append)            
    .parquet("s3://bucket/save/path/eventDate=20160101/channel=click_events")
    
    1. Both jobs will create separate _temporary directories under their respective folders, so the concurrency issue is solved.
    2. Partition discovery will still work for both the eventDate=20160101 and channel columns.
    3. Disadvantage: even if channel=click_events does not exist in the data, a parquet directory for channel=click_events is still created.
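
    A parametrized sketch of this per-job path (in Scala; dataFrame is assumed to be in scope, and the values are just the example ones from above):

    // Each job writes only its own partition directory, so the jobs never share
    // a _temporary directory under the base path.
    val basePath  = "s3://bucket/save/path"
    val eventDate = "20160101"
    val channel   = "billing_events"   // job-2 would pass "click_events"

    dataFrame.write
      .mode("append")
      .parquet(s"$basePath/eventDate=$eventDate/channel=$channel")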
  • 2020-12-08 01:33

    So after much reading about how to tackle this problem, I thought I'd transfer some wisdom back here to wrap things up. Thanks mostly to Tal's comments.

    I've additionally found that writing directly to s3://bucket/save/path seems dangerous, because if a job is killed and the cleanup of the temporary folder doesn't happen at the end of the job, it is left there for the next job. I've noticed that sometimes a previously killed job's temp files land in s3://bucket/save/path and cause duplication... Totally unreliable.

    Additionally, renaming the files in the _temporary folder to their final S3 locations takes a horrendous amount of time (approx. 1 second per file), as S3 only supports copy/delete, not rename. Moreover, only the driver renames these files, using a single thread, so as much as 1/5 of some jobs with large numbers of files/partitions is spent just waiting on rename operations.

    I've ruled out using the DirectOutputCommitter for a number of reasons:

    1. When used in conjunction with speculation mode, it results in duplication (https://issues.apache.org/jira/browse/SPARK-9899).
    2. Task failures will leave clutter that would be impossible to find and remove/clean later.
    3. Spark 2.0 removed support for it completely, and no upgrade path exists (https://issues.apache.org/jira/browse/SPARK-10063).

    The only safe, performant, and consistent way to execute these jobs is to save each of them to a unique temporary folder (unique by applicationId or timestamp) in HDFS first, and copy the output to S3 on job completion.

    This lets concurrent jobs execute because they save to unique temp folders, removes the need for the DirectOutputCommitter since the rename operation on HDFS is quicker than on S3, and keeps the saved data more consistent. A sketch of this staged write follows.
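
    A minimal sketch of the staged write (assuming spark and dataFrame are in scope; the HDFS staging path and S3 bucket are placeholders, and the driver-side copy is just one option, since a distcp/s3-dist-cp step would also work):

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    // Stage the output in a job-unique HDFS directory, where renames are cheap.
    val appId      = spark.sparkContext.applicationId
    val stagingDir = s"hdfs:///tmp/parquet-staging/$appId"   // unique per job (placeholder)
    val finalDir   = "s3://bucket/save/path"                 // shared base path

    dataFrame.write
      .partitionBy("eventDate", "category")
      .mode("append")
      .parquet(stagingDir)

    // After the write succeeds, merge the staged partition tree into the shared S3 path.
    val conf  = spark.sparkContext.hadoopConfiguration
    val srcFs = FileSystem.get(URI.create(stagingDir), conf)
    val dstFs = FileSystem.get(URI.create(finalDir), conf)

    def copyMerge(src: Path, dst: Path): Unit =
      if (srcFs.getFileStatus(src).isDirectory) {
        dstFs.mkdirs(dst)          // fine if the directory already exists
        srcFs.listStatus(src).foreach(c => copyMerge(c.getPath, new Path(dst, c.getPath.getName)))
      } else {
        FileUtil.copy(srcFs, src, dstFs, dst, /* deleteSource = */ false, conf)
      }

    copyMerge(new Path(stagingDir), new Path(finalDir))
    srcFs.delete(new Path(stagingDir), true)                 // clean up the staging directory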

  • 2020-12-08 01:41

    Multiple write tasks to the same path with partitionBy will FAIL once _temporary is deleted in the cleanupJob of FileOutputCommitter, with errors like No such file or directory.

    TEST CODE:

    def batchTask[A](TASK_tag: String, taskData: TraversableOnce[A], batchSize: Int, fTask: A => Unit, fTaskId: A => String): Unit = {
      var list = new scala.collection.mutable.ArrayBuffer[(String, java.util.concurrent.Future[Int])]()
      val executors = java.util.concurrent.Executors.newFixedThreadPool(batchSize)
      try {
        taskData.foreach(d => {
          val task = executors.submit(new java.util.concurrent.Callable[Int] {
            override def call(): Int = {
              fTask(d)
              1
            }
          })
          list += ((fTaskId(d), task))
        })
        var count = 0
        list.foreach(r => if (!r._2.isCancelled) count += r._2.get())
      } finally {
        executors.shutdown()
      }
    }
    def testWriteFail(outPath: String)(implicit spark: SparkSession, sc: SparkContext): Unit = {
      println(s"try save: ${outPath}")
      import org.apache.spark.sql.functions._
      import spark.sqlContext.implicits._
      batchTask[Int]("test", 1 to 20, 6, t => {
        val df1 =
          Seq((1, "First Value", java.sql.Date.valueOf("2010-01-01")), (2, "Second Value", java.sql.Date.valueOf("2010-02-01")))
            .toDF("int_column", "string_column", "date_column")
            .withColumn("t0", lit(t))
        df1.repartition(1).write
          .mode("overwrite")
          .option("mapreduce.fileoutputcommitter.marksuccessfuljobs", false)
          .partitionBy("t0").csv(outPath)
      }, t => f"task.${t}%4d") // some Exception
      println(s"fail: count=${spark.read.csv(outPath).count()}")
    }
    try {
      testWriteFail(outPath + "/fail")
    } catch {
      case e: Throwable =>
    }
    

    Failed

    Using a custom OutputCommitter instead:

    package org.jar.spark.util
    import java.io.IOException
    /*
      * For writing to the same directory from multiple DataFrame tasks.
      * <pre>
      * 1. Writes go through a temporary (cache) directory.
      * 2. If the tasks' outputs may overlap, do not use overwrite, to avoid accidental deletion.
      * </pre>
      * <p/>
      * Created by liao on 2018-12-02.
      */
    object JMultiWrite {
      val JAR_Write_Cache_Flag = "jar.write.cache.flag"
      val JAR_Write_Cache_TaskId = "jar.write.cache.taskId"
      /** Automatically delete subdirectories with the same name under the target directory */
      val JAR_Write_Cache_Overwrite = "jar.write.cache.overwrite"
      implicit class ImplicitWrite[T](dw: org.apache.spark.sql.DataFrameWriter[T]) {
        /**
          * Write to files; configure option/format/mode etc. on the writer beforehand.
          *
          * @param outDir    target output directory
          * @param taskId    ID of this task, used to isolate each task's output; must be unique
          * @param cacheDir  cache directory, preferably one starting with '_', e.g. "_jarTaskCache"
          * @param overwrite whether to delete already-existing directories; the default false means Append mode
          *                  <font color=red>(if parallel tasks may output the same subdirectories, they would clobber each other; do not use overwrite in that case)</font>
          */
        def multiWrite(outDir: String, taskId: String, cacheDir: String = "_jarTaskCache", overwrite: Boolean = false): Boolean = {
          val p = path(outDir, cacheDir, taskId)
          dw.options(options(cacheDir, taskId))
            .option(JAR_Write_Cache_Overwrite, overwrite)
            .mode(org.apache.spark.sql.SaveMode.Overwrite)
            .save(p)
          true
        }
      }
      def options(cacheDir: String, taskId: String): Map[String, String] = {
        Map(JAR_Write_Cache_Flag -> cacheDir,
          JAR_Write_Cache_TaskId -> taskId,
          "mapreduce.fileoutputcommitter.marksuccessfuljobs" -> "false",
          "mapreduce.job.outputformat.class" -> classOf[JarOutputFormat].getName
        )
      }
      def path(outDir: String, cacheDir: String, taskId: String): String = {
        assert(outDir != "", "need OutDir")
        assert(cacheDir != "", "need CacheDir")
        assert(taskId != "", "needTaskId")
        outDir + "/" + cacheDir + "/" + taskId
      }
      /*-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-*/
      class JarOutputFormat extends org.apache.hadoop.mapreduce.lib.output.TextOutputFormat[Any, Any] {
        var committer: org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter = _
    
        override def getOutputCommitter(context: org.apache.hadoop.mapreduce.TaskAttemptContext): org.apache.hadoop.mapreduce.OutputCommitter = {
          if (this.committer == null) {
            val output = org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.getOutputPath(context)
            this.committer = new JarOutputCommitter(output, context)
          }
          this.committer
        }
      }
      class JarOutputCommitter(output: org.apache.hadoop.fs.Path, context: org.apache.hadoop.mapreduce.TaskAttemptContext)
        extends org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter(output, context) {
        override def commitJob(context: org.apache.hadoop.mapreduce.JobContext): Unit = {
          val finalOutput = this.output
          val cacheFlag = context.getConfiguration.get(JAR_Write_Cache_Flag, "")
          val myTaskId = context.getConfiguration.get(JAR_Write_Cache_TaskId, "")
          val overwrite = context.getConfiguration.getBoolean(JAR_Write_Cache_Overwrite, false)
          val hasCacheFlag = finalOutput.getName == myTaskId && finalOutput.getParent.getName == cacheFlag
          val finalReal = if (hasCacheFlag) finalOutput.getParent.getParent else finalOutput // determine the final output directory
          // walk the job output directory
          val fs = finalOutput.getFileSystem(context.getConfiguration)
          val jobAttemptPath = getJobAttemptPath(context)
          val arr$ = fs.listStatus(jobAttemptPath, new org.apache.hadoop.fs.PathFilter {
            override def accept(path: org.apache.hadoop.fs.Path): Boolean = !"_temporary".equals(path.getName())
          })
          if (hasCacheFlag && overwrite) // remove subdirectories with the same name
          {
            if (fs.isDirectory(finalReal)) arr$.foreach(stat =>
              if (fs.isDirectory(stat.getPath)) fs.listStatus(stat.getPath).foreach(stat2 => {
                val p1 = stat2.getPath
                val p2 = new org.apache.hadoop.fs.Path(finalReal, p1.getName)
                if (fs.isDirectory(p1) && fs.isDirectory(p2) && !fs.delete(p2, true)) throw new IOException("Failed to delete " + p2)
              })
            )
          }
          arr$.foreach(stat => {
            mergePaths(fs, stat, finalReal)
          })
          cleanupJob(context)
          if (hasCacheFlag) { // remove the cache directory
            try {
              fs.delete(finalOutput, false)
              val pp = finalOutput.getParent
              if (fs.listStatus(pp).isEmpty)
                fs.delete(pp, false)
            } catch {
              case e: Exception =>
            }
          }
          // no need to emit _SUCCESS
          //if (context.getConfiguration.getBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", true)) {
          //  val markerPath = new org.apache.hadoop.fs.Path(this.outputPath, "_SUCCESS")
          //  fs.create(markerPath).close()
          //}
        }
      }
      @throws[IOException]
      def mergePaths(fs: org.apache.hadoop.fs.FileSystem, from: org.apache.hadoop.fs.FileStatus, to: org.apache.hadoop.fs.Path): Unit = {
        if (from.isFile) {
          if (fs.exists(to) && !fs.delete(to, true)) throw new IOException("Failed to delete " + to)
          if (!fs.rename(from.getPath, to)) throw new IOException("Failed to rename " + from + " to " + to)
        }
        else if (from.isDirectory) if (fs.exists(to)) {
          val toStat = fs.getFileStatus(to)
          if (!toStat.isDirectory) {
            if (!fs.delete(to, true)) throw new IOException("Failed to delete " + to)
            if (!fs.rename(from.getPath, to)) throw new IOException("Failed to rename " + from + " to " + to)
          }
          else {
            val arr$ = fs.listStatus(from.getPath)
            for (subFrom <- arr$) {
              mergePaths(fs, subFrom, new org.apache.hadoop.fs.Path(to, subFrom.getPath.getName))
            }
          }
        }
        else if (!fs.rename(from.getPath, to)) throw new IOException("Failed to rename " + from + " to " + to)
      }
    }
    

    And then:

    def testWriteOk(outPath: String)(implicit spark: SparkSession, sc: SparkContext): Unit = {
      println(s"try save: ${outPath}")
      import org.apache.spark.sql.functions._
      import org.jar.spark.util.JMultiWrite.ImplicitWrite // import the helper
      import spark.sqlContext.implicits._
      batchTask[Int]("test.ok", 1 to 20, 6, t => {
        val taskId = t.toString
        val df1 =
          Seq((1, "First Value", java.sql.Date.valueOf("2010-01-01")), (2, "Second Value", java.sql.Date.valueOf("2010-02-01")))
            .toDF("int_column", "string_column", "date_column")
            .withColumn("t0", lit(taskId))
        df1.repartition(1).write
          .partitionBy("t0")
          .format("csv")
          .multiWrite(outPath, taskId, overwrite = true) // overwrite is used here; if partitions may overlap, do not use overwrite
      }, t => f"task.${t}%4d")
      println(s"ok: count=${spark.read.csv(outPath).count()}") // 40
    }
    try {
      testWriteOk(outPath + "/ok")
    } catch {
      case e: Throwable =>
    }
    

    Success:

    $  ls ok/
    t0=1  t0=10 t0=11 t0=12 t0=13 t0=14 t0=15 t0=16 t0=17 t0=18 t0=19 t0=2  t0=20 t0=3  t0=4  t0=5  t0=6  t0=7  t0=8  t0=9
    

    The same applies to other output formats; pay attention to the use of overwrite.

    Tested on Spark with Scala 2.11.8.

    Thanks to @Tal Joffe.

  • 2020-12-08 01:49

    I suspect this is because of the changes to partition discovery that were introduced in Spark 1.6. The change means that Spark will only treat paths like .../xxx=yyy/ as partitions if you have specified a basePath option (see the Spark 1.6 release notes).

    So I think your problem will be solved if you add the basePath option, like this:

    dataFrame
      .write()
      .partitionBy("eventDate", "category")
      .option("basepath", "s3://bucket/save/path")
      .mode(Append)
      .parquet("s3://bucket/save/path");
    

    (I haven't had the chance to verify it, but hopefully it will do the trick :))
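
    On the read side, the basePath option is also what drives partition discovery when you point Spark at a subdirectory of the partitioned layout; a minimal read-side sketch (spark is assumed to be a SparkSession, and the paths are the example ones from above):

    // Reading a single eventDate directory while keeping eventDate (and category)
    // as partition columns, by telling Spark where the table's base path is.
    val df = spark.read
      .option("basePath", "s3://bucket/save/path")
      .parquet("s3://bucket/save/path/eventDate=20160101")

    df.printSchema()   // schema includes the eventDate partition column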
