How to export data from Spark SQL to CSV

迷失自我 2020-12-04 15:31

This command works with HiveQL:

insert overwrite directory '/data/home.csv' select * from testtable;

But with Spark SQL I'm getting an error.

7 Answers
  • 2020-12-04 16:16

    The simplest way is to map over the DataFrame's RDD and use mkString:

      df.rdd.map(x => x.mkString(","))
    

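To actually write the result to disk as CSV, save the mapped RDD as a text file. A minimal sketch, assuming `df` is your DataFrame; the output directory is a hypothetical placeholder:

      // One comma-separated line per row; note this does no CSV escaping,
      // so it is only safe when no column contains commas, quotes, or newlines.
      df.rdd
        .map(x => x.mkString(","))
        .coalesce(1)                       // optional: produce a single part file
        .saveAsTextFile("/data/home_csv")  // hypothetical output directory
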
    As of Spark 1.5 (or even earlier), df.map(r => r.mkString(",")) would do the same. If you want CSV escaping, you can use Apache Commons Lang for that. For example, here's the code we're using:

      import org.apache.commons.lang3.StringEscapeUtils
      import org.apache.hadoop.io.compress.GzipCodec
      import org.apache.spark.rdd.RDD
      import org.apache.spark.sql.{DataFrame, Row}

      def DfToTextFile(path: String,
                       df: DataFrame,
                       delimiter: String = ",",
                       csvEscape: Boolean = true,
                       partitions: Int = 1,
                       compress: Boolean = true,
                       header: Option[String] = None,
                       maxColumnLength: Option[Int] = None) = {

        val sc = df.rdd.sparkContext

        // Optionally truncate a column value, then CSV-escape it.
        def trimColumnLength(c: String) = {
          val col = maxColumnLength match {
            case None => c
            case Some(len: Int) => c.take(len)
          }
          if (csvEscape) StringEscapeUtils.escapeCsv(col) else col
        }

        // Join the row with an unlikely sentinel, strip control characters and
        // the Unicode replacement character, then split back into columns.
        // (Inside a character class "|" is a literal pipe, so the regex must
        // be [\p{C}\uFFFD], not [\p{C}|\uFFFD].)
        def rowToString(r: Row) = {
          val st = r.mkString("~-~").replaceAll("[\\p{C}\\uFFFD]", "")
          st.split("~-~").map(trimColumnLength).mkString(delimiter)
        }

        // Prepend the header line; headers are only supported for a single
        // partition, otherwise the header could land in any output part file.
        def addHeader(r: RDD[String]) = {
          val rdd = for (h <- header if partitions == 1;
                         tmpRdd = sc.parallelize(Array(h))) yield tmpRdd.union(r).coalesce(1)
          rdd.getOrElse(r)
        }

        // df.rdd.map works on Spark 1.x and later (DataFrame.map returned an RDD only in 1.x).
        val rdd = df.rdd.map(rowToString).repartition(partitions)
        val headerRdd = addHeader(rdd)

        if (compress)
          headerRdd.saveAsTextFile(path, classOf[GzipCodec])
        else
          headerRdd.saveAsTextFile(path)
      }
    
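    For illustration, a hypothetical call to this helper (the output path and header string are placeholders, and `df` is assumed to be a DataFrame already in scope):

      // Write df as a single gzip-compressed, CSV-escaped file with a header row.
      DfToTextFile(
        path = "/data/home_csv",             // hypothetical output directory
        df = df,
        header = Some("id,name,created_at")  // hypothetical header; must match df's columns
      )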