How to save a spark DataFrame as csv on disk?

Asked by 你的背包, 2020-11-29 03:05

For example, the result of this:

df.filter("project = 'en'").select("title","count").groupBy("title").sum()

would return an Array

4 answers
Answer by 执笔经年, 2020-11-29 03:36

    I had a similar problem: I needed to write a CSV file on the driver while connected to the cluster in client mode.

    I wanted to reuse the same CSV formatting code as Apache Spark to avoid potential errors.

    I checked the spark-csv code and found the code responsible for converting a DataFrame into a raw CSV RDD[String] in com.databricks.spark.csv.CsvSchemaRDD.

    Sadly, it is hardcoded to sc.textFile at the end of the relevant method.

    I copy-pasted that code, removed the last lines with sc.textFile, and returned the RDD directly instead.

    My code:

    /*
      This is copy-pasted from com.databricks.spark.csv.CsvSchemaRDD.
      Spark's code has a perfect method for converting a DataFrame -> raw CSV RDD[String],
      but in the last lines of that method it is hardcoded to write a text file;
      for our case we need the RDD.
     */
    import org.apache.commons.csv.QuoteMode  // spark-csv uses Apache Commons CSV
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.DataFrame

    object DataframeToRawCsvRDD {
    
      val defaultCsvFormat = com.databricks.spark.csv.defaultCsvFormat
    
      // `ctx` here is the answerer's own context type exposing `sparkContext`;
      // plain scala.concurrent.ExecutionContext has no such member, so
      // substitute whatever gives you access to the SparkContext.
      def apply(dataFrame: DataFrame, parameters: Map[String, String] = Map())
               (implicit ctx: ExecutionContext): RDD[String] = {
        val delimiter = parameters.getOrElse("delimiter", ",")
        val delimiterChar = if (delimiter.length == 1) {
          delimiter.charAt(0)
        } else {
          throw new Exception("Delimiter cannot be more than one character.")
        }
    
        val escape = parameters.getOrElse("escape", null)
        val escapeChar: Character = if (escape == null) {
          null
        } else if (escape.length == 1) {
          escape.charAt(0)
        } else {
          throw new Exception("Escape character cannot be more than one character.")
        }
    
        val quote = parameters.getOrElse("quote", "\"")
        val quoteChar: Character = if (quote == null) {
          null
        } else if (quote.length == 1) {
          quote.charAt(0)
        } else {
          throw new Exception("Quotation cannot be more than one character.")
        }
    
        val quoteModeString = parameters.getOrElse("quoteMode", "MINIMAL")
        val quoteMode: QuoteMode = if (quoteModeString == null) {
          null
        } else {
          QuoteMode.valueOf(quoteModeString.toUpperCase)
        }
    
        val nullValue = parameters.getOrElse("nullValue", "null")
    
        val csvFormat = defaultCsvFormat
          .withDelimiter(delimiterChar)
          .withQuote(quoteChar)
          .withEscape(escapeChar)
          .withQuoteMode(quoteMode)
          .withSkipHeaderRecord(false)
          .withNullString(nullValue)
    
        val generateHeader = parameters.getOrElse("header", "false").toBoolean
        val headerRdd = if (generateHeader) {
          ctx.sparkContext.parallelize(Seq(
            csvFormat.format(dataFrame.columns.map(_.asInstanceOf[AnyRef]): _*)
          ))
        } else {
          ctx.sparkContext.emptyRDD[String]
        }
    
        val rowsRdd = dataFrame.rdd.map(row => {
          csvFormat.format(row.toSeq.map(_.asInstanceOf[AnyRef]): _*)
        })
    
        headerRdd union rowsRdd
      }
    
    }
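    Since the helper above returns an RDD[String] instead of writing files, the formatted lines can be collected and written on the driver with plain java.nio. A minimal sketch (the file path is illustrative, and a driver-side collect is only safe when the result fits in driver memory):

```scala
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Write already-formatted CSV lines as a single local file on the driver.
def writeCsvLocally(lines: Seq[String], path: String): Unit = {
  Files.write(Paths.get(path), lines.asJava)
}

// Driver-side usage (DataframeToRawCsvRDD is the helper defined above):
// val lines = DataframeToRawCsvRDD(df).collect().toSeq
// writeCsvLocally(lines, "/tmp/result.csv")
```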
    

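    For completeness: in Spark 2.0+ the built-in CSV data source makes a custom converter unnecessary for ordinary on-disk output. A minimal local sketch (the sample data, local master, and output path are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: Spark 2.0+ ships a CSV writer on DataFrameWriter.
val spark = SparkSession.builder().master("local[1]").appName("csv-demo").getOrCreate()
import spark.implicits._

val df = Seq(("en", "Main_Page", 42L), ("en", "Spark", 7L))
  .toDF("project", "title", "count")

// Output directory must not exist yet, so build it under a fresh temp dir.
val out = java.nio.file.Files.createTempDirectory("csv-demo").resolve("out").toString

df.filter("project = 'en'")
  .select("title", "count")
  .groupBy("title")
  .sum("count")
  .coalesce(1)                     // one partition => a single part-file
  .write
  .option("header", "true")
  .csv(out)                        // writes a directory containing part-*.csv

spark.stop()
```

    coalesce(1) is only appropriate when the result fits on a single executor; otherwise keep the default partitioning and concatenate the part-files afterwards.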