how to redirect Scala Spark Dataset.show to log4j logger

Submitted by 断了今生、忘了曾经 on 2019-12-23 07:49:48

Question


The Spark API docs show how to get a pretty-print snippet from a Dataset or DataFrame sent to stdout.

Can this output be directed to a log4j logger? Alternatively: can someone share code that will create output formatted similarly to df.show()?

Is there a way to do this that allows stdout to go to the console both before and after pushing the .show() output to the logger?

http://spark.apache.org/docs/latest/sql-programming-guide.html

val df = spark.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

Answer 1:


The showString() function from teserecter's answer comes from Spark's own code (Dataset.scala).

You can't call that function from your own code because it's package private, but you can place the following snippet in a file DatasetShims.scala in your source tree and mix the trait into your classes to access it.

// Declared inside the org.apache.spark.sql package so it can reach the
// package-private Dataset.showString method.
package org.apache.spark.sql

trait DatasetShims {
  implicit class DatasetHelper[T](ds: Dataset[T]) {
    // Same defaults as Dataset.show(); the leading "\n" makes the table
    // start on its own line in log output.
    def toShowString(numRows: Int = 20, truncate: Int = 20, vertical: Boolean = false): String =
      "\n" + ds.showString(numRows, truncate, vertical)
  }
}
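
For example, mixing the trait into a class lets you send the rendered table to a log4j logger. This is a minimal sketch (the class name MyJob and the logging call are illustrative; it assumes log4j 1.x on the classpath, as Spark bundles):

import org.apache.log4j.Logger
import org.apache.spark.sql.{DataFrame, DatasetShims}

class MyJob extends DatasetShims {
  private val log = Logger.getLogger(getClass)

  def logContents(df: DataFrame): Unit =
    // toShowString prepends "\n", so the table starts on its own line
    log.info("df contents:" + df.toShowString(numRows = 100))
}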



Answer 2:


Put this utility method somewhere in your code to produce a formatted string matching the dataframe.show() output.

Then just include it in your logging output like this (note that a truncateWidth of 0 or less disables truncation, given the guard in the code below):

log.info("at this point the dataframe named df shows as\n" + showString(df, 100, -40))

import org.apache.spark.sql.DataFrame

/**
  * Compose the string representing rows for output, in the same
  * boxed layout that df.show() prints.
  *
  * @param df            the DataFrame to render
  * @param _numRows      number of rows to show
  * @param truncateWidth if set to more than 0, truncates strings to
  *                      `truncateWidth` characters; all cells are aligned right
  */
def showString(
    df: DataFrame,
    _numRows: Int = 20,
    truncateWidth: Int = 20): String = {
  val numRows = _numRows.max(0)
  val takeResult = df.take(numRows + 1)
  val hasMoreData = takeResult.length > numRows
  val data = takeResult.take(numRows)

  // For array values, replace Seq and Array with square brackets.
  // For cells beyond `truncateWidth` characters, keep the first
  // `truncateWidth - 3` characters and append "...".
  val rows: Seq[Seq[String]] = df.schema.fieldNames.toSeq +: data.toSeq.map { row =>
    row.toSeq.map { cell =>
      val str = cell match {
        case null                => "null"
        case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
        case array: Array[_]     => array.mkString("[", ", ", "]")
        case seq: Seq[_]         => seq.mkString("[", ", ", "]")
        case _                   => cell.toString
      }
      if (truncateWidth > 0 && str.length > truncateWidth) {
        // Do not show ellipses for strings shorter than 4 characters.
        if (truncateWidth < 4) str.substring(0, truncateWidth)
        else str.substring(0, truncateWidth - 3) + "..."
      } else {
        str
      }
    }
  }

  // Compose the table in the same +---+ box layout that df.show() prints:
  // each column is padded to its widest cell and right-aligned.
  val colWidths = rows.transpose.map(_.map(_.length).max)
  val sep = colWidths.map("-" * _).mkString("+", "+", "+\n")
  val sb = new StringBuilder(sep)
  rows.zipWithIndex.foreach { case (row, idx) =>
    sb.append(row.zip(colWidths)
      .map { case (cell, width) => " " * (width - cell.length) + cell }
      .mkString("|", "|", "|\n"))
    if (idx == 0) sb.append(sep) // rule under the header row
  }
  sb.append(sep)
  if (hasMoreData) sb.append(s"only showing top $numRows rows\n")
  sb.toString
}
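
For completeness, here is a minimal sketch of the logger wiring assumed by the log.info call above (log4j 1.x, which Spark bundles; the logger name "MyApp" is arbitrary):

import org.apache.log4j.Logger

val log: Logger = Logger.getLogger("MyApp")
log.info("df contents:\n" + showString(df, 100))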


Source: https://stackoverflow.com/questions/41600328/how-to-redirect-scala-spark-dataset-show-to-log4j-logger
