Question
The Spark API docs show how to get a pretty-printed snippet of a Dataset or DataFrame sent to stdout.
Can this output be directed to a log4j logger? Alternatively, can someone share code that produces output formatted like df.show()?
Is there a way to do this that allows stdout to go to the console both before and after pushing the .show() output to the logger?
http://spark.apache.org/docs/latest/sql-programming-guide.htm
val df = spark.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
Answer 1:
The showString() function from teserecter's answer comes from Spark's own code (Dataset.scala).
You can't use that function from your own code because it is package-private, but you can place the following snippet in a file DatasetShims.scala in your source tree and mix the trait into your classes to access the function (a usage sketch follows the snippet).
package org.apache.spark.sql

trait DatasetShims {
  implicit class DatasetHelper[T](ds: Dataset[T]) {
    def toShowString(numRows: Int = 20, truncate: Int = 20, vertical: Boolean = false): String =
      "\n" + ds.showString(numRows, truncate, vertical)
  }
}
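For example, a minimal usage sketch (the object name, logger setup, and input path below are illustrative assumptions, not part of the original answer):

import org.apache.log4j.Logger
import org.apache.spark.sql.{DatasetShims, SparkSession}

// Hypothetical driver object that mixes in the shim so toShowString is in scope.
object PeopleJob extends DatasetShims {
  private val logger = Logger.getLogger(getClass)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PeopleJob").getOrCreate()
    val df = spark.read.json("examples/src/main/resources/people.json")
    // The table goes to the log4j appender; toShowString already prepends a newline.
    logger.info("people DataFrame:" + df.toShowString())
    // stdout remains available before and after the logging call.
    println("this still prints to the console")
    spark.stop()
  }
}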
Answer 2:
Put this utility method somewhere in your code to produce a formatted string in the same layout as dataframe.show().
Then just include it in your logging output, for example (a fuller usage sketch follows the method below):
log.info("at this point the dataframe named df shows as\n" + showString(df, 100, -40))
// Requires: import org.apache.spark.sql.DataFrame
/**
 * Compose the string representing rows for output,
 * in the same layout that df.show() prints.
 *
 * @param _numRows      Number of rows to show
 * @param truncateWidth If set to more than 0, truncates strings to `truncateWidth` characters
 *                      and all cells will be aligned right.
 */
def showString(
    df: DataFrame,
    _numRows: Int = 20,
    truncateWidth: Int = 20
  ): String = {
  val numRows = _numRows.max(0)
  val takeResult = df.take(numRows + 1)
  val hasMoreData = takeResult.length > numRows
  val data = takeResult.take(numRows)
  // For array values, replace Seq and Array with square brackets.
  // For cells longer than `truncateWidth` characters, keep the first
  // `truncateWidth - 3` characters and append "...".
  val rows: Seq[Seq[String]] = df.schema.fieldNames.toSeq +: data.map { row =>
    row.toSeq.map { cell =>
      val str = cell match {
        case null => "null"
        case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
        case array: Array[_] => array.mkString("[", ", ", "]")
        case seq: Seq[_] => seq.mkString("[", ", ", "]")
        case _ => cell.toString
      }
      if (truncateWidth > 0 && str.length > truncateWidth) {
        // Do not show ellipses for strings shorter than 4 characters.
        if (truncateWidth < 4) str.substring(0, truncateWidth)
        else str.substring(0, truncateWidth - 3) + "..."
      } else {
        str
      }
    }: Seq[String]
  }
  // Build the table the same way Dataset.showString does:
  // each column is at least 3 characters wide, otherwise as wide as its widest cell.
  val colWidths = Array.fill(df.schema.fieldNames.length)(3)
  for (row <- rows; (cell, i) <- row.zipWithIndex) {
    colWidths(i) = math.max(colWidths(i), cell.length)
  }
  // Right-align cells when truncating (as show() does), left-align otherwise.
  def pad(cell: String, width: Int): String =
    if (truncateWidth > 0) " " * (width - cell.length) + cell
    else cell + " " * (width - cell.length)
  val sep = colWidths.map("-" * _).mkString("+", "+", "+\n")
  val sb = new StringBuilder(sep)
  // Header row, separator, data rows, closing separator.
  sb.append(rows.head.zipWithIndex.map { case (cell, i) => pad(cell, colWidths(i)) }
    .mkString("|", "|", "|\n"))
  sb.append(sep)
  rows.tail.foreach { row =>
    sb.append(row.zipWithIndex.map { case (cell, i) => pad(cell, colWidths(i)) }
      .mkString("|", "|", "|\n"))
  }
  sb.append(sep)
  if (hasMoreData) {
    val rowsString = if (numRows == 1) "row" else "rows"
    sb.append(s"only showing top $numRows $rowsString\n")
  }
  sb.toString()
}
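A hedged usage sketch (the logger name is an assumption; configure the appender however your log4j setup requires), showing that stdout stays free for the console before and after the call:

import org.apache.log4j.Logger

val log = Logger.getLogger("DataFrameDump")  // illustrative logger name
println("this line still goes to the console")
// A non-positive truncateWidth (e.g. -40) disables truncation in this utility.
log.info("at this point the dataframe named df shows as\n" + showString(df, 100, -40))
println("and this one does too")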
Source: https://stackoverflow.com/questions/41600328/how-to-redirect-scala-spark-dataset-show-to-log4j-logger