Question
The Spark API docs show how to get a pretty-printed snippet of a Dataset or DataFrame sent to stdout.
Can this output be directed to a log4j logger? Alternatively, can someone share code that will create output formatted similarly to df.show()?
Is there a way to do this that still allows stdout to go to the console both before and after pushing the .show() output to the logger?
http://spark.apache.org/docs/latest/sql-programming-guide.htm
val df = spark.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// +----+-------+
// | age| name|
// +----+-------+
// |null|Michael|
// | 30| Andy|
// | 19| Justin|
// +----+-------+
Answer 1:
The showString() function from teserecter's answer comes from Spark's own code (Dataset.scala). You can't call that function from your code because it is package-private, but you can place the following snippet in a file DatasetShims.scala in your source tree and mix the trait into your classes to access it.
package org.apache.spark.sql

trait DatasetShims {
  implicit class DatasetHelper[T](ds: Dataset[T]) {
    def toShowString(numRows: Int = 20, truncate: Int = 20, vertical: Boolean = false): String =
      "\n" + ds.showString(numRows, truncate, vertical)
  }
}
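For example, a driver program could mix the trait in and send the table to log4j roughly like this (a minimal sketch: the object name, logger name, and input path are illustrative; only DatasetShims and toShowString come from the snippet above):
import org.apache.log4j.Logger
import org.apache.spark.sql.{DatasetShims, SparkSession}

// Illustrative driver: mixes in the DatasetShims trait defined above so the
// show()-style table goes to log4j instead of stdout.
object ShowToLogExample extends DatasetShims {
  // Logger name is an assumption; use whatever your application configures.
  private val log = Logger.getLogger(getClass.getName)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("show-to-log").master("local[*]").getOrCreate()
    val df = spark.read.json("examples/src/main/resources/people.json")

    // toShowString comes from the implicit class in DatasetShims; ordinary
    // println output elsewhere still reaches the console as usual.
    log.info("people DataFrame:" + df.toShowString(numRows = 20, truncate = 20))

    spark.stop()
  }
}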
Answer 2:
Put this utility method somewhere in your code to produce a formatted string in the dataframe.show() format, then include it in your logging output like this:
log.info("at this point the dataframe named df shows as \n"+showString(df,100,-40))
// Requires: import org.apache.spark.sql.DataFrame
/**
 * Compose the string representing rows for output, in the same format as dataframe.show().
 *
 * @param df            The DataFrame to render
 * @param _numRows      Number of rows to show
 * @param truncateWidth If set to more than 0, truncates strings to `truncateWidth` characters
 *                      and all cells will be aligned right.
 */
def showString(df: DataFrame, _numRows: Int = 20, truncateWidth: Int = 20): String = {
  val numRows = _numRows.max(0)
  val takeResult = df.take(numRows + 1)
  val hasMoreData = takeResult.length > numRows
  val data = takeResult.take(numRows)

  // For array values, replace Seq and Array with square brackets.
  // For cells longer than `truncateWidth` characters, keep the first
  // `truncateWidth - 3` characters and append "...".
  val rows: Seq[Seq[String]] = df.schema.fieldNames.toSeq +: data.map { row =>
    row.toSeq.map { cell =>
      val str = cell match {
        case null => "null"
        case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
        case array: Array[_] => array.mkString("[", ", ", "]")
        case seq: Seq[_] => seq.mkString("[", ", ", "]")
        case _ => cell.toString
      }
      if (truncateWidth > 0 && str.length > truncateWidth) {
        // Do not show ellipses for strings shorter than 4 characters.
        if (truncateWidth < 4) str.substring(0, truncateWidth)
        else str.substring(0, truncateWidth - 3) + "..."
      } else {
        str
      }
    }: Seq[String]
  }

  // Width of each column is the length of its longest cell (header row included).
  val numCols = df.schema.fieldNames.length
  val colWidths = Array.fill(numCols)(0)
  for (row <- rows; (cell, i) <- row.zipWithIndex) {
    colWidths(i) = math.max(colWidths(i), cell.length)
  }

  // Right-align cells when truncating (as df.show() does), otherwise left-align.
  def pad(cell: String, width: Int): String = {
    val padding = " " * (width - cell.length)
    if (truncateWidth > 0) padding + cell else cell + padding
  }

  // Separator line, e.g. +----+-------+
  val sep = colWidths.map("-" * _).mkString("+", "+", "+\n")

  val sb = new StringBuilder
  sb.append(sep)
  sb.append(rows.head.zipWithIndex.map { case (cell, i) => pad(cell, colWidths(i)) }
    .mkString("|", "|", "|\n"))
  sb.append(sep)
  rows.tail.foreach { row =>
    sb.append(row.zipWithIndex.map { case (cell, i) => pad(cell, colWidths(i)) }
      .mkString("|", "|", "|\n"))
  }
  sb.append(sep)
  if (hasMoreData) sb.append(s"only showing top $numRows row${if (numRows == 1) "" else "s"}\n")

  sb.toString()
}
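As a usage sketch (e.g. in the spark-shell, assuming a DataFrame named df and the showString helper above are in scope; the logger name is illustrative), you can guard the call so the df.take() job is skipped when INFO logging is disabled, while stdout keeps working before and after:
import org.apache.log4j.Logger

val log = Logger.getLogger("MyJob") // illustrative logger name

println("this still goes to the console")
if (log.isInfoEnabled) { // showString() calls df.take(), so skip the work when INFO is off
  log.info("at this point the dataframe named df shows as\n" + showString(df, 100, -40))
}
println("and so does this, after the table was logged")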
Source: https://stackoverflow.com/questions/41600328/how-to-redirect-scala-spark-dataset-show-to-log4j-logger