I am looking for a solution to be able to log additional data when executing code on Apache Spark nodes, which could help investigate later some issues that might appear during execution.
If you need some code to be executed before and after a map, filter, or other RDD function, try to use mapPartitions, where the underlying iterator is passed explicitly.
Example:
val log = ??? // this gets captured and produces a serialization error
rdd.map { x =>
  log.info(x)
  x + 1
}
Becomes:
rdd.mapPartitions { it =>
  val log = ??? // this is freshly initialized on the worker nodes
  it.map { x =>
    log.info(x)
    x + 1
  }
}
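For completeness, here is a minimal self-contained sketch of this pattern; the choice of Log4j as the logging backend and the names PerPartitionLogging and "worker-side" are assumptions for illustration, any per-JVM logger works the same way:

import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative object and logger names, not a prescribed setup.
object PerPartitionLogging {
  def main(args: Array[String]): Unit = {
    // Local master so the sketch can be run as a quick test.
    val conf = new SparkConf().setAppName("per-partition-logging").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val rdd = sc.parallelize(1 to 100, numSlices = 4)

    val incremented = rdd.mapPartitions { it =>
      // Initialized on the worker JVM, so nothing non-serializable
      // is captured by the closure shipped from the driver.
      val log = Logger.getLogger("worker-side")
      it.map { x =>
        log.info(s"processing $x")
        x + 1
      }
    }

    println(incremented.count()) // forces evaluation; log lines appear in the executor logs
    sc.stop()
  }
}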
Under the hood, every basic RDD function is implemented with mapPartitions.
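As a sketch of that equivalence (mapViaPartitions is a hypothetical helper, not a Spark API), a plain map can be rewritten as a mapPartitions over each partition's iterator:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of Spark: behaves like rdd.map(f).
def mapViaPartitions[T, U: ClassTag](rdd: RDD[T])(f: T => U): RDD[U] =
  rdd.mapPartitions(iter => iter.map(f))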
Make sure to handle the partitioner explicitly and not to lose it: see the Scaladoc for the preservesPartitioning parameter of mapPartitions; this is critical for performance.
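For example, here is a minimal sketch (reusing the SparkContext sc and the illustrative "worker-side" logger name from the sketch above) showing that preservesPartitioning = true keeps the partitioner across a mapPartitions when the keys are left untouched:

import org.apache.log4j.Logger
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(1 to 100).map(x => (x % 10, x)).partitionBy(new HashPartitioner(4))

// The keys are not modified, so the existing HashPartitioner is still valid.
val logged = pairs.mapPartitions(
  { it =>
    val log = Logger.getLogger("worker-side")
    it.map { case (k, v) =>
      log.info(s"key $k value $v")
      (k, v)
    }
  },
  preservesPartitioning = true
)

assert(logged.partitioner == pairs.partitioner) // the partitioner survives, no later re-shuffle needed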