I am looking for a way to log additional data when executing code on Apache Spark nodes, which could help investigate issues that might appear during execution.
If you need some code to be executed before and after a map, filter, or other RDD function, try to use mapPartitions, where the underlying iterator is passed explicitly.
Example:
val log = ??? // this gets captured and produces a serialization error
rdd.map { x =>
  log.info(x)
  x + 1
}
Becomes:
rdd.mapPartitions { it =>
  val log = ??? // this is freshly initialized on the worker nodes
  it.map { x =>
    log.info(x)
    x + 1
  }
}
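As a self-contained sketch, assuming log4j is on the executor classpath (the logger name, app setup, and the increment are illustrative, not part of the original):

import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}

object PartitionLogging {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-logging").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100)

    val result = rdd.mapPartitions { it =>
      // Created on the executor for each partition, so it is never
      // serialized from the driver.
      val log = Logger.getLogger("partition.logger")
      it.map { x =>
        log.info(s"processing $x")
        x + 1
      }
    }

    result.count() // trigger execution
    sc.stop()
  }
}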
Internally, the basic RDD transformations such as map and filter are themselves implemented in terms of mapPartitions, so you lose nothing by calling it directly.
Make sure to handle the partitioner explicitly and not to lose it: see the Scaladoc for the preservesPartitioning parameter; this is critical for performance.
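For example, on a keyed RDD whose keys you do not change, you can tell Spark the partitioner survives, so a later reduceByKey or join on the same keys avoids an extra shuffle. A minimal sketch (the pair RDD and logger setup are illustrative):

import org.apache.log4j.Logger
import org.apache.spark.HashPartitioner

val byKey = rdd.map(x => (x % 10, x)).partitionBy(new HashPartitioner(4))

val logged = byKey.mapPartitions({ it =>
  val log = Logger.getLogger("partition.logger")
  it.map { case (k, v) =>
    log.info(s"key=$k value=$v")
    (k, v + 1) // keys are untouched, so the partitioner is still valid
  }
}, preservesPartitioning = true)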