PySpark logging from the executor

醉梦人生 2020-11-29 03:11

What is the correct way to access the log4j logger of Spark using pyspark on an executor?

It's easy to do so in the driver, but I cannot seem to understand how to access the logging functionality on the executors.
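
For context, the driver-side access goes through the Py4J gateway, roughly like this (a sketch; sc._jvm is an internal Spark attribute and this only works on the driver, where a JVM-backed SparkContext exists):

    # driver only: reach the JVM's log4j through the Py4J gateway
    log4j = spark.sparkContext._jvm.org.apache.log4j
    log = log4j.LogManager.getLogger(__name__)
    log.info("Logged from the driver via log4j")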

3 Answers
  •  既然无缘 2020-11-29 03:32

    You cannot use the local log4j logger on executors. The Python workers spawned by the executor JVMs have no "callback" connection to the JVM; they just receive commands. But there is a way to log from executors using standard Python logging and have the logs captured by YARN.

    Place a Python module on your HDFS that configures logging once per Python worker and proxies the logging functions (name it logger.py):

    import os
    import logging
    import sys

    class YarnLogger:
        @staticmethod
        def setup_logger():
            # LOG_DIRS is set by YARN to the container's log directories; without
            # it we are not running under YARN, so skip file logging.
            if 'LOG_DIRS' not in os.environ:
                sys.stderr.write('Missing LOG_DIRS environment variable, pyspark logging disabled\n')
                return

            # Log to pyspark.log in the first container log directory so that
            # YARN log aggregation collects it alongside stdout/stderr.
            file = os.environ['LOG_DIRS'].split(',')[0] + '/pyspark.log'
            logging.basicConfig(filename=file, level=logging.INFO,
                    format='%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s')

        def __getattr__(self, key):
            # Proxy info/warning/error/... to the standard logging module.
            return getattr(logging, key)

    YarnLogger.setup_logger()
    

    Then import this module inside your application:

    spark.sparkContext.addPyFile('hdfs:///path/to/logger.py')
    import logger
    logger = logger.YarnLogger()
    

    You can then use it inside your pyspark functions like the normal logging library:

    def map_sth(s):
        logger.info("Mapping " + str(s))
        return s
    
    spark.range(10).rdd.map(map_sth).count()
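
    The same pattern works per partition too. For example (a sketch, assuming the logger module above has already been shipped with addPyFile), you can record which host handled each partition:

    import socket

    def log_partition(index, rows):
        rows = list(rows)
        # Written to pyspark.log in this worker's YARN container log directory
        logger.info("partition %d with %d rows handled on %s", index, len(rows), socket.gethostname())
        return rows

    spark.sparkContext.parallelize(range(100), 4).mapPartitionsWithIndex(log_partition).count()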
    

    The pyspark.log file will be visible in the YARN Resource Manager UI and will be collected when the application finishes, so you can access these logs later with yarn logs -applicationId .....
