PySpark logging from the executor

醉梦人生 2020-11-29 03:11

What is the correct way to access the log4j logger of Spark using pyspark on an executor?

It's easy to do so in the driver but I cannot seem to understand how to ac

3 answers
  • 2020-11-29 03:32

    You cannot use the local log4j logger on executors. Python workers spawned by the executor JVMs have no "callback" connection to Java; they just receive commands. But there is a way to log from executors using standard Python logging and have YARN capture the output.

    Place a Python module on HDFS that configures logging once per Python worker and proxies the logging functions (name it logger.py):

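    import logging
    import os
    import sys
    
    class YarnLogger:
        @staticmethod
        def setup_logger():
            # YARN sets LOG_DIRS to the container's log directories.
            if 'LOG_DIRS' not in os.environ:
                sys.stderr.write('Missing LOG_DIRS environment variable, pyspark logging disabled\n')
                return
    
            # Write into the first log directory so YARN collects the file.
            log_file = os.path.join(os.environ['LOG_DIRS'].split(',')[0], 'pyspark.log')
            # datefmt drops the default ",mmm" suffix so milliseconds appear only once.
            logging.basicConfig(filename=log_file, level=logging.INFO,
                    format='%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')
    
        def __getattr__(self, key):
            # Proxy everything else (info, debug, ...) to the logging module.
            return getattr(logging, key)
    
    # Runs once per Python worker, when the module is first imported.
    YarnLogger.setup_logger()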
    

    Then import this module inside your application:

    spark.sparkContext.addPyFile('hdfs:///path/to/logger.py')
    import logger
    logger = logger.YarnLogger()
    

    You can then use it inside your PySpark functions like the normal logging library:

    def map_sth(s):
        logger.info("Mapping " + str(s))
        return s
    
    spark.range(10).rdd.map(map_sth).count()
    

    pyspark.log will be visible in the resource manager and will be collected when the application finishes, so you can access these logs later with yarn logs -applicationId .....

  • 2020-11-29 03:39

    Note that Mariusz's answer returns a proxy to the logging module. This works (upvoted) when your logging demands are very basic. Once you're interested in doing things like configuring multiple logger instances or using multiple handlers, it will be lacking. E.g. if you have a larger set of code that you only want to run when debugging, one of the solutions would be to check a logger instance's isEnabledFor method, like so:

    import logging

    logger = logging.getLogger(__name__)
    if logger.isEnabledFor(logging.DEBUG):
        # do some heavy calculations and call `logger.debug` (or any other logging method, really)
        logger.debug("expensive debug details")
    

    This would fail when the method is called on the logging module, like in Mariusz's answer, because the logging module does not have such an attribute.
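
The difference can be checked directly. This is a minimal sketch using only the standard library; the logger name `demo` is purely illustrative:

```python
import logging

# The module exposes convenience functions such as logging.info,
# but isEnabledFor is defined only on Logger instances.
module_has_it = hasattr(logging, "isEnabledFor")

logger = logging.getLogger("demo")  # "demo" is an arbitrary example name
logger_has_it = hasattr(logger, "isEnabledFor")

print(module_has_it, logger_has_it)
```

A proxy class whose `__getattr__` forwards to the `logging` module therefore raises `AttributeError` for `isEnabledFor`, while a real `Logger` instance supports it.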

    One way to solve this would be to create a spark_logging.py module in which you configure the logging and return a new instance of Logger. The code below shows an example of this, which configures logging using dictConfig. It also adds a filter so that the number of repetitions from all the worker nodes is greatly reduced when using the root logger (filter example is from Christopher Dunn (ref)).

    # spark_logging.py
    import logging
    import logging.config
    import os
    import tempfile
    from logging import *  # gives access to logging.DEBUG etc by aliasing this module for the standard logging module
    
    
    class Unique(logging.Filter):
        """Messages are allowed through just once.
        The 'message' includes substitutions, but is not formatted by the
        handler. If it were, then practically all messages would be unique!
        """
        def __init__(self, name=""):
            logging.Filter.__init__(self, name)
            self.reset()
    
        def reset(self):
            """Act as if nothing has happened."""
            self.__logged = {}
    
        def filter(self, rec):
            """logging.Filter.filter performs an extra filter on the name."""
            return logging.Filter.filter(self, rec) and self.__is_first_time(rec)
    
        def __is_first_time(self, rec):
            """Emit a message only once."""
            msg = rec.getMessage()  # applies args only when present, so a literal % in msg is safe
            if msg in self.__logged:
                self.__logged[msg] += 1
                return False
            else:
                self.__logged[msg] = 1
                return True
    
    
    def getLogger(name, logfile="pyspark.log"):
        """Replaces getLogger from logging to ensure each worker configures
        logging locally."""
    
        try:
            logfile = os.path.join(os.environ['LOG_DIRS'].split(',')[0], logfile)
        except (KeyError, IndexError):
            tmpdir = tempfile.gettempdir()
            logfile = os.path.join(tmpdir, logfile)
            rootlogger = logging.getLogger("")
            rootlogger.addFilter(Unique())
            rootlogger.warning(
                "LOG_DIRS not in environment variables or is empty. Will log to {}."
                .format(logfile))
    
        # Alternatively, load log settings from YAML or use JSON.
        log_settings = {
            'version': 1,
            'disable_existing_loggers': False,
            'handlers': {
                'file': {
                    'class': 'logging.FileHandler',
                    'level': 'DEBUG',
                    'formatter': 'detailed',
                    'filename': logfile
                },
                'default': {
                    'level': 'INFO',
                    'class': 'logging.StreamHandler',
                },
            },
            'formatters': {
                'detailed': {
                    'format': ("%(asctime)s.%(msecs)03d %(levelname)s %(module)s - "
                               "%(funcName)s: %(message)s"),
                    'datefmt': '%Y-%m-%d %H:%M:%S',
                },
            },
            'loggers': {
                'driver': {
                    'level': 'INFO',
                    'handlers': ['file', ]
                },
                'executor': {
                    'level': 'DEBUG',
                    'handlers': ['file', ]
                },
            }
        }
    
        logging.config.dictConfig(log_settings)
        return logging.getLogger(name)
    

    You could then import this module and alias it for logging itself:

    from pyspark.sql import SparkSession
    
    spark = SparkSession \
        .builder \
        .appName("Test logging") \
        .getOrCreate()
    
    try:
        spark.sparkContext.addPyFile('s3://YOUR_BUCKET/spark_logging.py')
    except Exception:
        # Probably running this locally. Make sure to have spark_logging in the PYTHONPATH
        pass
    finally:
        import spark_logging as logging
    
    def map_sth(s):
        log3 = logging.getLogger("executor")
        log3.info("Logging from executor")
    
        if log3.isEnabledFor(logging.DEBUG):
            log3.debug("This statement is only logged when DEBUG is configured.")
    
        return s
    
    def main():
        log2 = logging.getLogger("driver")
        log2.info("Logging from within module function on driver")
        spark.range(100).rdd.map(map_sth).count()
    
    if __name__ == "__main__":
        log1 = logging.getLogger("driver")
        log1.info("logging from module level")
        main()
    

    Like with Mariusz's answer, logs will be accessible using the resource manager (or dumped in your temp-folder when LOG_DIRS is not in your environment variables). The error handling done at the top of this script is added so that you could run this script locally.

    This approach allows more freedom: you could have the executors log to one file and write all kinds of aggregation counts on the driver to another file.

    Note that there is slightly more work to be done in this case, compared to using a class as a proxy for the built-in logging module, as each time you request a logger on the executor instances, it will have to be configured. That likely won't be your main time-hog when doing big data analytics though. ;-)
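
If the repeated configuration ever becomes a concern, one option is to guard it with a module-level flag so that each Python worker process configures logging only on the first call. The following is a sketch under stated assumptions, not part of the answer above: `get_logger` and `_configured` are hypothetical names, and the configuration is trimmed down to a single file handler.

```python
import logging
import logging.config
import os
import tempfile

_configured = False  # module-level, so there is one flag per Python worker process


def get_logger(name, logfile="pyspark.log"):
    """Configure logging on the first call only, then just return loggers."""
    global _configured
    if not _configured:
        # Assumption: fall back to the temp directory when LOG_DIRS is unset or empty.
        log_dir = os.environ.get("LOG_DIRS", "").split(",")[0] or tempfile.gettempdir()
        logging.config.dictConfig({
            "version": 1,
            "disable_existing_loggers": False,
            "handlers": {
                "file": {
                    "class": "logging.FileHandler",
                    "filename": os.path.join(log_dir, logfile),
                },
            },
            "loggers": {"executor": {"level": "DEBUG", "handlers": ["file"]}},
        })
        _configured = True
    return logging.getLogger(name)
```

Because the flag lives in the module, every worker process still configures logging once for itself, which is what is needed on executors.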

  • 2020-11-29 03:52

    I have yet another approach to solving the logging issue in PySpark. The idea is as follows:

    • Use a remote log management service (for example Loggly, CloudWatch on AWS, Application Insights on Azure, etc.)
    • Configure the logging module on both the master node and the worker nodes with the same configuration, so that logs are sent to the service above

    This is a good approach if you are already using cloud services, as many of them also offer log collection and management.

    I have a simple wordcount example on GitHub to demonstrate this approach: https://github.com/chhantyal/wordcount

    This Spark app sends logs to Loggly using the standard logging module from the driver (master node) as well as from the executors (worker nodes).
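
The general shape of this idea can be sketched with the standard library's `logging.handlers.HTTPHandler` as a stand-in. This is not the code from the linked repository: `get_remote_logger`, the host `logs.example.com` and the path `/ingest` are placeholders, and a real service will usually require its own handler, endpoint and token.

```python
import logging
import logging.handlers


def get_remote_logger(name, host="logs.example.com", url="/ingest"):
    """Return a logger that forwards records over HTTPS.

    host and url are placeholders, not a real Loggly or CloudWatch endpoint.
    """
    logger = logging.getLogger(name)
    # Attach the handler only once per Python process (driver or worker).
    if not any(isinstance(h, logging.handlers.HTTPHandler) for h in logger.handlers):
        handler = logging.handlers.HTTPHandler(host, url, method="POST", secure=True)
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger


def map_sth(s):
    # Runs on executors: requesting the logger inside the function means
    # each worker process sets up its own handler on first use.
    get_remote_logger("wordcount").info("Mapping %s", s)
    return s
```

On the driver the same `get_remote_logger` call is used directly, and `map_sth` can be passed to something like `spark.range(10).rdd.map(map_sth).count()`, so both sides share one configuration.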
