How do I unit test PySpark programs?

你的背包 2020-12-12 17:01

My current Java/Spark unit test approach works (detailed here) by instantiating a SparkContext using "local" and running the unit tests with JUnit.

The code has to be organized to do I/O in one function and then call another that works on the RDDs. How can I achieve the same with a Spark application written in Python?

7 Answers
  • 2020-12-12 18:08

    Here's a solution with pytest if you're using Spark 2.x and SparkSession. I'm also pulling in a third-party package (spark-avro) via spark.jars.packages.

    import logging
    
    import pytest
    from pyspark.sql import SparkSession
    
    def quiet_py4j():
        """Suppress spark logging for the test context."""
        logger = logging.getLogger('py4j')
        logger.setLevel(logging.WARN)
    
    
    @pytest.fixture(scope="session")
    def spark_session(request):
        """Session-scoped fixture that creates a SparkSession for the tests."""
    
        spark = (SparkSession
                 .builder
                 .master('local[2]')
                 .config('spark.jars.packages', 'com.databricks:spark-avro_2.11:3.0.1')  # third-party package mentioned above
                 .appName('pytest-pyspark-local-testing')
                 .enableHiveSupport()
                 .getOrCreate())
        request.addfinalizer(lambda: spark.stop())  # stop Spark once the test session ends
    
        quiet_py4j()
        return spark
    
    
    def test_my_app(spark_session):
        ...  # see the example test sketched after this block
    

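    To make the placeholder test concrete, here is a minimal sketch of what a test body could look like. The DataFrame contents, column names, and assertion are illustrative assumptions of mine, not part of the original answer; the real transformation under test would slot in where the groupBy is.

    def test_word_totals(spark_session):
        # Hypothetical example: build a tiny DataFrame through the shared
        # SparkSession fixture and check a simple aggregation on it.
        df = spark_session.createDataFrame(
            [('spark', 1), ('spark', 2), ('pytest', 3)],
            ['word', 'count'],
        )
        rows = df.groupBy('word').sum('count').collect()
        totals = {row['word']: row['sum(count)'] for row in rows}
        assert totals == {'spark': 3, 'pytest': 3}
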
    Note: if you're using Python 3, I had to set the PYSPARK_PYTHON environment variable explicitly:

    import os
    import sys
    
    IS_PY2 = sys.version_info < (3,)
    
    if not IS_PY2:
        os.environ['PYSPARK_PYTHON'] = 'python3'
    

    Otherwise you get the error:

    Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

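    A variation of the same fix (my own suggestion, not part of the original answer) is to point both PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON at whatever interpreter is running the tests, which avoids hard-coding the name python3:

    import os
    import sys

    # Assumption: sys.executable keeps worker and driver on the exact
    # interpreter that launched pytest, however it happens to be named.
    os.environ.setdefault('PYSPARK_PYTHON', sys.executable)
    os.environ.setdefault('PYSPARK_DRIVER_PYTHON', sys.executable)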