How do I unit test PySpark programs?

你的背包 2020-12-12 17:01

My current Java/Spark Unit Test approach works (detailed here) by instantiating a SparkContext using "local" and running unit tests using JUnit.

The code has to be
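
(For illustration, here is a minimal sketch of the equivalent setup in PySpark, driving a local-master SparkContext from Python's built-in unittest module; the class and test names below are hypothetical and not from the original question.)

    import unittest
    from pyspark import SparkConf, SparkContext


    class PySparkLocalTest(unittest.TestCase):
        # Start a single local SparkContext for the whole test class,
        # mirroring the JUnit-with-"local"-master approach described above.
        @classmethod
        def setUpClass(cls):
            conf = SparkConf().setMaster("local[2]").setAppName("pyspark-unit-test")
            cls.sc = SparkContext(conf=conf)

        @classmethod
        def tearDownClass(cls):
            cls.sc.stop()

        def test_word_count(self):
            rdd = self.sc.parallelize(["a b", "b c"])
            counts = dict(
                rdd.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b)
                   .collect())
            self.assertEqual(counts["b"], 2)


    if __name__ == "__main__":
        unittest.main()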

7 Answers
  •  暖寄归人
    2020-12-12 17:59

    I use pytest, which supports test fixtures, so you can instantiate a PySpark context and inject it into every test that requires it. Something along the lines of:

    import pytest
    from pyspark import SparkConf, SparkContext

    # Session-scoped, parametrized fixture: each param carries a marker so the
    # generated tests can be selected with `-m spark_local` or `-m spark_yarn`.
    @pytest.fixture(scope="session",
                    params=[pytest.mark.spark_local('local'),
                            pytest.mark.spark_yarn('yarn')])
    def spark_context(request):
        if request.param == 'local':
            conf = (SparkConf()
                    .setMaster("local[2]")
                    .setAppName("pytest-pyspark-local-testing")
                    )
        elif request.param == 'yarn':
            conf = (SparkConf()
                    .setMaster("yarn-client")
                    .setAppName("pytest-pyspark-yarn-testing")
                    .set("spark.executor.memory", "1g")
                    .set("spark.executor.instances", 2)
                    )

        sc = SparkContext(conf=conf)
        # Make sure the context is stopped when the test session ends.
        request.addfinalizer(lambda: sc.stop())
        return sc

    # The name must start with `test_` so that pytest collects it.
    def test_my_case_that_requires_sc(spark_context):
        assert spark_context.textFile('/path/to/a/file').count() == 10


    Then you can run the tests in local mode by calling py.test -m spark_local or in YARN with py.test -m spark_yarn. This has worked pretty well for me.
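
    A note on the marker syntax: attaching pytest.mark.* objects directly to fixture params, as in the snippet above, was deprecated and later removed in newer pytest releases (around pytest 4). Here is a hedged sketch of the same fixture declaration using the current pytest.param syntax; the fixture body is unchanged:

    import pytest

    # Same parametrized fixture as above, written with pytest.param so it runs
    # on current pytest versions; the body of spark_context stays identical.
    @pytest.fixture(scope="session",
                    params=[pytest.param('local', marks=pytest.mark.spark_local),
                            pytest.param('yarn', marks=pytest.mark.spark_yarn)])
    def spark_context(request):
        ...  # build the SparkConf, create the SparkContext, register the finalizer as above

    Recent pytest versions also warn about unregistered markers, so declaring spark_local and spark_yarn under markers = in pytest.ini keeps the -m selection warning-free.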
