How to write unit tests in Spark 2.0+?

Backend · Open · 6 answers · 428 views

Asked by 日久生厌 on 2020-11-29 16:00

I've been trying to find a reasonable way to test SparkSession with the JUnit testing framework. While there seem to be good examples for SparkContext, I couldn't figure out how to get a corresponding example working for SparkSession.

6 Answers

Answered by 情深已故 on 2020-11-29 16:57

    I like to create a SparkSessionTestWrapper trait that can be mixed into test classes. Shankar's approach works, but it's prohibitively slow for test suites with multiple files.

    import org.apache.spark.sql.SparkSession

    trait SparkSessionTestWrapper {

      // Lazily create a single local SparkSession that is shared by
      // every test class that mixes in this trait.
      lazy val spark: SparkSession = {
        SparkSession
          .builder()
          .master("local")
          .appName("spark session")
          .getOrCreate()
      }

    }
    

    The trait can be used as follows:

    import org.scalatest.FunSpec

    class DatasetSpec extends FunSpec with SparkSessionTestWrapper {
    
      import spark.implicits._
    
      describe("#count") {
    
        it("returns a count of all the rows in a DataFrame") {
    
          val sourceDF = Seq(
            ("jets"),
            ("barcelona")
          ).toDF("team")
    
          assert(sourceDF.count === 2)
    
        }
    
      }
    
    }
    

    Check the spark-spec project for a real-life example that uses the SparkSessionTestWrapper approach.

    Update

    The spark-testing-base library automatically adds the SparkSession when certain traits are mixed in to the test class (e.g. when DataFrameSuiteBase is mixed in, you'll have access to the SparkSession via the spark variable).
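    As a rough sketch of what that looks like (assuming the spark-testing-base artifact is on the test classpath; the suite name here is illustrative), mixing in DataFrameSuiteBase gives the test class a ready-made `spark` variable:

    ```scala
    import com.holdenkarau.spark.testing.DataFrameSuiteBase
    import org.scalatest.FunSuite

    class ProvidedSessionSpec extends FunSuite with DataFrameSuiteBase {

      test("spark is provided by DataFrameSuiteBase") {
        // `spark` comes from the mixed-in trait; no builder code needed here.
        import spark.implicits._
        val df = Seq("a", "b").toDF("letter")
        assert(df.count === 2)
      }

    }
    ```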

    I created a separate testing library called spark-fast-tests to give the users full control of the SparkSession when running their tests. I don't think a test helper library should set the SparkSession. Users should be able to start and stop their SparkSession as they see fit (I like to create one SparkSession and use it throughout the test suite run).
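    One way to sketch that kind of explicit lifecycle control (the class name is hypothetical; `BeforeAndAfterAll` is a standard ScalaTest trait) is to stop the session yourself once the whole suite has run:

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.scalatest.{BeforeAndAfterAll, FunSpec}

    class LifecycleSpec extends FunSpec with BeforeAndAfterAll {

      // The test author owns this session, not the helper library.
      lazy val spark: SparkSession = SparkSession
        .builder()
        .master("local")
        .appName("lifecycle spec")
        .getOrCreate()

      override def afterAll(): Unit = {
        spark.stop() // release resources after the last test in the suite
        super.afterAll()
      }

    }
    ```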

    Here's an example of the spark-fast-tests assertSmallDatasetEquality method in action:

    import com.github.mrpowers.spark.fast.tests.DatasetComparer
    import org.apache.spark.sql.functions.col
    import org.scalatest.FunSpec

    class DatasetSpec extends FunSpec with SparkSessionTestWrapper with DatasetComparer {

      import spark.implicits._

      it("aliases a DataFrame") {

        val sourceDF = Seq(
          ("jose"),
          ("li"),
          ("luisa")
        ).toDF("name")

        val actualDF = sourceDF.select(col("name").alias("student"))

        val expectedDF = Seq(
          ("jose"),
          ("li"),
          ("luisa")
        ).toDF("student")

        assertSmallDatasetEquality(actualDF, expectedDF)

      }

    }

