I've been trying to find a reasonable way to test SparkSession with the JUnit testing framework. While there seem to be good examples for SparkContext, I couldn't figure out how to get a corresponding example working for SparkSession.
I like to create a SparkSessionTestWrapper trait that can be mixed into test classes. Shankar's approach works, but it's prohibitively slow for test suites with multiple files; the wrapper below calls getOrCreate(), so every test file reuses the same session instead of spinning up a new one.
import org.apache.spark.sql.SparkSession

trait SparkSessionTestWrapper {

  // lazy val + getOrCreate() means one SparkSession is shared across the whole test run
  lazy val spark: SparkSession = {
    SparkSession
      .builder()
      .master("local")
      .appName("spark session")
      .getOrCreate()
  }

}
The trait can be used as follows:
import org.scalatest.FunSpec

class DatasetSpec extends FunSpec with SparkSessionTestWrapper {

  import spark.implicits._

  describe("#count") {

    it("returns a count of all the rows in a DataFrame") {

      val sourceDF = Seq(
        ("jets"),
        ("barcelona")
      ).toDF("team")

      assert(sourceDF.count === 2)

    }

  }

}
Check the spark-spec project for a real-life example that uses the SparkSessionTestWrapper approach.
Update
The spark-testing-base library automatically adds the SparkSession when certain traits are mixed into the test class (e.g. when DataFrameSuiteBase is mixed in, you'll have access to the SparkSession via the spark variable).
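Here's a minimal sketch of what that looks like (this assumes a spark-testing-base version that exposes the SparkSession as spark; the class name and test body are just for illustration):

import com.holdenkarau.spark.testing.DataFrameSuiteBase
import org.scalatest.FunSuite

class ProvidedSessionSpec extends FunSuite with DataFrameSuiteBase {

  test("uses the spark variable the trait provides") {
    // spark is supplied by DataFrameSuiteBase, not by our own wrapper trait
    import spark.implicits._
    val df = Seq("jets", "barcelona").toDF("team")
    assert(df.count === 2)
  }

}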
I created a separate testing library called spark-fast-tests to give users full control of the SparkSession when running their tests. I don't think a test helper library should manage the SparkSession for you. Users should be able to start and stop their SparkSession as they see fit (I like to create one SparkSession and use it throughout the test suite run).
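If you do want to tear the session down after a suite finishes, ScalaTest's BeforeAndAfterAll makes that explicit. A minimal sketch (the class name is hypothetical):

import org.scalatest.{BeforeAndAfterAll, FunSpec}

class TeardownSpec extends FunSpec with SparkSessionTestWrapper with BeforeAndAfterAll {

  override def afterAll(): Unit = {
    // stop the shared session once every test in this suite has run
    spark.stop()
    super.afterAll()
  }

  it("still has a live session inside tests") {
    import spark.implicits._
    assert(Seq("a", "b").toDF("letter").count === 2)
  }

}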
Here's an example of the spark-fast-tests assertSmallDatasetEquality method in action:
import com.github.mrpowers.spark.fast.tests.DatasetComparer
import org.apache.spark.sql.functions.col
import org.scalatest.FunSpec

class DatasetSpec extends FunSpec with SparkSessionTestWrapper with DatasetComparer {

  import spark.implicits._

  it("aliases a DataFrame") {

    val sourceDF = Seq(
      ("jose"),
      ("li"),
      ("luisa")
    ).toDF("name")

    val actualDF = sourceDF.select(col("name").alias("student"))

    val expectedDF = Seq(
      ("jose"),
      ("li"),
      ("luisa")
    ).toDF("student")

    assertSmallDatasetEquality(actualDF, expectedDF)

  }

}
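As the name suggests, assertSmallDatasetEquality is meant for small Datasets: it compares the contents on the driver, so keep the test data tiny to keep the suite fast.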