Unit test PySpark code using Python


野趣味 2020-12-20 20:01

I have a PySpark script like the one below, and I want to unit test a function in it.

def rename_chars(column_name):
    chars = ((\'


      
      
        
4 answers

青春惊慌失措 (original poster) 2020-12-20 20:26

Here's a lightweight way to test your function. You don't need to download Spark to run PySpark tests, which is the approach the accepted answer outlines. Downloading Spark is an option, but it's not necessary. Here's the test:

import pysparktestingexample.stackoverflow as SO
from chispa import assert_df_equality

def test_column_names(spark):
    # Input column names contain spaces and dots, the characters to be replaced.
    source_data = [
        ("jose", "oak", "switch")
    ]
    source_df = spark.createDataFrame(source_data, ["some first name", "some.tree.type", "a gaming.system"])

    actual_df = SO.column_names(source_df)

    # Same row; each space becomes "_&" and each dot becomes "_$".
    expected_data = [
        ("jose", "oak", "switch")
    ]
    expected_df = spark.createDataFrame(expected_data, ["some_&first_&name", "some_$tree_$type", "a_&gaming_$system"])

    # Compares both schema and rows, and raises on any mismatch.
    assert_df_equality(actual_df, expected_df)


The SparkSession used by the test is defined in the tests/conftest.py file:

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope='session')
def spark():
    return SparkSession.builder \
      .master("local") \
      .appName("chispa") \
      .getOrCreate()


The test uses the assert_df_equality function defined in the chispa library.

Here's your code and the test in a GitHub repo.
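
The string-renaming helper itself needs no Spark at all, so it can also be tested as a plain Python function. The question's `rename_chars` body is truncated above, so the following is only a minimal sketch: the replacement pairs are an assumption inferred from the expected column names in the test (space to `_&`, dot to `_$`), and the `CHARS` constant is a hypothetical name.

```python
from functools import reduce

# Assumed mapping, inferred from the expected column names in the test:
# a space becomes "_&" and a dot becomes "_$".
CHARS = ((" ", "_&"), (".", "_$"))


def rename_chars(column_name):
    """Apply every (old, new) replacement to a single column name."""
    return reduce(lambda name, pair: name.replace(*pair), CHARS, column_name)


def test_rename_chars():
    # Plain string assertions; no SparkSession fixture is required.
    assert rename_chars("some first name") == "some_&first_&name"
    assert rename_chars("some.tree.type") == "some_$tree_$type"
    assert rename_chars("a gaming.system") == "a_&gaming_$system"
```

A test like this runs in milliseconds under pytest, which leaves the DataFrame-level test to cover only the wiring between the helper and `withColumnRenamed`.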

pytest is generally preferred over unittest in the Python community. This blog post explains how to test PySpark programs and, as it happens, includes a modify_column_names function that would let you rename these columns more elegantly:

def modify_column_names(df, fun):
    for col_name in df.columns:
        df = df.withColumnRenamed(col_name, fun(col_name))
    return df
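
Because the loop only touches `df.columns` and `df.withColumnRenamed`, its logic can be exercised with a small stand-in object. The sketch below is illustrative only: `FakeDF` and `to_snake` are hypothetical names that appear in neither the question nor the blog post, and a real test would still run against a DataFrame built from the `spark` fixture.

```python
class FakeDF:
    """Hypothetical stand-in exposing only the two members the loop uses."""

    def __init__(self, columns):
        self.columns = list(columns)

    def withColumnRenamed(self, existing, new):
        # Like a Spark DataFrame, return a new object instead of mutating.
        return FakeDF([new if c == existing else c for c in self.columns])


def modify_column_names(df, fun):
    # Same loop as above: rename each column to fun(column name).
    for col_name in df.columns:
        df = df.withColumnRenamed(col_name, fun(col_name))
    return df


def to_snake(name):
    # Hypothetical renaming rule: spaces and dots both become underscores.
    return name.replace(" ", "_").replace(".", "_")


renamed = modify_column_names(FakeDF(["some first name", "some.tree.type"]), to_snake)
print(renamed.columns)
```

Passing the renaming rule in as `fun` keeps the string logic separate from the DataFrame plumbing, so each part can be tested on its own.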

    
             
                                                        
            
            
              
                