How to force DataFrame evaluation in Spark

感情败类 · 2020-11-28 15:27

Sometimes (e.g. for testing and benchmarking) I want to force the execution of the transformations defined on a DataFrame. AFAIK calling an action like count does not ensure that all columns are actually computed; show may only compute a subset of all rows (see the example below).

4 Answers
  •  半阙折子戏 · 2020-11-28 15:53

    It appears that df.cache.count is the way to go: count on its own can be satisfied without evaluating the UDF column (Spark only computes what the action actually needs), whereas caching forces every column to be materialized:

    scala> val myUDF = udf((i:Int) => {if(i==1000) throw new RuntimeException;i})
    myUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(IntegerType)))
    
    scala> val df = sc.parallelize(1 to 1000).toDF("id")
    df: org.apache.spark.sql.DataFrame = [id: int]
    
    scala> df.withColumn("test",myUDF($"id")).show(10)
    [rdd_51_0]
    +---+----+
    | id|test|
    +---+----+
    |  1|   1|
    |  2|   2|
    |  3|   3|
    |  4|   4|
    |  5|   5|
    |  6|   6|
    |  7|   7|
    |  8|   8|
    |  9|   9|
    | 10|  10|
    +---+----+
    only showing top 10 rows
    
    scala> df.withColumn("test",myUDF($"id")).count
    res13: Long = 1000
    
    scala> df.withColumn("test",myUDF($"id")).cache.count
    org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (int) => int)
            at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    .
    .
    .
    Caused by: java.lang.RuntimeException
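
    If you want a reusable way to do this in tests, something like the
    following works (a minimal sketch: the helper name forceEvaluation and
    the unpersist step are my own addition, not part of the answer above):

        import org.apache.spark.sql.DataFrame

        // cache() marks the plan for materialization and count() triggers it,
        // so every column (including UDF columns) is actually evaluated.
        // unpersist() then drops the cached blocks, so a benchmark does not
        // silently keep the data in memory afterwards.
        def forceEvaluation(df: DataFrame): Long = {
          val n = df.cache().count()
          df.unpersist()
          n
        }

    In the example above, forceEvaluation(df.withColumn("test", myUDF($"id")))
    throws the RuntimeException as expected.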
    

