Why does join fail with “java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]”?

Happy的楠姐 2020-11-30 19:33

I am using Spark 1.5.

I have two dataframes of the form:

scala> libriFirstTable50Plus3DF
res1: org.apache.spark.sql.DataFrame = [basket_id: string         


        
4 Answers
  •  情书的邮戳
    2020-11-30 20:04

    This happens because Spark tries to perform a broadcast hash join and one of the DataFrames is very large, so broadcasting it takes longer than the default 300-second timeout.
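
    To confirm that a broadcast join is indeed what the planner chose, you can print the physical plan before running the query. A minimal Scala sketch, where otherLargeDF and the join key basket_id are placeholder assumptions rather than names taken from the question:

    // explain() prints the physical plan without executing the join;
    // a "BroadcastHashJoin" node means one side of the join will be broadcast
    libriFirstTable50Plus3DF.join(otherLargeDF, "basket_id").explain()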

    You can:

    1. Set a higher spark.sql.broadcastTimeout to increase the timeout, e.g. spark.conf.set("spark.sql.broadcastTimeout", "36000")
    2. persist() both DataFrames before the join; Spark will then use a shuffle join instead of broadcasting (a sketch of both options follows this list)
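
    A minimal Scala sketch of both options, assuming spark is the active SparkSession; otherLargeDF is a placeholder name for the second DataFrame, not one taken from the question:

    // Option 1: give the broadcast phase more time (value is in seconds)
    spark.conf.set("spark.sql.broadcastTimeout", "36000")

    // Or disable automatic broadcast joins entirely, so the planner falls
    // back to a shuffle-based join regardless of its size estimates
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    // Option 2: persist() both sides before the join; once both inputs are
    // materialized, Spark uses a shuffle join instead of broadcasting
    val left   = libriFirstTable50Plus3DF.persist()
    val right  = otherLargeDF.persist()
    val joined = left.join(right, "basket_id")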

    PySpark

    In PySpark, you can set the config when you build the SparkSession, in the following manner:

    from pyspark.sql import SparkSession

    # Request a higher broadcast timeout (in seconds) at session creation
    spark = SparkSession \
        .builder \
        .appName("Your App") \
        .config("spark.sql.broadcastTimeout", "36000") \
        .getOrCreate()
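
    Note that SparkSession only exists in Spark 2.0 and later. Since the question uses Spark 1.5, a sketch of the equivalent there, assuming the pre-built SQLContext available in spark-shell:

    // Spark 1.x: set the property on the SQLContext instead of a SparkSession
    sqlContext.setConf("spark.sql.broadcastTimeout", "36000")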
    
