Should a Spark broadcast variable's type be a number or a string when I try to restart a job from a checkpoint?

Submitted by 魔方 西西 on 2019-12-13 05:47:56

Question


When I set a collection as a broadcast variable, I always get a serialization error. I have already tried Map, HashMap, and Array; all of them failed.


Answer 1:


This is a known Spark bug: https://issues.apache.org/jira/browse/SPARK-5206

You can use a singleton object so that each executor loads the data itself. See https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.java for a full example:

class JavaWordBlacklist {

  private static volatile Broadcast<List<String>> instance = null;

  public static Broadcast<List<String>> getInstance(JavaSparkContext jsc) {
    if (instance == null) {
      synchronized (JavaWordBlacklist.class) {
        if (instance == null) {
          List<String> wordBlacklist = Arrays.asList("a", "b", "c");
          instance = jsc.broadcast(wordBlacklist);
        }
      }
    }
    return instance;
  }
}



public static void main(String[] args) throws Exception {
    ... 
    Function0<JavaStreamingContext> createContextFunc =
        () -> createContext(ip, port, checkpointDirectory, outputPath);

    JavaStreamingContext ssc =
      JavaStreamingContext.getOrCreate(checkpointDirectory, createContextFunc);
    ssc.start();
    // Block the main thread until the streaming job stops or fails.
    ssc.awaitTermination();
}

private static JavaStreamingContext createContext(String ip,
                                                    int port,
                                                    String checkpointDirectory,
                                                    String outputPath) {
    SparkConf sparkConf = new SparkConf().setAppName("JavaRecoverableNetworkWordCount");
    JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
    ssc.checkpoint(checkpointDirectory);
    ...
    wordCounts.foreachRDD((rdd, time) -> {
      // Get or register the blacklist Broadcast
      Broadcast<List<String>> blacklist =
          JavaWordBlacklist.getInstance(new JavaSparkContext(rdd.context()));
      ...
    });
...

}
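
The singleton works because broadcast variables cannot be recovered from a checkpoint in Spark Streaming: after a restart the static field is null again, so the first call to getInstance re-creates the broadcast. The following is a minimal, Spark-free sketch of the same double-checked locking pattern. LazyBlacklist, initCount, and simulateRestart are hypothetical names invented for illustration, and a plain List stands in for the Broadcast:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for JavaWordBlacklist: same double-checked locking,
// but holding a plain List instead of a Spark Broadcast.
class LazyBlacklist {
  private static volatile List<String> instance = null;

  // Counts how many times the initialization actually runs.
  static final AtomicInteger initCount = new AtomicInteger();

  static List<String> getInstance() {
    if (instance == null) {                  // fast path, no lock taken
      synchronized (LazyBlacklist.class) {
        if (instance == null) {              // re-check while holding the lock
          initCount.incrementAndGet();
          instance = Arrays.asList("a", "b", "c");
        }
      }
    }
    return instance;
  }

  // Assumption for illustration: mimics a restart, where static state
  // starts out null again in a fresh JVM.
  static void simulateRestart() {
    synchronized (LazyBlacklist.class) {
      instance = null;
    }
  }

  public static void main(String[] args) {
    getInstance();
    getInstance();
    System.out.println(initCount.get());  // prints 1: initialized only once

    simulateRestart();
    getInstance();
    System.out.println(initCount.get());  // prints 2: re-created after the restart
  }
}
```

In the real job the initialization step would be jsc.broadcast(...), as in JavaWordBlacklist above, and no explicit reset is needed because a restarted driver begins with fresh static state.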



Source: https://stackoverflow.com/questions/52553659/should-spark-broadcast-variables-type-be-number-or-string-when-i-try-to-restart
