Question
Why does the counter I wrote with PySpark below not always give me the right result? Is it related to the global counter?
def increment_counter():
    global counter
    counter += 1

def get_number_of_element(rdd):
    global counter
    counter = 0
    rdd.foreach(lambda x: increment_counter())
    return counter
Answer 1:
Your global variable is only defined on the driver node, so the code will only appear to work while everything runs in a single local process.
As soon as the job is distributed across multiple worker processes, those workers have no access to the driver's counter
variable; each one simply creates its own copy in its own process and increments that. The final result therefore reflects only the increments performed in the driver process.
What you are trying to do is a common need, though, and it is covered by Spark's accumulator feature. Accumulators are distributed to the workers and collected back at the end of the job, so the total reflects the increments from all nodes rather than only the driver.
Accumulators - Spark Programming Guide
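Here is a minimal sketch of the accumulator-based equivalent of the question's function. The master string "local[2]" and the app name "counter-example" are illustrative placeholders; in practice you would use your existing SparkContext.

from pyspark import SparkContext

# Hypothetical local context for illustration only.
sc = SparkContext("local[2]", "counter-example")

def get_number_of_element(rdd):
    # The accumulator is created on the driver; worker tasks can only add to it.
    counter = sc.accumulator(0)
    rdd.foreach(lambda x: counter.add(1))
    # .value is readable only on the driver, after the action has completed.
    return counter.value

rdd = sc.parallelize(range(100))
print(get_number_of_element(rdd))  # prints 100

One caveat worth knowing: Spark guarantees that each task's update is applied exactly once only for accumulators used inside actions such as foreach; updates made inside transformations may be applied more than once if a task is re-executed.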
Source: https://stackoverflow.com/questions/40873538/global-counter-in-pyspark