Question
Why does the counter I wrote with PySpark below not always give me the right result? Is it related to the global counter?
def increment_counter():
    global counter
    counter += 1

def get_number_of_element(rdd):
    global counter
    counter = 0
    rdd.foreach(lambda x: increment_counter())
    return counter
Answer 1:
Your global variable is only defined on the driver node, so the code will only appear to work while everything runs in a single local process.
As soon as the job is distributed across multiple worker processes, those workers have no access to the driver's counter
variable; each one simply creates its own copy in its own process and increments that. The final result therefore reflects only the increments performed in the driver process.
What you are trying to do is a common need, though, and it is covered by Spark's accumulator feature. Accumulators are distributed to the workers and collected back at the end of the job, so the total reflects the increments from all nodes rather than only the driver.
Accumulators - Spark Programming Guide
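Here is a minimal sketch of the accumulator-based equivalent of the question's function. The master string "local[2]" and the app name "counter-example" are illustrative placeholders; in practice you would use your existing SparkContext.

from pyspark import SparkContext

# Hypothetical local context for illustration only.
sc = SparkContext("local[2]", "counter-example")

def get_number_of_element(rdd):
    # The accumulator is created on the driver; worker tasks can only add to it.
    counter = sc.accumulator(0)
    rdd.foreach(lambda x: counter.add(1))
    # .value is readable only on the driver, after the action has completed.
    return counter.value

rdd = sc.parallelize(range(100))
print(get_number_of_element(rdd))  # prints 100

One caveat worth knowing: Spark guarantees that each task's update is applied exactly once only for accumulators used inside actions such as foreach; updates made inside transformations may be applied more than once if a task is re-executed.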
Source: https://stackoverflow.com/questions/40873538/global-counter-in-pyspark