Question
I have an RDD[Double], and I want to divide the RDD into k equal-width intervals, then count the number of elements that fall into each interval.

For example, the RDD is [0,1,2,3,4,5,6,6,7,7,10]. I want to divide it into 10 equal intervals, so the intervals are [0,1), [1,2), [2,3), [3,4), [4,5), [5,6), [6,7), [7,8), [8,9), [9,10].

As you can see, each element of the RDD falls into exactly one interval. Then I want to count the elements in each interval. Here, [0,1), [1,2), [2,3), [3,4), [4,5), and [5,6) each contain one element, [6,7) and [7,8) each contain two elements, [8,9) is empty, and [9,10] contains one element.

Finally, I expect an array like [1,1,1,1,1,1,2,2,0,1].
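The binning described above can be sketched in plain Python (no Spark; this is only an illustration of the interval logic, assuming the final interval is closed on the right as in [9,10]):

```python
data = [0, 1, 2, 3, 4, 5, 6, 6, 7, 7, 10]
k, lo, hi = 10, 0, 10
width = (hi - lo) / k

# build the k intervals: [0,1), [1,2), ..., [9,10]
intervals = [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

counts = []
for i, (a, b) in enumerate(intervals):
    last = (i == k - 1)  # the final interval is closed on the right
    counts.append(sum(1 for x in data if a <= x < b or (last and x == b)))

print(counts)  # [1, 1, 1, 1, 1, 1, 2, 2, 0, 1]
```

Note that 10 lands in the last bucket only because that bucket is right-inclusive; with a half-open [9,10) it would be dropped.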
Answer 1:
Try this. I have assumed that the lower bound of each range is inclusive and the upper bound exclusive; please confirm. For example, for the range [0,1), an element x belongs to it when x >= 0 and x < 1.
# build the 10 half-open ranges [0,1), [1,2), ..., [9,10)
array_range = [(i, i + 1) for i in range(10)]
countElementsWithinRange = []
data = rdd.collect()  # bring the RDD's elements to the driver
for lower, upper in array_range:
    counter = 0
    for element in data:
        if element >= lower and element < upper:
            counter += 1
    countElementsWithinRange.append(counter)
print(data)
# [0, 1, 2, 3, 4, 5, 6, 6, 7, 7, 10]
print(countElementsWithinRange)
# [1, 1, 1, 1, 1, 1, 2, 2, 0, 0]
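Collecting the whole RDD to the driver does not scale; the same counts can be produced in a single fold-style pass by mapping each value to a bucket index. The sketch below shows that logic in plain Python (PySpark also ships `RDD.histogram(k)`, which computes equal-width buckets in a distributed way and treats the last bucket as right-inclusive, so it returns the [1,1,1,1,1,1,2,2,0,1] the question asked for):

```python
from functools import reduce

def bin_index(x, lo, hi, k):
    """Map a value to its bucket index; the max value goes into the last bucket."""
    if x == hi:
        return k - 1
    return int((x - lo) * k / (hi - lo))

data = [0, 1, 2, 3, 4, 5, 6, 6, 7, 7, 10]
k, lo, hi = 10, 0.0, 10.0

def seq_op(counts, x):
    # accumulate one element into its bucket (the per-partition step in Spark)
    counts[bin_index(x, lo, hi, k)] += 1
    return counts

counts = reduce(seq_op, data, [0] * k)
print(counts)  # [1, 1, 1, 1, 1, 1, 2, 2, 0, 1]
```

On an actual RDD, the same shape works with `rdd.aggregate([0] * k, seq_op, comb_op)`, where `comb_op` sums the per-partition count lists element-wise.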
Source: https://stackoverflow.com/questions/62675710/pyspark-how-to-count-the-number-of-each-equal-distance-interval-in-rdd