How to apply multiple filters in a for loop in PySpark

Question


I am trying to apply filters to several columns of an RDD. I want to pass in a list of indices as a parameter to specify which columns to filter on, but PySpark only applies the last filter.

I've broken the code down into some simple test cases and tried the non-looped version, and that one works.

test_input = [('0', '00'), ('1', '1'), ('', '22'), ('', '3')]

rdd = sc.parallelize(test_input, 1)
# Index 0 needs to be longer than length 0
# Index 1 needs to be longer than length 1
for i in [0,1]:
    rdd = rdd.filter(lambda arr: len(arr[i]) > i)

rdd.top(5)

# rdd.top(5) gives [('0', '00'), ('', '22')]
# Only 2nd filter applied

VS

test_input = [('0', '00'), ('1', '1'), ('', '22'), ('', '3')]

rdd = sc.parallelize(test_input, 1)
rdd = rdd.filter(lambda arr: len(arr[0]) > 0)
rdd = rdd.filter(lambda arr: len(arr[1]) > 1)

rdd.top(5)
# rdd.top(5) gives [('0', '00')] as expected

I expected the loop to give results identical to the non-looped version.
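
The likely cause is Python's late-binding closures: each lambda captures the variable i itself, not its value at definition time, and because PySpark filters are evaluated lazily, every lambda sees i == 1 by the time rdd.top(5) finally runs. A minimal sketch of a common workaround (not from the original post) freezes the current value of i as a default argument:

test_input = [('0', '00'), ('1', '1'), ('', '22'), ('', '3')]

rdd = sc.parallelize(test_input, 1)
for i in [0, 1]:
    # i=i binds the current loop value into each lambda,
    # so the first filter checks index 0 and the second index 1
    rdd = rdd.filter(lambda arr, i=i: len(arr[i]) > i)

rdd.top(5)
# gives [('0', '00')], matching the non-looped version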

Source: https://stackoverflow.com/questions/57154430/how-to-apply-multiple-filters-in-a-for-loop-for-pyspark
