How does the pyspark mapPartitions function work?

怎甘沉沦 submitted on 2019-11-28 17:36:44

mapPartitions should be thought of as a map operation over partitions, not over the elements of a partition. Its input is the set of current partitions; its output is another set of partitions.

The function you pass to map must take an individual element of your RDD and return a single transformed element.

The function you pass to mapPartitions must take an iterable over your RDD's element type and return an iterable of the same or some other type.
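A quick way to see the difference in the two signatures is with plain Python standing in for Spark's per-partition machinery — a sketch only; the sample data and helper names here are hypothetical, not from the question:

```python
# map: the function receives one element at a time.
def add_one(x):
    return x + 1

# mapPartitions: the function receives an iterator over a whole
# partition and must itself return an iterator.
def add_one_per_partition(partition):
    return (x + 1 for x in partition)

# Two hypothetical partitions of an RDD of numbers.
partitions = [[1, 2, 3], [4, 5]]

# What Spark conceptually does for map: apply per element.
mapped = [add_one(x) for part in partitions for x in part]

# What Spark conceptually does for mapPartitions: apply per partition.
partition_mapped = [y for part in partitions
                    for y in add_one_per_partition(iter(part))]

print(mapped)            # [2, 3, 4, 5, 6]
print(partition_mapped)  # [2, 3, 4, 5, 6]
```

Both produce the same elements here; the difference is purely in how often the function is called and what it receives.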

In your case you probably just want to do something like

def filterOut2(line):
    # 'line' is a single element of the RDD (here, a list)
    return [x for x in line if x != 2]

filtered_lists = data.map(filterOut2)

If you wanted to use mapPartitions, it would be:

def filterOut2FromPartition(list_of_lists):
    # 'list_of_lists' is an iterator over every element in one partition
    final_iterator = []
    for sub_list in list_of_lists:
        final_iterator.append([x for x in sub_list if x != 2])
    return iter(final_iterator)

filtered_lists = data.mapPartitions(filterOut2FromPartition)

It's easier to use mapPartitions with a generator function using the yield syntax:

def filter_out_2(partition):
    for element in partition:
        if element != 2:
            yield element

filtered_lists = data.mapPartitions(filter_out_2)
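Because the generator only depends on receiving an iterator, you can sanity-check it locally against a plain list standing in for one partition, with no SparkContext required — the sample data here is illustrative:

```python
def filter_out_2(partition):
    # 'partition' is an iterator over the elements of one partition;
    # yielding makes this function itself an iterator, as
    # mapPartitions requires.
    for element in partition:
        if element != 2:
            yield element

# One hypothetical partition, fed in as a plain iterator.
print(list(filter_out_2(iter([1, 2, 3, 2, 4]))))  # [1, 3, 4]
```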

If you need to keep the nested structure (one filtered list per input list), build and yield each sub-list:

def filter_out_2(partition):
    for element in partition:
        sec_iterator = []
        for i in element:
            if i != 2:
                sec_iterator.append(i)
        yield sec_iterator

filtered_lists = data.mapPartitions(filter_out_2)
for i in filtered_lists.collect(): print(i)