Return an RDD from takeOrdered, instead of a list

Question

I'm using pyspark to do some data cleaning. A very common operation is to take a small-ish subset of a file and export it for inspection:

(self.spark_context.textFile(old_filepath+filename)
    .takeOrdered(100) 
    .saveAsTextFile(new_filepath+filename))

My problem is that takeOrdered is returning a list instead of an RDD, so saveAsTextFile doesn't work.

AttributeError: 'list' object has no attribute 'saveAsTextFile'

Of course, I could implement my own file writer. Or I could convert the list back into an RDD with parallelize. But I'm trying to be a spark purist here.

Isn't there any way to return an RDD from takeOrdered or an equivalent function?

Answer 1:

takeOrdered() is an action, not a transformation, so it returns its result to the driver as a list and cannot return an RDD.
If ordering isn't necessary, the simplest alternative is sample(), which is a transformation and so returns an RDD. Note that it takes a fraction of the data rather than an exact count.
If you do want ordering, you can try some combination of filter() and sortByKey() to reduce the number of elements and sort them. Or, as you suggested, re-parallelize the result of takeOrdered().
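
To make the two ordered options concrete, here is a minimal sketch, not a definitive implementation. The helper names take_ordered_as_rdd and first_n_sorted_rdd are hypothetical; they only wrap the standard RDD methods takeOrdered, sortBy, zipWithIndex, filter and map, plus SparkContext.parallelize. Both assume sc is an existing SparkContext and rdd is an existing RDD.

```python
def take_ordered_as_rdd(sc, rdd, n, key=None):
    """Driver-side route (hypothetical helper).

    takeOrdered() brings the n smallest elements to the driver as a list;
    parallelize() then redistributes that list as a new RDD.
    """
    return sc.parallelize(rdd.takeOrdered(n, key=key))


def first_n_sorted_rdd(rdd, n, key=lambda x: x):
    """Transformation-only route (hypothetical helper).

    Sort the whole RDD, attach each element's position, keep positions
    below n, then drop the position again. Nothing is collected to the
    driver, but the full sort is more expensive than takeOrdered().
    """
    return (rdd.sortBy(key)
               .zipWithIndex()
               .filter(lambda pair: pair[1] < n)
               .map(lambda pair: pair[0]))
```

With the names from the question, take_ordered_as_rdd(self.spark_context, self.spark_context.textFile(old_filepath+filename), 100).saveAsTextFile(new_filepath+filename) would write the 100 smallest lines. For a subset this small, the parallelize route is usually the cheaper of the two, since first_n_sorted_rdd sorts the entire dataset first.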

Source: https://stackoverflow.com/questions/32341897/return-an-rdd-from-takeordered-instead-of-a-list

Tags

python

apache-spark

rdd
