Question
Is there a way to get the first 163 rows of an RDD without converting it to a DataFrame?
I've tried something like newrdd = rdd.take(163), but that returns a list, and rdd.collect() returns the whole RDD.
Is there a way to do this? Or, if not, is there a way to convert a list back into an RDD?
Answer 1:
It is not very efficient, but you can zipWithIndex and filter:
rdd.zipWithIndex().filter(lambda vi: vi[1] < 163).keys()
In practice it makes more sense to simply take and parallelize:
sc.parallelize(rdd.take(163))
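As a rough illustration of what these two approaches compute (no Spark required to run this), here is a plain-Python analogue: `enumerate` plays the role of `zipWithIndex`, a list comprehension stands in for `filter`/`keys`, and slicing stands in for `take` followed by `parallelize`. The variable names are made up for the sketch.

```python
data = list(range(1000))  # stand-in for the RDD's elements

# Approach 1: zipWithIndex + filter + keys.
# zipWithIndex pairs each element with its index as (value, index).
indexed = [(v, i) for i, v in enumerate(data)]
kept = [vi for vi in indexed if vi[1] < 163]  # filter(lambda vi: vi[1] < 163)
first_163_a = [v for v, _ in kept]            # keys() drops the index

# Approach 2: take(163) pulls the first 163 elements to the driver as a
# local list; sc.parallelize would then redistribute that list as an RDD.
# Locally this is just a slice.
first_163_b = data[:163]

assert first_163_a == first_163_b
print(len(first_163_a))  # → 163
```

Note the trade-off: approach 1 stays distributed but touches every partition, while approach 2 moves the 163 rows through the driver, which is fine for small counts like this.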
Source: https://stackoverflow.com/questions/34213846/pyspark-rdd-collect-first-163-rows