View RDD contents in Python Spark?

后端 未结 6 928
醉酒成梦
醉酒成梦 2020-11-29 03:40

Running a simple app in pyspark.

f = sc.textFile(\"README.md\")
wc = f.flatMap(lambda x: x.split(\' \')).map(lambda x: (x, 1)).reduceByKey(add)
6条回答
  •  孤独总比滥情好
    2020-11-29 04:20

    By latest document, you can use rdd.collect().foreach(println) on the driver to display all, but it may cause memory issues on the driver, best is to use rdd.take(desired_number)

    https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html

    To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).

提交回复
热议问题