View RDD contents in Python Spark?

Backend | Unresolved | 6 answers | 932 views
醉酒成梦 · 2020-11-29 03:40

Running a simple app in pyspark. How can I view the contents of the resulting RDD?

from operator import add

f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
6 Answers
  •  无人及你 · 2020-11-29 04:22

    Try this:

    data = f.flatMap(lambda x: x.split(' '))           # one record per word
    pairs = data.map(lambda x: (x, 1))                 # (word, 1) pairs; renamed from `map` to avoid shadowing the builtin
    mapreduce = pairs.reduceByKey(lambda x, y: x + y)  # sum the counts per word
    result = mapreduce.collect()                       # bring the results to the driver as a list
    

    Please note that when you run collect(), the RDD, which is a distributed dataset, is aggregated at the driver node and essentially converted to a Python list. So obviously it is not a good idea to collect() a 2 TB dataset. If all you need is a couple of samples from your RDD, use take(10).
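
    To see what reduceByKey and collect() produce without a Spark cluster, the same grouped sum can be sketched in plain Python. This is a local illustration of the semantics, not the Spark API; `reduce_by_key` is a hypothetical helper name:

    ```python
    def reduce_by_key(pairs, fn):
        """Fold values per key, mimicking RDD.reduceByKey on a local list."""
        acc = {}
        for k, v in pairs:
            acc[k] = fn(acc[k], v) if k in acc else v
        return list(acc.items())

    words = "to be or not to be".split(' ')
    pairs = [(w, 1) for w in words]           # same shape as the map() step
    counts = reduce_by_key(pairs, lambda x, y: x + y)

    # collect() returns the whole result; take(n) would return only the first n
    print(sorted(counts))  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
    ```

    The dictionary plays the role of the per-key aggregation that Spark performs across partitions before the driver ever sees the data.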
