Running a simple word-count app in PySpark:
from operator import add

f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
Or try it step by step:
data = f.flatMap(lambda x: x.split(' '))        # split each line into words
pairs = data.map(lambda x: (x, 1))              # pair each word with a count of 1
counts = pairs.reduceByKey(lambda x, y: x + y)  # sum the counts per word
result = counts.collect()                       # bring all results to the driver
Please note that when you run collect(), the RDD, which is a distributed data set, is aggregated at the driver node and essentially converted to a plain Python list. So it is obviously a bad idea to collect() a 2 TB data set. If all you need is a few samples from your RDD, use take(10) instead.
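For instance, here is a minimal sketch of sampling a handful of results rather than collecting everything, assuming the counts RDD from the steps above:

# take(10) pulls only the first 10 (word, count) pairs to the driver;
# the rest of the RDD stays distributed across the cluster.
sample = counts.take(10)
for word, count in sample:
    print(word, count)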