View RDD contents in Python Spark?

醉酒成梦 · 2020-11-29 03:40

Running a simple app in pyspark.

from operator import add

f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
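
This assumes sc is the SparkContext provided by the pyspark shell. As a rough sketch, a standalone version of the same snippet might look like this (the app name and local master URL are placeholders, not taken from the snippet above):

from operator import add
from pyspark import SparkConf, SparkContext

# Placeholder configuration for running outside the pyspark shell
conf = SparkConf().setAppName("wordcount").setMaster("local")
sc = SparkContext(conf=conf)

f = sc.textFile("README.md")
wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)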
6 Answers
  •  刺人心 (OP)
     2020-11-29 04:32

    This error occurs because print is a statement, not a function, in Python 2.6.

    You can either define a helper function that does the printing, or use the __future__ module to treat print as a function:

    >>> from operator import add
    >>> f = sc.textFile("README.md")
    >>> wc = f.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
    >>> def g(x):
    ...     print x
    ...
    >>> wc.foreach(g)
    

    or

    >>> from __future__ import print_function
    >>> wc.foreach(print)
    

    However, I think it would be better to use collect() to bring the RDD contents back to the driver, because foreach executes on the worker nodes and the outputs may not necessarily appear in your driver / shell (it probably will in local mode, but not when running on a cluster).

    >>> for x in wc.collect():
    ...     print x
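
    If the RDD is large, collect() pulls the whole dataset into the driver's memory. As a small sketch, the standard take(n) method returns only the first n elements, which is enough for a quick look (the 5 here is arbitrary):

    >>> # Inspect only the first few (word, count) pairs
    >>> for x in wc.take(5):
    ...     print x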
    
