Question
I am trying to access the dependencies of an RDD. In Scala the code is pretty simple:
scala> val myRdd = sc.parallelize(0 to 9).groupBy(_ % 2)
myRdd: org.apache.spark.rdd.RDD[(Int, Iterable[Int])] = ShuffledRDD[2] at groupBy at <console>:24
scala> myRdd.dependencies
res0: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.ShuffleDependency@6c427386)
But dependencies is not available in PySpark. Any pointers on how I can access them?
>>> myRdd.dependencies
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PipelinedRDD' object has no attribute 'dependencies'
Answer 1:
There is no supported way to do it, because it is not that meaningful. You can reach the underlying JVM RDD through the Py4J gateway:
rdd = sc.parallelize([1, 2, 3]).map(lambda x: x)
deps = sc._jvm.org.apache.spark.api.java.JavaRDD.toRDD(rdd._jrdd).dependencies()
print(deps)
## List(org.apache.spark.OneToOneDependency@63b86b0d)
for i in range(deps.size()):
    print(deps.apply(i))
## org.apache.spark.OneToOneDependency@63b86b0d
but I don't think it will get you far.
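If you do want to poke a bit further along the same lines, the Py4J handle also lets you look at each Dependency object and its parent RDD. The sketch below is a hedged extension of the snippet above, not a supported API: _jvm and _jrdd are private internals that can change between Spark versions, and because Python transformations are pipelined into a single JVM-side PythonRDD, the lineage you see there does not correspond one-to-one to your Python code.

# Minimal sketch: inspect each JVM-side Dependency of a PySpark RDD.
# Relies on private internals (_jvm, _jrdd), so treat it as exploratory only.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

grouped = sc.parallelize(range(10)).groupBy(lambda x: x % 2)

# Convert the Python RDD's JavaRDD handle to a Scala RDD, as in the answer above
jrdd = sc._jvm.org.apache.spark.api.java.JavaRDD.toRDD(grouped._jrdd)

deps = jrdd.dependencies()          # Scala Seq[Dependency[_]]
for i in range(deps.size()):
    dep = deps.apply(i)
    # Dependency.rdd() returns the parent RDD; getClass() comes from java.lang.Object
    print(dep.getClass().getName(), "-> parent RDD id:", dep.rdd().id())

The class names and RDD ids you get back describe the JVM execution graph (PythonRDD, ShuffledRDD, and friends) rather than the Python-level transformations, which is why the answer says this will not get you far.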
Source: https://stackoverflow.com/questions/47581681/access-dependencies-available-in-scala-but-no-pyspark