Why does df.limit keep changing in PySpark?


Because Spark is distributed, in general it's not safe to assume deterministic results. Your example is taking the "first" 10,000 rows of a DataFrame. Here, there's ambiguity (and hence non-determinism) in what "first" means. That will depend on the internals of Spark. For example, it could be the first partition that responds to the driver. That partition could change with networking, data locality, etc.
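A minimal sketch of the effect, assuming an existing SparkSession named spark; the synthetic data and column names are illustrative, not from your code:

    from pyspark.sql import functions as F

    # Build a multi-partition DataFrame (illustrative; assumes `spark` exists).
    df = spark.range(0, 1000000).repartition(10).withColumn("v", F.rand(seed=42))

    # With no ordering, the "first 10,000 rows" are whichever rows Spark happens
    # to produce first, so two evaluations are not guaranteed to agree.
    a = {r.id for r in df.limit(10000).collect()}
    b = {r.id for r in df.limit(10000).collect()}
    print(a == b)   # may print False

    # Imposing an explicit order removes the ambiguity about what "first" means.
    c = [r.id for r in df.orderBy("id").limit(10000).collect()]
    d = [r.id for r in df.orderBy("id").limit(10000).collect()]
    print(c == d)   # True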

Even once you cache the data, I still wouldn't rely on getting the same data back every time, though I certainly would expect it to be more consistent than reading from disk.
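As a sketch of that idea, continuing with the df above: caching the limited subset usually stabilises what you get back on repeated evaluation, even though it is not a hard guarantee:

    # Cache the 10,000-row subset; count() forces it to be materialised once.
    subset = df.limit(10000).cache()
    subset.count()

    first = subset.collect()
    again = subset.collect()
    print(first == again)   # typically True once the rows are pinned in memory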

Spark is lazy, so each action you take recalculates the data returned by limit(). If the underlying data is split across multiple partitions, then every time you evaluate it, limit might be pulling from a different partition (e.g. if your data is stored across 10 Parquet files, the first limit call might pull from file 1, the second from file 7, and so on). A sketch of this is shown below.
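Continuing with the same df (the aggregate is just an easy way to compare two evaluations; it is not part of your code):

    from pyspark.sql import functions as F

    sample = df.limit(10000)                    # transformation only; nothing runs yet
    sum1 = sample.agg(F.sum("id")).first()[0]   # action 1: computes limit() now
    sum2 = sample.agg(F.sum("id")).first()[0]   # action 2: computes it again
    # sum1 and sum2 can differ, because each action may pull its 10,000 rows
    # from different partitions (or different Parquet part-files).
    stable = df.limit(10000).persist()          # persist() avoids the recompute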

Once the RDD is set, it does not re-sample. It's hard to give concrete feedback without seeing more of your code or data, but you can easily show that RDDs do not re-sample by going into the pyspark shell and running the following:

>>> d = [{'name': 'Alice', 'age': 1, 'pet': 'cat'}, {'name': 'Bob', 'age': 2, 'pet': 'dog'}]
>>> df = sqlContext.createDataFrame(d)
>>> rdd = df.limit(1).rdd

Now you can repeatedly print out the contents of the RDD with a small print function:

>>> def p(x):
...    print x
...

Your output will always contain the same value:

>>> rdd.foreach(p)
Row(age=1, name=u'Alice', pet=u'cat')
>>> rdd.foreach(p)
Row(age=1, name=u'Alice', pet=u'cat')
>>> rdd.foreach(p)
Row(age=1, name=u'Alice', pet=u'cat')

I would advise you to double-check your code or your data.
