Why does df.limit keep changing in PySpark?

自闭症患者 2020-12-16 16:54

I'm creating a data sample from some dataframe df with

rdd = df.limit(10000).rdd

This operation takes quite some time (why, actually?), and the rows in the sample change every time I access the resulting RDD.

2 Answers
  • 2020-12-16 17:44

    Spark is lazy, so each action you take recalculates the data returned by limit(). If the underlying data is split across multiple partitions, then every time you evaluate it, limit might be pulling from a different partition (e.g., if your data is stored across 10 Parquet files, the first limit call might pull from file 1, the second from file 7, and so on).
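    A minimal sketch of how that recomputation surfaces (not from the original question; the "events.parquet" path is a placeholder and the data is assumed to span several files/partitions):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()

        # Hypothetical input: a DataFrame backed by several Parquet files.
        df = spark.read.parquet("events.parquet")

        sample_rdd = df.limit(10000).rdd

        # Each action launches a new job, so limit() is re-evaluated from
        # scratch and may draw its 10,000 rows from different partitions.
        first_pass = sample_rdd.take(5)
        second_pass = sample_rdd.take(5)   # not guaranteed to match first_pass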

  • 2020-12-16 17:46

    Because Spark is distributed, in general it's not safe to assume deterministic results. Your example is taking the "first" 10,000 rows of a DataFrame. Here, there's ambiguity (and hence non-determinism) in what "first" means. That will depend on the internals of Spark. For example, it could be the first partition that responds to the driver. That partition could change with networking, data locality, etc.

    Even once you cache the data, I still wouldn't rely on getting the same data back every time, though I certainly would expect it to be more consistent than reading from disk.
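    If you need the sample to stay fixed, a few standard Spark patterns help; this is a sketch of common options rather than something from the answer above, and the "id" column and file path are assumptions:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()
        df = spark.read.parquet("events.parquet")   # placeholder path

        # Option 1: cache the limited DataFrame and materialize it once, so
        # later actions usually reuse the cached rows (still not a hard guarantee).
        sample_df = df.limit(10000).cache()
        sample_df.count()                 # forces evaluation and fills the cache

        # Option 2: impose an ordering before limit(), so "first 10,000" is
        # well defined; assumes df has a sortable "id" column.
        stable_df = df.orderBy("id").limit(10000)

        # Option 3: collect the sample to the driver; a local list cannot change.
        local_rows = df.limit(10000).collect()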
