Pyspark: count on pyspark.sql.dataframe.DataFrame takes long time


Question


I have a pyspark.sql.dataframe.DataFrame like the following

df.show()
+--------------------+----+----+---------+----------+---------+----------+---------+
|                  ID|Code|bool|      lat|       lon|       v1|        v2|       v3|
+--------------------+----+----+---------+----------+---------+----------+---------+
|5ac52674ffff34c98...|IDFA|   1|42.377167| -71.06994|17.422535|1525319638|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37747|-71.069824|17.683573|1525319639|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37757| -71.06942|22.287935|1525319640|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37761| -71.06943|19.110023|1525319641|36.853622|
|5ac52674ffff34c98...|IDFA|   1|42.377243| -71.06952|18.904774|1525319642|36.853622|
|5ac52674ffff34c98...|IDFA|   1|42.378254| -71.06948|20.772903|1525319643|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37801| -71.06983|18.084948|1525319644|36.853622|
|5ac52674ffff34c98...|IDFA|   1|42.378693| -71.07033| 15.64326|1525319645|36.853622|
|5ac52674ffff34c98...|IDFA|   1|42.378723|-71.070335|21.093477|1525319646|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37868| -71.07034|21.851894|1525319647|36.853622|
|5ac52674ffff34c98...|IDFA|   1|42.378716| -71.07029|20.583202|1525319648|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37872| -71.07067|19.738768|1525319649|36.853622|
|5ac52674ffff34c98...|IDFA|   1|42.379112| -71.07097|20.480911|1525319650|36.853622|
|5ac52674ffff34c98...|IDFA|   1| 42.37952|  -71.0708|20.526752|1525319651| 44.93808|
|5ac52674ffff34c98...|IDFA|   1| 42.37902| -71.07056|20.534052|1525319652| 44.93808|
|5ac52674ffff34c98...|IDFA|   1|42.380203|  -71.0709|19.921381|1525319653| 44.93808|
|5ac52674ffff34c98...|IDFA|   1| 42.37968|-71.071144| 20.12599|1525319654| 44.93808|
|5ac52674ffff34c98...|IDFA|   1|42.379696| -71.07114|18.760069|1525319655| 36.77853|
|5ac52674ffff34c98...|IDFA|   1| 42.38011| -71.07123|19.155525|1525319656| 36.77853|
|5ac52674ffff34c98...|IDFA|   1| 42.38022|  -71.0712|16.978994|1525319657| 36.77853|
+--------------------+----+----+---------+----------+---------+----------+---------+
only showing top 20 rows

If I try to count:

%%time
df.count()

CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 28.1 s

30241272

Now if I take a subset of df, the count takes even longer.

id0 = df.first().ID  ## First ID
tmp = df.filter( (df['ID'] == id0) )

%%time
tmp.count()

CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 1min 33s
Out[6]:
3299

Answer 1:


Your question is very interesting and tricky.

I tested with a large dataset in order to reproduce your behavior.

Problem Description

I tested the following two cases on a large dataset:

# Case 1
df.count()  # Execution time: 37 secs

# Case 2
df.filter((df['ID'] == id0)).count()  # Execution time: 1.39 min

Explanation

Let's see the physical plan with only .count():

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#38L])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#41L])
      +- *(1) FileScan csv [] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:...], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>

Let's see the physical plan with .filter() followed by .count():

== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)], output=[count#61L])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#64L])
      +- *(1) Project
         +- *(1) Filter (isnotnull(ID#11) && (ID#11 = Muhammed MacIntyre))
            +- *(1) FileScan csv [ID#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:...], PartitionFilters: [], PushedFilters: [IsNotNull(ID), EqualTo(ID,Muhammed MacIntyre)], ReadSchema: struct<_c1:string>
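
If you want to inspect plans like these on your own data, a minimal sketch is shown below (the CSV path is hypothetical; replace it with your own source). Since df.count() returns a plain integer, the trick is to build the same global aggregation as a DataFrame and call .explain() on that:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-plans").getOrCreate()

# Hypothetical path -- replace with your own data source.
df = spark.read.csv("file:/path/to/data.csv", header=True)

# Same global count aggregation that df.count() executes, kept as a
# DataFrame so we can ask for its physical plan.
df.groupBy().count().explain()

# The filtered variant: filter first, then the same global count.
id0 = df.first().ID
df.filter(df['ID'] == id0).groupBy().count().explain()

An empty groupBy() followed by count() produces the same HashAggregate-over-Exchange shape shown above, which is why it is a convenient stand-in for df.count() when looking at plans.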

Generally, when Spark counts the number of rows it maps each row to count=1 and then reduces all the mapper outputs to produce the final row count.

In Case 2, Spark first has to filter, then create the partial counts for every partition, and then run another stage to sum them up. So, for the same rows, in the second case Spark also performs the filtering, which adds to the computation time on large datasets. Spark is a framework for distributed processing and doesn't have indexes like Pandas, which could do the filtering extremely fast without scanning all the rows.

Summary

In this simple case you can't do much to improve the execution time. You can try running your application with different configuration settings (e.g. spark.sql.shuffle.partitions, spark.default.parallelism, the number of executors, executor memory, etc.).
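
For example, a minimal sketch of passing such settings when building the session (the values here are illustrative assumptions, not tuned recommendations):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("count-tuning")
    .config("spark.sql.shuffle.partitions", "200")  # partitions used for shuffles in SQL/DataFrame ops
    .config("spark.default.parallelism", "200")     # default parallelism for RDD operations
    .config("spark.executor.instances", "4")        # number of executors (cluster-manager dependent)
    .config("spark.executor.memory", "4g")          # memory per executor
    .getOrCreate()
)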




Answer 2:


This is because Spark is lazily evaluated. When you call tmp.count(), that is an action step. In other words, your timing of tmp.count() also includes the filter time. If you want to truly compare the two counts, try the following:

%%time
df.count()

id0 = df.first().ID  ## First ID
tmp = df.filter( (df['ID'] == id0) )
tmp.persist().show()

%%time
tmp.count()

The important component here is the tmp.persist().show() BEFORE performing the count. This performs the filter and caches the result. That way, the tmp.count() only includes the actual count time.
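
Outside a notebook, where the %%time magic isn't available, a minimal sketch of the same comparison with plain Python timing (assuming df is already loaded) might look like this:

import time

start = time.time()
total = df.count()
print("full count: %d rows in %.1fs" % (total, time.time() - start))

id0 = df.first().ID  # First ID
tmp = df.filter(df['ID'] == id0)
tmp.persist().show()  # runs the filter and caches (at least the partitions show() touches)

start = time.time()
subset = tmp.count()
print("filtered count: %d rows in %.1fs" % (subset, time.time() - start))

tmp.unpersist()  # release the cached data when finished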



Source: https://stackoverflow.com/questions/60008180/pyspark-count-on-pyspark-sql-dataframe-dataframe-takes-long-time
