Question
I have a Spark application which reads multiple S3 files and does certain transformations. This is how I am reading the files:
# Read the CSV that lists one S3 path per row
input_df_s3_path = spark.read.csv("s3a://bucket1/s3_path.csv")
# Collect the first column into a Python list of paths
s3_path_list = input_df_s3_path.select('_c0').rdd.map(lambda row: row[0]).collect()
# Read every Parquet file in the list and drop rows containing nulls
input_df = spark.read.option("mergeSchema", "false").parquet(*s3_path_list).na.drop()
So I create a DataFrame from a CSV that contains all the S3 paths, convert those paths into a list, and pass that list to read.parquet. I have almost 50k files to read.
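For what it's worth, the same path list can be built without dropping to the RDD API; a minimal sketch of an equivalent DataFrame-only version, using the same placeholder names as above:

# DataFrame.collect() already returns a list of Row objects, so the
# RDD conversion can be skipped; row["_c0"] is the CSV's first column.
s3_path_list = [row["_c0"] for row in input_df_s3_path.select('_c0').collect()]

This is only a readability tweak; both versions still pull all ~50k paths down to the driver.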
In the application log I am seeing something abnormal: there is an almost 15-minute delay before the "Listing leaf files and directories" step starts.
20/09/09 05:56:34 INFO BlockManagerInfo: Removed broadcast_0_piece0 on ip-10-33-89-205.ec2.internal:37391 in memory (size: 26.3 KB, free: 1643.2 MB)
20/09/09 06:11:06 INFO InMemoryFileIndex: Listing leaf files and directories in parallel under: s3a://bucketname/.....
Can anyone help me understand why there is a 15-minute delay, and is there a more efficient way to read these files?
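For context (an assumption on my part, not something verified against this job): Spark's InMemoryFileIndex switches to cluster-side parallel listing once the number of input paths exceeds spark.sql.sources.parallelPartitionDiscovery.threshold, and the log line above shows that this parallel phase did eventually start. A minimal sketch of the two related settings, shown with their default values:

# Hedged sketch of the listing-related knobs; values shown are the defaults.
# threshold: number of paths above which listing runs as a distributed Spark job
# parallelism: maximum number of tasks used for that listing job
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")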
Source: https://stackoverflow.com/questions/63808198/why-listing-leaf-files-and-directories-is-taking-too-much-time-to-start-in-pyspa