Speed up InMemoryFileIndex for Spark SQL job with large number of input files
I have an Apache Spark SQL job (using Datasets), coded in Java, that gets its input from between 70,000 and 150,000 files. It appears to take anywhere from 45 minutes to 1.5 hours to build the InMemoryFileIndex. There are no logs, very low network usage, and almost no CPU usage during this time. Here's a sample of what I see in the standard output:

24698 [main] INFO org.spark_project.jetty.server.handler.ContextHandler - Started o.s.j.s.ServletContextHandler@32ec9c90{/static/sql,null,AVAILABLE,
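For context on what can be tuned here: Spark's InMemoryFileIndex lists input files on the driver unless the number of paths crosses a threshold, at which point listing is distributed across the cluster. A sketch of the relevant (real) Spark SQL settings, passed as `spark-submit` configuration — the values shown are illustrative, not recommendations for this specific job:

```shell
# Distribute file listing as a Spark job once the number of input paths
# exceeds this threshold (default: 32). Lowering it forces parallel listing.
# The parallelism setting caps how many listing tasks run (default: 10000).
spark-submit \
  --conf spark.sql.sources.parallelPartitionDiscovery.threshold=32 \
  --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=100 \
  --class com.example.MyJob \        # hypothetical main class
  my-job.jar
```

Whether this helps depends on where the time goes: if listing is already distributed, the bottleneck may instead be the filesystem's per-file metadata calls (common on S3 and other object stores).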