I have written a fixed-width file import parser in Spark and run a few execution tests on various datasets. It works fine up to 1,000 columns, but as the number of columns and the overall fixed-width record length increase, Spark job performance degrades rapidly. It takes a very long time to execute with 20,000 columns and a record width of more than 100,000 characters.
What are the possible reasons for this? How can I improve the performance?
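For reference, a minimal sketch of the kind of per-column fixed-width parse described above (the layout, column names, path, and the withColumn/substring loop are illustrative assumptions, not the actual parser code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

val spark = SparkSession.builder().appName("fixed-width-parser").getOrCreate()

// Hypothetical layout: (columnName, 1-based start position, width); a real file would have ~20k entries
val layout = Seq(("a1", 1, 4), ("a2", 5, 3), ("a3", 8, 10))

// Each input line arrives as a single string column named "value"
val raw = spark.read.text("hdfs:///data/fixed_width.txt")

// One substring expression per field; the size of the projection grows with the number of columns
val parsed = layout.foldLeft(raw) { case (df, (name, start, width)) =>
  df.withColumn(name, substring(col("value"), start, width))
}

parsed.show(5, truncate = false)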
One similar issue I found:
If you have a large number of columns, it is better to read/convert the record as an array and use the slice function to map it to individual columns. Using substring to get individual columns will not be as efficient.
EDIT 1:
I used an Array[String] as an example, mapping it to a case class Record() in Scala. You can extend this to HDFS text files.
scala> case class Record(a1:String,a2:Int,a3:java.time.LocalDate)
defined class Record
scala> val x = sc.parallelize(Array("abcd1232018-01-01","defg4562018-02-01"))
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24
scala> val y = x.map( a => Record( a.slice(0,4), a.slice(4,4+3).toInt,java.time.LocalDate.parse(a.slice(7,7+10))))
y: org.apache.spark.rdd.RDD[Record] = MapPartitionsRDD[4] at map at <console>:27
scala> y.collect()
res3: Array[Record] = Array(Record(abcd,123,2018-01-01), Record(defg,456,2018-02-01))
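For a very wide layout like the one in the question, the same slice idea can be driven by a precomputed table of field widths instead of a hand-written case class. A rough sketch, continuing in the same spark-shell session and assuming the field widths are known up front (the widths array and sample records are placeholders):

// Hypothetical field widths; a real layout would have ~20k entries
val widths = Array(4, 3, 10)

// Precompute (start, end) offsets once so each record is split with simple slices
val offsets: Array[(Int, Int)] =
  widths.scanLeft(0)(_ + _).sliding(2).map { case Array(s, e) => (s, e) }.toArray

val raw = sc.parallelize(Array("abcd1232018-01-01", "defg4562018-02-01"))

// Slice every record into an Array[String]; per-field type conversion can be applied afterwards
val fields = raw.map(line => offsets.map { case (s, e) => line.slice(s, e) })

fields.collect().foreach(arr => println(arr.mkString("|")))
// abcd|123|2018-01-01
// defg|456|2018-02-01

Keeping each record as a single array and slicing it in one pass follows the recommendation above: the columns are produced from one traversal of the line rather than one substring call per column.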
Source: https://stackoverflow.com/questions/52343270/spark-job-execution-time-exponentially-increases-with-very-wide-dataset-and-numb