Spark job execution time exponentially increases with very wide dataset and number of columns [duplicate]

Submitted by 末鹿安然 on 2019-11-30 20:26:53

Question


I have created a fixed-width file import parser in Spark and run a few execution tests on various datasets. It works fine up to about 1,000 columns, but as the number of columns and the fixed-width record length increase, Spark job performance degrades rapidly. Execution takes a very long time with 20k columns and a fixed-width length of more than 100 thousand.

What are the possible reasons for this? How can I improve the performance?

One of the similar issues I found:

http://apache-spark-developers-list.1001551.n3.nabble.com/Performance-Spark-DataFrame-is-slow-with-wide-data-Polynomial-complexity-on-the-number-of-columns-is-td24635.html


Answer 1:


If you have a large number of columns, it is better to read/convert the record as an array and use the slice function to map it to the individual columns. Using substring to extract each column individually will not be as efficient.
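For contrast, here is a minimal, hypothetical sketch of the substring-per-column pattern (the file path, column layout, and object name are illustrative assumptions, not the asker's actual code). Every column gets its own withColumn/substring expression, so the logical plan that Catalyst has to analyze grows with the column count, which is where very wide schemas tend to get expensive.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

object FixedWidthSubstringParser {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fixed-width-substring").master("local[*]").getOrCreate()

    // Hypothetical layout: (columnName, 1-based start, length); a real file could have 20k entries.
    val layout = Seq(("a1", 1, 4), ("a2", 5, 3), ("a3", 8, 10))

    // One record per line; "records.txt" is an assumed example path.
    val raw = spark.read.text("records.txt")

    // One withColumn + substring expression per column: the plan grows with every column added.
    val parsed = layout.foldLeft(raw) { case (df, (name, start, len)) =>
      df.withColumn(name, substring(col("value"), start, len))
    }.drop("value")

    parsed.show()
  }
}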

EDIT 1:

I used an Array[String] as an example by attaching it to a case class Record() in Scala. You can extend this to HDFS text files.

scala> case class Record(a1:String,a2:Int,a3:java.time.LocalDate)
defined class Record

scala>  val x = sc.parallelize(Array("abcd1232018-01-01","defg4562018-02-01"))
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> val y = x.map( a => Record( a.slice(0,4), a.slice(4,4+3).toInt,java.time.LocalDate.parse(a.slice(7,7+10))))
y: org.apache.spark.rdd.RDD[Record] = MapPartitionsRDD[4] at map at <console>:27

scala> y.collect()
res3: Array[Record] = Array(Record(abcd,123,2018-01-01), Record(defg,456,2018-02-01))

scala>
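To extend this to HDFS text files with many columns, a minimal sketch along the same lines might look like the following (the path, column names, and widths are placeholder assumptions). Each record is sliced into all of its columns in a single pass inside one map, and the schema is built once on the driver, so no per-column expression needs to be generated.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object FixedWidthSliceParser {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fixed-width-slice").master("local[*]").getOrCreate()

    // Hypothetical layout: (columnName, width); a real layout could list 20k columns.
    val layout = Seq(("a1", 4), ("a2", 3), ("a3", 10))

    // Pre-compute each column's start offset once, on the driver.
    val offsets = layout.scanLeft(0)((acc, col) => acc + col._2).init
    val schema = StructType(layout.map { case (name, _) => StructField(name, StringType) })

    // Slice each record into all of its columns in one pass per row.
    val rows = spark.sparkContext.textFile("hdfs:///data/records.txt").map { line =>
      Row.fromSeq(layout.zip(offsets).map { case ((_, w), start) => line.slice(start, start + w) })
    }

    spark.createDataFrame(rows, schema).show()
  }
}

Keeping every field as StringType is a simplification; per-column type conversion can be done inside the same map, as in the Record example above.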


Source: https://stackoverflow.com/questions/52343270/spark-job-execution-time-exponentially-increases-with-very-wide-dataset-and-numb
