Spark job execution time exponentially increases with very wide dataset and number of columns [duplicate]

Submitted by 末鹿安然 on 2019-11-30 20:26:53

Question


I have created a fixed-width file import parser in Spark and run a few execution tests on various datasets. It works fine up to about 1,000 columns, but as the number of columns and the fixed-width record length increase, Spark job performance degrades rapidly. Execution takes a very long time with 20k columns and a fixed-width length of more than 100 thousand.

What are the possible reasons for this? How can I improve the performance?

One of the similar issues I found:

http://apache-spark-developers-list.1001551.n3.nabble.com/Performance-Spark-DataFrame-is-slow-with-wide-data-Polynomial-complexity-on-the-number-of-columns-is-td24635.html


Answer 1:


If you have a large number of columns, it is better to read/convert the record as an array and use the slice function to map it to the individual columns. Using substring to extract each column individually will not be as efficient.
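For contrast, here is a minimal, hypothetical sketch of the substring-per-column pattern (the file path, column layout, and object name are illustrative assumptions, not the asker's actual code). Every column gets its own withColumn/substring expression, so the logical plan that Catalyst has to analyze grows with the column count, which is where very wide schemas tend to get expensive.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

object FixedWidthSubstringParser {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fixed-width-substring").master("local[*]").getOrCreate()

    // Hypothetical layout: (columnName, 1-based start, length); a real file could have 20k entries.
    val layout = Seq(("a1", 1, 4), ("a2", 5, 3), ("a3", 8, 10))

    // One record per line; "records.txt" is an assumed example path.
    val raw = spark.read.text("records.txt")

    // One withColumn + substring expression per column: the plan grows with every column added.
    val parsed = layout.foldLeft(raw) { case (df, (name, start, len)) =>
      df.withColumn(name, substring(col("value"), start, len))
    }.drop("value")

    parsed.show()
  }
}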

EDIT 1:

I used an Array[String] as an example by attaching it to a case class Record() in Scala. You can extend this to HDFS text files.

scala> case class Record(a1:String,a2:Int,a3:java.time.LocalDate)
defined class Record

scala>  val x = sc.parallelize(Array("abcd1232018-01-01","defg4562018-02-01"))
x: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[3] at parallelize at <console>:24

scala> val y = x.map( a => Record( a.slice(0,4), a.slice(4,4+3).toInt,java.time.LocalDate.parse(a.slice(7,7+10))))
y: org.apache.spark.rdd.RDD[Record] = MapPartitionsRDD[4] at map at <console>:27

scala> y.collect()
res3: Array[Record] = Array(Record(abcd,123,2018-01-01), Record(defg,456,2018-02-01))

scala>
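To extend this to HDFS text files with many columns, a minimal sketch along the same lines might look like the following (the path, column names, and widths are placeholder assumptions). Each record is sliced into all of its columns in a single pass inside one map, and the schema is built once on the driver, so no per-column expression needs to be generated.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object FixedWidthSliceParser {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("fixed-width-slice").master("local[*]").getOrCreate()

    // Hypothetical layout: (columnName, width); a real layout could list 20k columns.
    val layout = Seq(("a1", 4), ("a2", 3), ("a3", 10))

    // Pre-compute each column's start offset once, on the driver.
    val offsets = layout.scanLeft(0)((acc, col) => acc + col._2).init
    val schema = StructType(layout.map { case (name, _) => StructField(name, StringType) })

    // Slice each record into all of its columns in one pass per row.
    val rows = spark.sparkContext.textFile("hdfs:///data/records.txt").map { line =>
      Row.fromSeq(layout.zip(offsets).map { case ((_, w), start) => line.slice(start, start + w) })
    }

    spark.createDataFrame(rows, schema).show()
  }
}

Keeping every field as StringType is a simplification; per-column type conversion can be done inside the same map, as in the Record example above.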


Source: https://stackoverflow.com/questions/52343270/spark-job-execution-time-exponentially-increases-with-very-wide-dataset-and-numb
