Spark fixed-width file import: large number of columns causing high execution time

Posted by 冷暖自知 on 2019-12-17 21:36:33

Question


I receive a fixed-width .txt source file from which I need to extract 20K columns. Because there is a lack of libraries for processing fixed-width files with Spark, I have written my own code to extract the fields from fixed-width text files.

The code reads the text file as an RDD with

sparkContext.textFile("abc.txt") 

It then reads a JSON schema to get the name and width of each column.

  • A function takes each fixed-length line and, using each column's start and end positions, calls substring to build an Array of field values.

  • The function is mapped over the RDD.

  • The resulting RDD is converted to a DataFrame, the column names are applied, and the result is written to Parquet.
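
The per-line extraction step in the list above can be sketched in isolation, without Spark, as a pure function. This is only a minimal illustration under assumptions: the names `FixedWidthSketch`, `offsets`, and `slice` are hypothetical and do not come from the job, and clamping positions to the line length (so that short lines yield truncated or empty fields instead of an exception) is an assumed choice, not something the original code does. It computes the cumulative start positions once with `scanLeft` rather than tracking a running position per line; it is not claimed to resolve the 20K-column cost.

```scala
object FixedWidthSketch {
  // Cumulative start offsets: widths (3, 2, 4) give offsets (0, 3, 5, 9).
  def offsets(widths: Seq[Int]): Array[Int] =
    widths.scanLeft(0)(_ + _).toArray

  // Slice one line into fields using precomputed offsets.
  // Assumption: positions are clamped to the line length, so a short line
  // produces truncated or empty trailing fields instead of throwing.
  def slice(line: String, offs: Array[Int]): Array[String] = {
    val out = new Array[String](offs.length - 1)
    var k = 0
    while (k < out.length) {
      val start = math.min(offs(k), line.length)
      val end = math.min(offs(k + 1), line.length)
      out(k) = line.substring(start, end)
      k += 1
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val offs = offsets(Seq(3, 2, 4))
    println(slice("abcde1234", offs).mkString("|")) // abc|de|1234
    println(slice("abcd", offs).mkString("|"))      // abc|d|
  }
}
```

In a Spark job, the offsets array would be built once on the driver from the schema widths and then used inside the function passed to `map`.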

Representative code:

val rdd1 = spark.sparkContext.textFile("file1")

// Split one fixed-width line into fields, given the width of each column.
def SubstrSting(line: String, colLength: Seq[Int]): Seq[String] = {
  var now = 0 // start position of the current field
  val collector = new Array[String](colLength.length)
  for (k <- 0 until colLength.length) {
    collector(k) = line.substring(now, now + colLength(k))
    now = now + colLength(k)
  }
  collector.toSeq
}

// ColLengthSeq holds the column widths, read from a separate schema file.
val StringArray = rdd1.map(SubstrSting(_, ColLengthSeq))



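import spark.implicits._ // needed for toDF and the $"..." column syntax

// Expand the single array column into ColCount named columns, then write Parquet.
StringArray.toDF("StringCol")
  .select((0 until ColCount).map(j => $"StringCol"(j) as column_seq(j)): _*)
  .write.mode("overwrite").parquet("c:\\home\\")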

This code works fine for files with a small number of columns, but it takes a lot of time and resources with 20K columns. As the number of columns increases, so does the execution time.

Has anyone faced this issue with a large number of columns? I need suggestions on performance tuning: how can I tune this job or code?

Source: https://stackoverflow.com/questions/52293806/spark-fixed-width-file-import-large-number-of-columns-causing-high-execution-tim
