Merge Spark output CSV files with a single header

Backend · Open · 6 answers · 819 views
天命终不由人 2021-01-01 11:40

I want to create a data processing pipeline in AWS to eventually use the processed data for Machine Learning.

I have a Scala script that takes raw data from S3, proc

6 Answers
  •  孤独总比滥情好
    2021-01-01 11:59

    Try specifying the schema of the header and reading all the files from the folder with spark-csv's DROPMALFORMED mode. This should let you read every file in the folder while keeping only the header rows (because the malformed rows are dropped). Example:

    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Schema matching only the header columns
    val headerSchema = List(
      StructField("example1", StringType, true),
      StructField("example2", StringType, true),
      StructField("example3", StringType, true)
    )

    val header_DF = sqlCtx.read
      .option("delimiter", ",")
      .option("header", "false")
      .option("mode", "DROPMALFORMED")
      .option("inferSchema", "false")
      .schema(StructType(headerSchema))
      .format("com.databricks.spark.csv")
      .load("folder containing the files")
    

    In header_DF you will have only the header rows; from this you can transform the DataFrame the way you need.
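
A note on the merging step itself: the snippet above only collects the header rows. As a minimal sketch of the deduplication logic when concatenating the text lines of Spark's part files (plain Scala, no Spark needed; `mergeWithSingleHeader` is a hypothetical helper, and it assumes every part file begins with the same header row):

```scala
// Merge the lines of several CSV part files, keeping the header line
// from the first part and dropping it from every subsequent part.
def mergeWithSingleHeader(parts: Seq[Seq[String]]): Seq[String] =
  parts match {
    case Seq()         => Seq.empty
    case first +: rest => first ++ rest.flatMap(_.drop(1))
  }

// Example: two part files, each starting with the header "id,value"
val merged = mergeWithSingleHeader(Seq(
  Seq("id,value", "1,a"),
  Seq("id,value", "2,b")
))
// merged == Seq("id,value", "1,a", "2,b")
```

In a real pipeline you would read each `part-*.csv` object from S3 into its line list before calling this, or simply write with `df.coalesce(1)` and `.option("header", "true")` so Spark emits a single file with one header in the first place.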
