I want to create a data processing pipeline in AWS so that I can eventually use the processed data for Machine Learning. I have a Scala script that takes raw data from S3 and processes it.
Try specifying the schema of the header and reading all files from the folder with spark-csv's DROPMALFORMED mode. This should let you read every file in the folder while keeping only the header rows, because the data rows get dropped as malformed. Example:
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// Schema for the header row: three string columns
val headerSchema = List(
  StructField("example1", StringType, true),
  StructField("example2", StringType, true),
  StructField("example3", StringType, true)
)

val header_DF = sqlCtx.read
  .option("delimiter", ",")
  .option("header", "false")
  .option("mode", "DROPMALFORMED") // drop rows that don't conform to the schema
  .option("inferSchema", "false")
  .schema(StructType(headerSchema))
  .format("com.databricks.spark.csv")
  .load("folder containing the files")
In header_DF you will have only the header rows; from this you can transform the DataFrame however you need.
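For instance, if you just want the header values as plain Scala collections, one possible follow-up (a minimal sketch; it assumes the header_DF and sqlCtx from the snippet above, and that several files may share the same header) is to collect the distinct header rows to the driver:

// Collect the distinct header rows and turn each Row into a list of column names.
// distinct() deduplicates identical headers coming from multiple files.
val headers: Array[Seq[String]] = header_DF
  .distinct()
  .collect()
  .map(row => row.toSeq.map(_.toString))

headers.foreach(h => println(h.mkString(", ")))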