Spark union fails with nested JSON dataframe

Submitted on 2019-12-05 11:14:53

If you try to union the two DataFrames, you will get this error:

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(StringType,true) <> ArrayType(StructType(StructField(d1,StringType,true), StructField(d2,StringType,true)),true) at the second column of the second table
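For context, the mismatch comes from schema inference: when a file's `details` array is empty, Spark has no elements to inspect and falls back to `array<string>`. A minimal sketch of how two separate reads end up union-incompatible (the file names and the `local[*]` session are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: reading each file on its own lets Spark infer
// a different element type for the same `details` column.
val spark = SparkSession.builder().master("local[*]").getOrCreate()

val df1 = spark.read.json("json1.json") // details inferred as array<struct<d1,d2>>
val df2 = spark.read.json("json2.json") // empty details inferred as array<string>

df1.printSchema() // compare the two inferred schemas
df2.printSchema()
df1.union(df2)    // throws the AnalysisException shown above
```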

Json files arrive at the same time

To solve this problem, if you can read both JSON files at the same time, I would suggest:

val jsonDf1 = spark.read.json("json1.json", "json2.json")

This will give this schema:

jsonDf1.printSchema
root
 |-- age: string (nullable = true)
 |-- details: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- d1: long (nullable = true)
 |    |    |-- d2: long (nullable = true)
 |-- name: string (nullable = true)

The data output:

jsonDf1.show(10,truncate = false)
+---+-------+------+
|age|details|name  |
+---+-------+------+
|32 |[[1,2]]|Agent1|
|42 |null   |Agent2|
+---+-------+------+

Json files arrive at different times

If your JSON files arrive at different times, as a default solution I would recommend reading a template JSON object with a fully populated array. That way your DataFrame, even with a possibly empty array, stays valid for any union. Then filter out this fake JSON before outputting the result:

val df = spark.read.json("jsonWithMaybeAnEmptyArray.json", 
"TemplateFakeJsonWithAFullArray.json")

df.filter($"name" =!= "FakeAgent").show(1)

Please note: a JIRA ticket has been opened to improve the capability to merge SQL data types (https://issues.apache.org/jira/browse/SPARK-19536), so this kind of operation should be possible in a future Spark version.

polomarcus's answer led me to this solution: I couldn't read all the files at once because I receive a list of file paths as input, and I thought Spark didn't have an API that accepts a list of paths. But with Scala's varargs syntax it's possible:

val files = List("path1", "path2", "path3")
val dataframe = spark.read.json(files: _*)

This way I got a single DataFrame containing all three files.
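As an alternative to the fake-template trick when files arrive at different times, you can pin the schema explicitly with `DataFrameReader.schema`, so an empty array is never mis-inferred. This is a sketch under the assumption that the schema printed above (`age`, `details`, `name`) is the target; it is a swapped-in technique, not the answer author's approach:

```scala
import org.apache.spark.sql.types._

// Build the target schema by hand so an empty `details` array is still
// read as array<struct<d1,d2>> instead of array<string>.
val detailsElement = StructType(Seq(
  StructField("d1", LongType, nullable = true),
  StructField("d2", LongType, nullable = true)))

val schema = StructType(Seq(
  StructField("age", StringType, nullable = true),
  StructField("details", ArrayType(detailsElement, containsNull = true), nullable = true),
  StructField("name", StringType, nullable = true)))

// With a fixed schema, files read at different times stay union-compatible.
val df1 = spark.read.schema(schema).json("json1.json")
val df2 = spark.read.schema(schema).json("json2.json")
val all = df1.union(df2)
```

This avoids both the template file and the cleanup filter, at the cost of maintaining the schema definition by hand.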
