Spark dropDuplicates based on JSON array field

Submitted by 白昼怎懂夜的黑 on 2019-12-31 05:36:33

Question


I have JSON files with the following structure:

{"names":[{"name":"John","lastName":"Doe"},
{"name":"John","lastName":"Marcus"},
{"name":"David","lastName":"Luis"}
]}

I want to read several such JSON files and deduplicate the records based on the "name" field inside names. I tried

df.dropDuplicates(Array("names.name")) 

but it didn't do the magic.
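
For reference, this is likely why the direct call fails: Spark's JSON reader infers names as an array of structs, so names.name resolves to an array<string> rather than a scalar column that dropDuplicates can key on. A minimal sketch of the setup (the path and app name are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dedup-names").getOrCreate()

// Hypothetical input path; each file follows the structure shown above.
val df = spark.read.json("/data/names/*.json")

df.printSchema()
// names is inferred as array<struct<lastName: string, name: string>>,
// so "names.name" resolves to an array<string>, not a scalar column.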


Answer 1:


This seems to be a regression introduced in Spark 2.0. If you bring the nested column up to the top level, you can drop the duplicates: create a new column from the columns you want to dedup on, drop duplicates on that key, and finally drop the helper column. The following works for composite keys as well.

import org.apache.spark.sql.functions.{col, concat_ws}

// Build a single dedup key; concat_ws joins the key columns into one string,
// so this also handles composite keys.
val columns = Seq("names.name")
df.withColumn("DEDUP_KEY", concat_ws(",", columns.map(col): _*))
  .dropDuplicates("DEDUP_KEY")
  .drop("DEDUP_KEY")



Answer 2:


Just for future reference, the solution looks like:

import org.apache.spark.sql.functions.{col, explode}

// explode produces one row per element of the names array; duplicates
// then collapse on the exploded name value.
val uniqueNames = allNames
  .withColumn("DEDUP_NAME_KEY", explode(col("names.name")))
  .cache()
  .dropDuplicates("DEDUP_NAME_KEY")
  .drop("DEDUP_NAME_KEY")


Source: https://stackoverflow.com/questions/44505772/spark-dropduplicates-based-on-json-array-field
