Spark dropDuplicates based on JSON array field

长发绾君心 2021-01-26 05:07

I have JSON files with the following structure:

{"names":[{"name":"John","lastName":"Doe"},
{"name":"John","lastName":"Marcus"},
{"name":"Davi


        
2 answers
  • 2021-01-26 05:54

    This seems to be a regression introduced in Spark 2.0. If you bring the nested column up to the top level, you can drop the duplicates: create a new column from the columns you want to deduplicate on, call dropDuplicates on that column, and finally drop it. The following approach works for composite keys as well.

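    import org.apache.spark.sql.functions.{col, concat_ws}

    // Nested columns to deduplicate on; add more entries for a composite key
    val columns = Seq("names.name")
    // concat_ws expects Column arguments, so map the column names with col()
    df.withColumn("DEDUP_KEY", concat_ws(",", columns.map(col): _*))
      .dropDuplicates("DEDUP_KEY")
      .drop("DEDUP_KEY")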
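
To try the approach above end to end, here is a minimal sketch that builds a small DataFrame and applies the temporary-key trick. The sample records, the local session settings, and the names `spark`, `json`, and `deduped` are assumptions made for illustration; they are not taken from the question. It also assumes Spark 2.2 or later, where `spark.read.json` accepts a `Dataset[String]`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws}

// Local session for experimentation only (assumption: Spark is on the classpath)
val spark = SparkSession.builder().master("local[*]").appName("dedup-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample records, one JSON document per element
val json = Seq(
  """{"names":[{"name":"John","lastName":"Doe"},{"name":"John","lastName":"Marcus"}]}""",
  """{"names":[{"name":"John","lastName":"Smith"},{"name":"John","lastName":"Brown"}]}""",
  """{"names":[{"name":"Anna","lastName":"Lee"}]}"""
).toDS()
val df = spark.read.json(json)

// "names.name" resolves to an array<string>; concat_ws joins it into one top-level string key
val columns = Seq("names.name")
val deduped = df
  .withColumn("DEDUP_KEY", concat_ws(",", columns.map(col): _*))
  .dropDuplicates("DEDUP_KEY")
  .drop("DEDUP_KEY")

deduped.show(false)
```

With this sample data, the first two records share the key `John,John`, so only one of them should survive alongside the `Anna` record. Which of the two duplicates is kept is not guaranteed.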
    
  • 2021-01-26 06:12

    Just for future reference, the solution looks like this:

    import org.apache.spark.sql.functions.{col, explode}

    // Explode the nested array so each name gets its own row, then dedup on that key
    val uniqueNames = allNames
      .withColumn("DEDUP_NAME_KEY", explode(col("names.name")))
      .cache()
      .dropDuplicates(Array("DEDUP_NAME_KEY"))
      .drop("DEDUP_NAME_KEY")
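
Note that exploding deduplicates on each individual name, whereas a key built with `concat_ws` compares the whole array at once. The sketch below illustrates this on made-up data; the sample records, the session settings, and the intermediate name `exploded` are assumptions for illustration only, and it assumes Spark 2.2 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Local session for experimentation only (assumption: Spark is on the classpath)
val spark = SparkSession.builder().master("local[*]").appName("explode-dedup-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample records: the name arrays are [John, John], [John, Anna], [Anna]
val allNames = spark.read.json(Seq(
  """{"names":[{"name":"John","lastName":"Doe"},{"name":"John","lastName":"Marcus"}]}""",
  """{"names":[{"name":"John","lastName":"Smith"},{"name":"Anna","lastName":"Lee"}]}""",
  """{"names":[{"name":"Anna","lastName":"Brown"}]}"""
).toDS())

// One row per array element, each carrying the original record plus a single name as the key
val exploded = allNames.withColumn("DEDUP_NAME_KEY", explode(col("names.name")))

// Keep one row per distinct name, then remove the helper column
val uniqueNames = exploded
  .dropDuplicates(Array("DEDUP_NAME_KEY"))
  .drop("DEDUP_NAME_KEY")

uniqueNames.show(false)
```

Here the three records explode into five rows, and deduplication should leave one row per distinct name (`John` and `Anna`). Each surviving row still holds the full `names` array of whichever source record Spark happened to keep.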
    