Question
I have a spark dataframe like this:
word1  word2  co-occur
-----  -----  --------
w1     w2     10
w2     w1     15
w2     w3     11
And my expected result is:
word1  word2  co-occur
-----  -----  --------
w1     w2     25
w2     w3     11
I tried the DataFrame's groupBy and aggregate functions, but I couldn't come up with a solution.
Answer 1:
You need a single column containing both words in sorted order; this column can then be used for the groupBy. You can create such a column with an array containing word1 and word2 as follows:
import org.apache.spark.sql.functions.{array, sort_array, sum}

df.withColumn("words", sort_array(array($"word1", $"word2")))
  .groupBy("words")
  .agg(sum($"co-occur").as("co-occur"))
This would produce the following results:
words        co-occur
-----        --------
["w1","w2"]  25
["w2","w3"]  11
If you would like to have both words as separate DataFrame columns, use the getItem method afterwards. For the above example, add the following lines:
df.withColumn("word1", $"words".getItem(0))
.withColumn("word2", $"words".getItem(1))
.drop($"words")
The final DataFrame would look like this:
word1  word2  co-occur
-----  -----  --------
w1     w2     25
w2     w3     11
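Outside Spark, the same canonicalize-and-sum idea can be sketched in plain Python (no Spark needed; the input rows are hard-coded from the question, and `defaultdict` stands in for groupBy/sum):

```python
from collections import defaultdict

rows = [("w1", "w2", 10), ("w2", "w1", 15), ("w2", "w3", 11)]

# Sort each word pair so (w1, w2) and (w2, w1) map to the same key --
# the same idea as sort_array(array($"word1", $"word2")) -- then sum
# the co-occur counts per key, mirroring groupBy("words").agg(sum(...)).
totals = defaultdict(int)
for word1, word2, count in rows:
    key = tuple(sorted((word1, word2)))
    totals[key] += count

result = [(w1, w2, n) for (w1, w2), n in sorted(totals.items())]
print(result)  # [('w1', 'w2', 25), ('w2', 'w3', 11)]
```

This reproduces the expected output above, which is a quick way to sanity-check the aggregation logic before running it on a real Spark cluster.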
Source: https://stackoverflow.com/questions/51834354/sum-one-column-values-if-other-columns-are-matched