Question
I have a spark dataframe like this:
word1  word2  co-occur
-----  -----  --------
w1     w2     10
w2     w1     15
w2     w3     11
And my expected result is:
word1  word2  co-occur
-----  -----  --------
w1     w2     25
w2     w3     11
I tried the DataFrame's groupBy and aggregate functions, but I couldn't come up with a solution.
Answer 1:
You need a single column containing both words in sorted order; this column can then be used for the groupBy. You can create such a column with an array containing word1 and word2 as follows:
import org.apache.spark.sql.functions.{array, sort_array, sum}

df.withColumn("words", sort_array(array($"word1", $"word2")))
  .groupBy("words")
  .agg(sum($"co-occur").as("co-occur"))
This would produce the following results:
words        co-occur
-----        --------
["w1","w2"]  25
["w2","w3"]  11
If you would like to have both words as separate DataFrame columns, use the getItem method afterwards. For the above example, add the following lines:
df.withColumn("word1", $"words".getItem(0))
.withColumn("word2", $"words".getItem(1))
.drop($"words")
The final DataFrame would look like this:
word1  word2  co-occur
-----  -----  --------
w1     w2     25
w2     w3     11
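Outside Spark, the same canonicalize-and-sum idea can be sketched in plain Python (no Spark needed; the input rows are hard-coded from the question, and `defaultdict` stands in for groupBy/sum):

```python
from collections import defaultdict

rows = [("w1", "w2", 10), ("w2", "w1", 15), ("w2", "w3", 11)]

# Sort each word pair so (w1, w2) and (w2, w1) map to the same key --
# the same idea as sort_array(array($"word1", $"word2")) -- then sum
# the co-occur counts per key, mirroring groupBy("words").agg(sum(...)).
totals = defaultdict(int)
for word1, word2, count in rows:
    key = tuple(sorted((word1, word2)))
    totals[key] += count

result = [(w1, w2, n) for (w1, w2), n in sorted(totals.items())]
print(result)  # [('w1', 'w2', 25), ('w2', 'w3', 11)]
```

This reproduces the expected output above, which is a quick way to sanity-check the aggregation logic before running it on a real Spark cluster.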
Source: https://stackoverflow.com/questions/51834354/sum-one-column-values-if-other-columns-are-matched