Question
I have several different files in a directory, as below:
f1.txt
id FName Lname Adrress sex levelId
t1 Girish Hm 10oak m 1111
t2 Kiran Kumar 5wren m 2222
t3 sara chauhan 15nvi f 6666
f2.txt
t4 girish hm 11oak m 1111
t5 Kiran Kumar 5wren f 2222
t6 Prakash Jha 18nvi f 3333
f3.txt
t7 Kiran Kumar 5wren f 2222
t8 Girish Hm 10oak m 1111
t9 Prakash Jha 18nvi m 3333
f4.txt
t10 Kiran Kumar 5wren f 2222
t11 girish hm 10oak m 1111
t12 Prakash Jha 18nvi f 3333
Only the first name and last name are constant here, and case should be ignored; the other columns (Address, Sex, levelId) may change.
The data should first be grouped by fname and lname:
t1 Girish Hm 10oak m 1111
t4 girish hm 11oak m 1111
t8 Girish Hm 10oak m 1111
t11 girish hm 10oak m 1111
t2 Kiran Kumar 5wren m 2222
t5 Kiran Kumar 5wren f 2222
t7 Kiran Kumar 5wren f 2222
t10 Kiran Kumar 5wren f 2222
t3 sara chauhan 15nvi f 6666
t6 Prakash Jha 18nvi f 3333
t9 Prakash Jha 18nvi m 3333
t12 Prakash Jha 18nvi f 3333
Then we need to choose one appropriate record from each group, based on the frequency of the values in the columns Address, Sex and levelId.
Example: for the person Girish Hm,
10oak has the maximum frequency for Address,
m has the maximum frequency for Sex,
1111 has the maximum frequency for levelId.
So the record with id t1 is the correct one (when several records qualify, the first appropriate record in the group is chosen).
Final output should be:
t1 Girish Hm 10oak m 1111
t5 Kiran Kumar 5wren f 2222
t3 sara chauhan 15nvi f 6666
t6 Prakash Jha 18nvi f 3333
Answer 1:
Scala solution:
First, define the columns of interest:
val cols = Array("Adrress", "sex", "levelId")
Then, for each column of interest, add an array column holding the column's frequency together with its value:
df.select(
  col("*") +: cols.map(
    x => array(
      count(x).over(
        Window.partitionBy(
          lower(col("FName")),
          lower(col("LName")),
          col(x)
        )
      ),
      col(x)
    ).alias(x ++ "_freq")
  ): _*
)
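For Girish Hm, for example, Adrress_freq becomes [3, 10oak] on rows t1, t8 and t11 (10oak occurs three times within that name group) and [1, 11oak] on row t4, so the row carrying the most frequent value also carries the largest array.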
Then group by each person and aggregate each array column to its maximum, which carries the highest-frequency value (the dummy count is only there because agg requires a first Column argument before the varargs list):
.groupBy(
  lower(col("FName")).alias("FName"),
  lower(col("LName")).alias("LName"))
.agg(
  count($"*").alias("dummy"),
  cols.map(
    x => max(col(x ++ "_freq"))(1).alias(x)
  ): _*
)
.drop("dummy")
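This works because Spark orders arrays element-wise: max keeps the array with the largest count, and the (1) index then extracts the value that achieved it. A minimal sketch of just this step, assuming a spark-shell session (the demo columns freq and value are invented for illustration):
import org.apache.spark.sql.functions._
import spark.implicits._
// two [frequency, value] pairs; max compares the arrays element-wise,
// so [3, 10oak] wins and the (1) index pulls out "10oak"
val demo = Seq((3L, "10oak"), (1L, "11oak")).toDF("freq", "value")
demo.agg(max(array($"freq", $"value"))(1).alias("mode")).show
+-----+
| mode|
+-----+
|10oak|
+-----+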
Overall code:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // for the $"..." column syntax

val cols = Array("Adrress", "sex", "levelId")

// space-delimited input with a header row
val df = spark.read.option("header", "true").option("delimiter", " ").option("inferSchema", "true").csv("names.txt")

val df2 = (df
  // attach a [frequency, value] array for each column of interest
  .select(col("*") +: cols.map(x => array(count(x).over(Window.partitionBy(lower(col("FName")), lower(col("LName")), col(x))), col(x)).alias(x ++ "_freq")): _*)
  // group case-insensitively by first and last name
  .groupBy(lower(col("FName")).alias("FName"), lower(col("LName")).alias("LName"))
  // for each column, keep the value with the highest frequency
  .agg(count($"*").alias("dummy"), cols.map(x => max(col(x ++ "_freq"))(1).alias(x)): _*)
  .drop("dummy"))
df2.show
+-------+-------+-------+---+-------+
| FName| LName|Adrress|sex|levelId|
+-------+-------+-------+---+-------+
| sara|chauhan| 15nvi| f| 6666|
|prakash| jha| 18nvi| f| 3333|
| girish| hm| 10oak| m| 1111|
| kiran| kumar| 5wren| f| 2222|
+-------+-------+-------+---+-------+
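One caveat with the array approach: array coerces its elements to a common type, so for string columns such as Adrress the count is cast to string, and counts of 10 or more would compare lexicographically ("10" < "9"). That cannot happen with this small sample, but for larger groups a struct keeps the count numeric. A sketch of that variant, under the same setup as above:
val df3 = (df
  .select(col("*") +: cols.map(x => struct(count(x).over(Window.partitionBy(lower(col("FName")), lower(col("LName")), col(x))).alias("cnt"), col(x).alias("v")).alias(x ++ "_freq")): _*)
  .groupBy(lower(col("FName")).alias("FName"), lower(col("LName")).alias("LName"))
  .agg(count($"*").alias("dummy"), cols.map(x => max(col(x ++ "_freq"))("v").alias(x)): _*)
  .drop("dummy"))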
Source: https://stackoverflow.com/questions/65102491/select-best-record-possible