Question
I have several different files in a directory, as below:
f1.txt
id FName Lname Adrress sex levelId
t1 Girish Hm 10oak m 1111
t2 Kiran Kumar 5wren m 2222
t3 sara chauhan 15nvi f 6666
f2.txt
t4 girish hm 11oak m 1111
t5 Kiran Kumar 5wren f 2222
t6 Prakash Jha 18nvi f 3333
f3.txt
t7 Kiran Kumar 5wren f 2222
t8 Girish Hm 10oak m 1111
t9 Prakash Jha 18nvi m 3333
f4.txt
t10 Kiran Kumar 5wren f 2222
t11 girish hm 10oak m 1111
t12 Prakash Jha 18nvi f 3333
Only the first name and last name are constant here, and case should be ignored; the other columns (Address, Sex, levelId) may change.
The data should first be grouped by fname and lname:
t1 Girish Hm 10oak m 1111
t4 girish hm 11oak m 1111
t8 Girish Hm 10oak m 1111
t11 girish hm 10oak m 1111
t2 Kiran Kumar 5wren m 2222
t5 Kiran Kumar 5wren f 2222
t7 Kiran Kumar 5wren f 2222
t10 Kiran Kumar 5wren f 2222
t3 sara chauhan 15nvi f 6666
t6 Prakash Jha 18nvi f 3333
t9 Prakash Jha 18nvi m 3333
t12 Prakash Jha 18nvi f 3333
Then we need to choose one appropriate record from each group, based on the frequency of the values in the columns Address, Sex and levelId.
Example: for the person Girish Hm,
10oak has the maximum frequency for Address,
m has the maximum frequency for Sex,
1111 has the maximum frequency for levelId.
So the record with id t1 is the correct one (when several records qualify, the first appropriate record in the group is chosen).
Final output should be:
t1 Girish Hm 10oak m 1111
t5 Kiran Kumar 5wren f 2222
t3 sara chauhan 15nvi f 6666
t6 Prakash Jha 18nvi f 3333
Answer 1:
Scala solution:
First, define the columns of interest:
val cols = Array("Adrress", "sex", "levelId")
Then, for each column of interest, add an array column holding the column's frequency together with its value:
df.select(
  col("*") +: cols.map(
    x => array(
      count(x).over(
        Window.partitionBy(
          lower(col("FName")),
          lower(col("LName")),
          col(x)
        )
      ),
      col(x)
    ).alias(x ++ "_freq")
  ): _*
)
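For Girish Hm, for example, Adrress_freq becomes [3, 10oak] on rows t1, t8 and t11 (10oak occurs three times within that name group) and [1, 11oak] on row t4, so the row carrying the most frequent value also carries the largest array.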
Then group by each person and aggregate each array column to its maximum, which carries the highest-frequency value (the dummy count is only there because agg requires a first Column argument before the varargs list):
.groupBy(
  lower(col("FName")).alias("FName"),
  lower(col("LName")).alias("LName"))
.agg(
  count($"*").alias("dummy"),
  cols.map(
    x => max(col(x ++ "_freq"))(1).alias(x)
  ): _*
)
.drop("dummy")
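This works because Spark orders arrays element-wise: max keeps the array with the largest count, and the (1) index then extracts the value that achieved it. A minimal sketch of just this step, assuming a spark-shell session (the demo columns freq and value are invented for illustration):
import org.apache.spark.sql.functions._
import spark.implicits._
// two [frequency, value] pairs; max compares the arrays element-wise,
// so [3, 10oak] wins and the (1) index pulls out "10oak"
val demo = Seq((3L, "10oak"), (1L, "11oak")).toDF("freq", "value")
demo.agg(max(array($"freq", $"value"))(1).alias("mode")).show
+-----+
| mode|
+-----+
|10oak|
+-----+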
Overall code:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // for the $"..." column syntax

val cols = Array("Adrress", "sex", "levelId")

// space-delimited input with a header row
val df = spark.read.option("header", "true").option("delimiter", " ").option("inferSchema", "true").csv("names.txt")

val df2 = (df
  // attach a [frequency, value] array for each column of interest
  .select(col("*") +: cols.map(x => array(count(x).over(Window.partitionBy(lower(col("FName")), lower(col("LName")), col(x))), col(x)).alias(x ++ "_freq")): _*)
  // group case-insensitively by first and last name
  .groupBy(lower(col("FName")).alias("FName"), lower(col("LName")).alias("LName"))
  // for each column, keep the value with the highest frequency
  .agg(count($"*").alias("dummy"), cols.map(x => max(col(x ++ "_freq"))(1).alias(x)): _*)
  .drop("dummy"))
df2.show
+-------+-------+-------+---+-------+
| FName| LName|Adrress|sex|levelId|
+-------+-------+-------+---+-------+
| sara|chauhan| 15nvi| f| 6666|
|prakash| jha| 18nvi| f| 3333|
| girish| hm| 10oak| m| 1111|
| kiran| kumar| 5wren| f| 2222|
+-------+-------+-------+---+-------+
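One caveat with the array approach: array coerces its elements to a common type, so for string columns such as Adrress the count is cast to string, and counts of 10 or more would compare lexicographically ("10" < "9"). That cannot happen with this small sample, but for larger groups a struct keeps the count numeric. A sketch of that variant, under the same setup as above:
val df3 = (df
  .select(col("*") +: cols.map(x => struct(count(x).over(Window.partitionBy(lower(col("FName")), lower(col("LName")), col(x))).alias("cnt"), col(x).alias("v")).alias(x ++ "_freq")): _*)
  .groupBy(lower(col("FName")).alias("FName"), lower(col("LName")).alias("LName"))
  .agg(count($"*").alias("dummy"), cols.map(x => max(col(x ++ "_freq"))("v").alias(x)): _*)
  .drop("dummy"))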
Source: https://stackoverflow.com/questions/65102491/select-best-record-possible