Scala Spark - Dealing with hierarchical data tables

Submitted by 旧时模样 on 2019-12-08 05:02:58

Question


I have a data table with a hierarchical (tree-structured) data model. For example, here are some sample rows:

-------------------------------------------
Id | name    |parentId | path       | depth
-------------------------------------------
55 | Canada  | null    | null       | 0
77 | Ontario |  55     | /55        | 1
100| Toronto |  77     | /55/77     | 2
104| Brampton| 100     | /55/77/100 | 3

I am looking to convert those rows into a flattened version; sample output would be:

-----------------------------------
Id | name     | parentId | depth
------------------------------------
104| Brampton | Toronto  | 3
100| Toronto  | Ontario  | 2
77 | Ontario  | Canada   | 1
55 | Canada   | None     | 0
100| Toronto  | Ontario  | 2
77 | Ontario  | Canada   | 1
55 | Canada   | None     | 0
77 | Ontario  | Canada   | 1
55 | Canada   | None     | 0
55 | Canada   | None     | 0

I tried using a cartesian join and an n² style search, but neither of them is working.


Answer 1:


Below is one way:

// Imports needed below (spark.implicits is pre-imported in the Spark shell)
import org.apache.spark.sql.functions._
import spark.implicits._

// Creating a DF with your data
def getSeq(s: String): Seq[String] = s.split('|').map(_.trim).toSeq
var l = getSeq("77 | Ontario |  55     | /55        | 1") :: Nil
l :+= getSeq("55 | Canada  | null    | null       | 0")
l :+= getSeq("100| Toronto |  77     | /55/77     | 2")
l :+= getSeq("104| Brampton| 100     | /55/77/100 | 3")
val df = l.map { case Seq(a, b, c, d, e) => (a, b, c, d, e) }.toDF("Id", "name", "parentId", "path", "depth")

// Original DF with parentName, obtained via a self join
val dfWithPar = df.as("df1")
  .join(df.as("df2"), $"df1.parentId" === $"df2.Id", "leftouter")
  .select($"df1.Id", $"df1.name", $"df1.parentId", $"df1.path", $"df1.depth", $"df2.name".as("parentName"))

// Split the path as per the requirement and get the exploded DF
val dfExploded = dfWithPar
  .withColumn("path", regexp_replace($"path", "^/", ""))
  .withColumn("path", split($"path", "/"))
  .withColumn("path", explode($"path"))

// Join the original with the exploded DF to get the extra rows, one per path placeholder
val dfJoined = dfWithPar.join(dfExploded, dfWithPar.col("Id") === dfExploded.col("path"))
  .select(dfWithPar.col("Id"), dfWithPar.col("name"), dfWithPar.col("parentId"),
    dfWithPar.col("path"), dfWithPar.col("depth"), dfWithPar.col("parentName"))

// Get the final result by appending the extra rows to the original
dfWithPar.union(dfJoined).select($"Id", $"name", $"parentName", $"depth").show

+---+--------+----------+-----+
| Id|    name|parentName|depth|
+---+--------+----------+-----+
| 77| Ontario|    Canada|    1|
| 55|  Canada|      null|    0|
|100| Toronto|   Ontario|    2|
|104|Brampton|   Toronto|    3|
| 77| Ontario|    Canada|    1|
| 77| Ontario|    Canada|    1|
| 55|  Canada|      null|    0|
| 55|  Canada|      null|    0|
| 55|  Canada|      null|    0|
|100| Toronto|   Ontario|    2|
+---+--------+----------+-----+



Answer 2:


Self joins with conditions and selecting the appropriate columns should work for you.

The solution is a bit tricky, as you need to find every parent name in the path column, including the one referenced by the parentId column, which requires the concat_ws, split and explode built-in functions. The rest of the process is joins, selects and fills.

Given the dataframe:

+---+--------+--------+----------+-----+
|Id |name    |parentId|path      |depth|
+---+--------+--------+----------+-----+
|55 |Canada  |null    |null      |0    |
|77 |Ontario |55      |/55       |1    |
|100|Toronto |77      |/55/77    |2    |
|104|Brampton|100     |/55/77/100|3    |
+---+--------+--------+----------+-----+
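For completeness, here is a minimal sketch of how that dataframe could be constructed (the answer assumes it already exists; all columns are assumed to be strings so that the null values and na.fill calls below behave as shown):

import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical construction of the input dataframe, all columns as strings
val df = Seq[(String, String, String, String, String)](
  ("55", "Canada", null, null, "0"),
  ("77", "Ontario", "55", "/55", "1"),
  ("100", "Toronto", "77", "/55/77", "2"),
  ("104", "Brampton", "100", "/55/77/100", "3")
).toDF("Id", "name", "parentId", "path", "depth")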

You can generate a temporary dataframe for the final join as

val df2 = df.as("table1")
  .join(df.as("table2"), col("table1.parentId") === col("table2.Id"), "left")
  .select(col("table1.Id").as("path"), col("table1.name").as("name"), col("table2.name").as("parentId"), col("table1.depth").as("depth"))
  .na.fill("None")
//    +----+--------+--------+-----+
//    |path|name    |parentId|depth|
//    +----+--------+--------+-----+
//    |55  |Canada  |None    |0    |
//    |77  |Ontario |Canada  |1    |
//    |100 |Toronto |Ontario |2    |
//    |104 |Brampton|Toronto |3    |
//    +----+--------+--------+-----+

And the required dataframe can be achieved by doing

df.withColumn("path", explode(split(concat_ws("", col("parentId"), col("path")), "/")))
    .as("table1")
    .join(df2.as("table2"), Seq("path"), "right")
    .select(col("table2.path").as("Id"), col("table2.name").as("name"), col("table2.parentId").as("parentId"), col("table2.depth").as("depth"))
    .na.fill("0")
  .show(false)
//    +---+--------+--------+-----+
//    |Id |name    |parentId|depth|
//    +---+--------+--------+-----+
//    |55 |Canada  |None    |0    |
//    |55 |Canada  |None    |0    |
//    |55 |Canada  |None    |0    |
//    |55 |Canada  |None    |0    |
//    |77 |Ontario |Canada  |1    |
//    |77 |Ontario |Canada  |1    |
//    |77 |Ontario |Canada  |1    |
//    |100|Toronto |Ontario |2    |
//    |100|Toronto |Ontario |2    |
//    |104|Brampton|Toronto |3    |
//    +---+--------+--------+-----+

Explanation

For the |104|Brampton|100 |/55/77/100|3 | row:
concat_ws("", col("parentId"), col("path")) would generate |104|Brampton|100 |100/55/77/100|3 |, where you can see 100 being concatenated at the front.
split(concat_ws("", col("parentId"), col("path")), "/") would generate an array column: |104|Brampton|100 |[100, 55, 77, 100]|3 |.
And explode(split(concat_ws("", col("parentId"), col("path")), "/")) as a whole would explode the array column into separate rows:

|104|Brampton|100     |100     |3    |
|104|Brampton|100     |55      |3    |
|104|Brampton|100     |77      |3    |
|104|Brampton|100     |100     |3    |
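
If you want to inspect that intermediate step yourself, the same expression can be run on its own (a quick check, assuming the sketch dataframe from above):

// Show only the Brampton row after the explode, to see the generated path values
df.withColumn("path", explode(split(concat_ws("", col("parentId"), col("path")), "/")))
  .filter(col("Id") === "104")
  .show(false)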

The joins are straightforward enough that they don't need an explanation ;)

I hope the answer is helpful




Answer 3:


Here is another version:

import org.apache.spark.SparkConf
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val sparkConf = new SparkConf().setAppName("pathtest").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()

import spark.implicits._

val dfA = spark.createDataset(Seq(
  (55, "Canada", -1, "", 0),
  (77, "Ontario", 55, "/55", 1),
  (100, "Toronto", 77, "/55/77", 2),
  (104, "Brampton", 100, "/55/77/100", 3))
).toDF("Id", "name", "parentId", "path", "depth")


def getArray = udf((path: String) => {
  if (path.contains("/"))
    path.split("/")
  else
    Array[String](null)
})

val dfB = dfA
    .withColumn("path", getArray(col("path")))
    .withColumn("path", explode(col("path")))
    .toDF()

dfB.as("B").join(dfA.as("A"), $"B.parentId" === $"A.Id", "left")
  .select($"B.Id".as("Id"), $"B.name".as("name"), $"A.name".as("parent"), $"B.depth".as("depth"))
    .show()

There are two dataframes: dfA, and dfB which is generated from the first one. dfB is produced with a UDF that explodes the array of path values. Note that the trick for Canada is to return a single-element array containing null (rather than an empty array), because explode would not generate any row for an empty array.
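
As a side note, on Spark 2.2+ the built-in explode_outer could serve the same purpose as the UDF trick, since it still emits a row when the array is null or empty (a sketch, not part of the original answer):

// Hypothetical alternative: explode_outer keeps rows whose split result is null or
// empty, so the path can be split and exploded without a custom UDF.
val dfB2 = dfA.withColumn("path", explode_outer(split(col("path"), "/")))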

dfB looks like this:

+---+--------+--------+----+-----+
| Id|    name|parentId|path|depth|
+---+--------+--------+----+-----+
| 55|  Canada|      -1|null|    0|
| 77| Ontario|      55|    |    1|
| 77| Ontario|      55|  55|    1|
|100| Toronto|      77|    |    2|
|100| Toronto|      77|  55|    2|
|100| Toronto|      77|  77|    2|
|104|Brampton|     100|    |    3|
|104|Brampton|     100|  55|    3|
|104|Brampton|     100|  77|    3|
|104|Brampton|     100| 100|    3|
+---+--------+--------+----+-----+ 

And the final results as next:

+---+--------+-------+-----+
| Id|    name| parent|depth|
+---+--------+-------+-----+
| 55|  Canada|   null|    0|
| 77| Ontario| Canada|    1|
| 77| Ontario| Canada|    1|
|100| Toronto|Ontario|    2|
|100| Toronto|Ontario|    2|
|100| Toronto|Ontario|    2|
|104|Brampton|Toronto|    3|
|104|Brampton|Toronto|    3|
|104|Brampton|Toronto|    3|
|104|Brampton|Toronto|    3|
+---+--------+-------+-----+


Source: https://stackoverflow.com/questions/49371355/scala-spark-dealing-with-hierarchy-data-tables
