Sorting numeric String in Spark Dataset

旧时模样 提交于 2020-05-29 06:14:17

问题


Let's assume that I have the following Dataset:

+-----------+----------+
|productCode|    amount|
+-----------+----------+
|      XX-13|       300|
|       XX-1|       250|
|       XX-2|       410|
|       XX-9|        50|
|      XX-10|        35|
|     XX-100|       870|
+-----------+----------+

Where productCode is of String type and the amount is an Int.

If one will try to order this by productCode the result will be (and this is expected because of nature of String comparison):

def orderProducts(product: Dataset[Product]): Dataset[Product] = {
    product.orderBy("productCode")
}

// Output:
+-----------+----------+
|productCode|    amount|
+-----------+----------+
|       XX-1|       250|
|      XX-10|        35|
|     XX-100|       870|
|      XX-13|       300|
|       XX-2|       410|
|       XX-9|        50|
+-----------+----------+

How can I get an output ordered by Integer part of the productCode like below considering Dataset API?

+-----------+----------+
|productCode|    amount|
+-----------+----------+
|       XX-1|       250|
|       XX-2|       410|
|       XX-9|        50|
|      XX-10|        35|
|      XX-13|       300|
|     XX-100|       870|
+-----------+----------+

回答1:


Use the expression in the orderBy. Check this out:

scala> val df = Seq(("XX-13",300),("XX-1",250),("XX-2",410),("XX-9",50),("XX-10",35),("XX-100",870)).toDF("productCode", "amt")
df: org.apache.spark.sql.DataFrame = [productCode: string, amt: int]

scala> df.orderBy(split('productCode,"-")(1).cast("int")).show
+-----------+---+
|productCode|amt|
+-----------+---+
|       XX-1|250|
|       XX-2|410|
|       XX-9| 50|
|      XX-10| 35|
|      XX-13|300|
|     XX-100|870|
+-----------+---+


scala>

With window functions, you could do like

scala> df.withColumn("row1",row_number().over(Window.orderBy(split('productCode,"-")(1).cast("int")))).show(false)
18/12/10 09:25:07 WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-----------+---+----+
|productCode|amt|row1|
+-----------+---+----+
|XX-1       |250|1   |
|XX-2       |410|2   |
|XX-9       |50 |3   |
|XX-10      |35 |4   |
|XX-13      |300|5   |
|XX-100     |870|6   |
+-----------+---+----+


scala>

Note that spark complains of moving all data to single partition.



来源:https://stackoverflow.com/questions/53707368/sorting-numeric-string-in-spark-dataset

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!