What is the difference between sort and orderBy functions in Spark

礼貌的吻别 2020-12-15 05:06

What is the difference between sort and orderBy on a Spark DataFrame?

scala> zips.printSchema
root
 |-- _id: string (nullable = true)
 |-- city: string (nullab...

3 answers
  •  猫巷女王i
    2020-12-15 05:34

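    In the DataFrame API, sort() and orderBy() are the same operation: orderBy() is an alias for sort(), and both return a DataFrame that is globally ordered by the given columns. The per-bucket behaviour often attributed to sort() belongs to other functions. sortWithinPartitions() (SORT BY in SQL) sorts the rows inside each partition only, and DataFrameWriter.sortBy() sorts the rows inside each bucket when a bucketed table is written to the file system. Neither of those guarantees the overall order of the output data.

    A global ordering (sort(), orderBy(), or ORDER BY in SQL) happens in two phases. First the data is shuffled so that each partition holds a contiguous range of the sort key, then the rows inside each partition are sorted, in ascending or descending order of the specified columns. Spark does this with a range-partitioning shuffle rather than by bringing all the data into a single executor, but it still involves heavy shuffling and is a costly operation.

    A sort inside each individual partition or bucket, by contrast, needs no shuffle and is a lightweight operation.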

    Here is an example:

    Preparing data

    >>> listOfTuples = [(16,5000),(10,3000),(13,2600),(19,1800),(11,4000),(17,3100),(14,2500),(20,2000)]
    >>> tupleRDD = sc.parallelize(listOfTuples,2)
    >>> tupleDF = tupleRDD.toDF(["Id","Salary"])
    

    The data looks like this:

    >>> tupleRDD.glom().collect()
    [[(16, 5000), (10, 3000), (13, 2600), (19, 1800)], [(11, 4000), (17, 3100), (14, 2500), (20, 2000)]]
    >>> tupleDF.show()
    +---+------+
    | Id|Salary|
    +---+------+
    | 16|  5000|
    | 10|  3000|
    | 13|  2600|
    | 19|  1800|
    | 11|  4000|
    | 17|  3100|
    | 14|  2500|
    | 20|  2000|
    +---+------+
    

    Now the sort operation:

    >>> tupleDF.sort("id").show()
    +---+------+
    | Id|Salary|
    +---+------+
    | 10|  3000|
    | 11|  4000|
    | 13|  2600|
    | 14|  2500|
    | 16|  5000|
    | 17|  3100|
    | 19|  1800|
    | 20|  2000|
    +---+------+
    

    The output is fully ordered by Id across both partitions. Now the orderBy operation:

    >>> tupleDF.orderBy("id").show()
    +---+------+
    | Id|Salary|
    +---+------+
    | 10|  3000|
    | 11|  4000|
    | 13|  2600|
    | 14|  2500|
    | 16|  5000|
    | 17|  3100|
    | 19|  1800|
    | 20|  2000|
    +---+------+
    

    It also maintains the overall order of the data, and the output is identical to that of sort().
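
To make the distinction between sorting inside each partition and a global ordering concrete, here is a minimal Spark-free sketch in plain Python. It is only a model: the two inner lists stand in for the two partitions printed by glom() above, and the helper names sort_within_partitions and global_sort are hypothetical, not Spark functions. In PySpark the corresponding calls would be tupleDF.sortWithinPartitions("Id") and tupleDF.orderBy("Id").

```python
# Spark-free model: each inner list stands in for one partition,
# using the same rows that tupleRDD.glom().collect() printed above.
partitions = [
    [(16, 5000), (10, 3000), (13, 2600), (19, 1800)],
    [(11, 4000), (17, 3100), (14, 2500), (20, 2000)],
]


def sort_within_partitions(parts):
    """Sort each partition by Id independently; no rows move between partitions."""
    return [sorted(p, key=lambda row: row[0]) for p in parts]


def global_sort(parts):
    """Sort all rows by Id across partitions, which is what a total ordering yields."""
    return sorted((row for p in parts for row in p), key=lambda row: row[0])


# Read the Ids back in partition order to compare the two results.
local_ids = [row[0] for p in sort_within_partitions(partitions) for row in p]
global_ids = [row[0] for row in global_sort(partitions)]

print(local_ids)   # [10, 13, 16, 19, 11, 14, 17, 20]
print(global_ids)  # [10, 11, 13, 14, 16, 17, 19, 20]
```

The per-partition result is ordered only inside each partition, so reading the partitions one after another does not give a sorted sequence. The global result matches the Id column shown by both sort() and orderBy() above.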
