What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?

名媛妹妹 2020-11-27 05:05

While fetching data from SQL Server via a JDBC connection in Spark, I found that I can set some parallelization parameters like partitionColumn, lowerBoun

4 answers
  •  自闭症患者
    2020-11-27 05:31

    It is simple:

    • partitionColumn is the column whose values are used to split the table into partitions. It must be a numeric, date, or timestamp column.
    • lowerBound and upperBound are used only to compute the partition stride. They do not filter rows: every row in the table is read, and values below lowerBound or above upperBound end up in the first or last partition. The complete dataset therefore still corresponds to:

      SELECT * FROM table
      
    • numPartitions determines the number of partitions to create, and with it the maximum number of concurrent JDBC connections. The range between lowerBound and upperBound is divided into numPartitions slices, each with a stride equal to:

      upperBound / numPartitions - lowerBound / numPartitions
      

      For example, if:

      • lowerBound: 0
      • upperBound: 1000
      • numPartitions: 10

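      The stride is equal to 100, and the partitions correspond to the following queries. The first and last partitions are open-ended, so rows outside the bounds (and NULLs) are still fetched, and the half-open ranges do not overlap:

      • SELECT * FROM table WHERE partitionColumn < 100 OR partitionColumn IS NULL
      • SELECT * FROM table WHERE partitionColumn >= 100 AND partitionColumn < 200
      • ...
      • SELECT * FROM table WHERE partitionColumn >= 900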
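
To make the stride arithmetic concrete, here is a minimal Python sketch that generates one WHERE predicate per partition from the stride formula above. It is only an illustration of the idea, not Spark's actual code: the function name partition_predicates is hypothetical, the bounds are assumed to be non-negative integers, and real Spark versions may differ in details such as rounding or reducing numPartitions for very small ranges.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Build one WHERE predicate per partition (illustrative sketch only)."""
    # Stride formula from the answer, using integer division.
    # Assumption: non-negative integer bounds.
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        # The last partition has no upper limit.
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            predicates.append(lower)
        elif lower is None:
            # NULL values are assigned to the first partition.
            predicates.append(f"{upper} OR {column} IS NULL")
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


preds = partition_predicates("partitionColumn", 0, 1000, 10)
for p in preds:
    print(f"SELECT * FROM table WHERE {p}")
```

Because the first and last predicates are open-ended, choosing bounds that are too narrow does not drop rows; it only makes the edge partitions larger than the rest.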
