Spark SQL get max & min dynamically from datasource

Submitted by 我们两清 on 2019-11-30 09:29:55

Question


I am using Spark SQL to fetch the full contents of an Oracle table (more than 1,800k records) every day. The application hangs when I read straight from Oracle, so I used the partitionColumn, lowerBound & upperBound options. The problem is: how can I get the lowerBound & upperBound values of the primary-key column dynamically? Their values change every day. Can anyone show me a sample solution for this problem?


Answer 1:


Just fetch the required values from the database:

url = ...
properties = ...
partition_column = ...
table = ...

# Push the aggregation down to the database. Note that Oracle does
# not accept the AS keyword in table aliases, so the subquery alias
# is written without it.
query = "(SELECT min({0}), max({0}) FROM {1}) tmp".format(
    partition_column, table
)

(lower_bound, upper_bound) = (spark.read
    .jdbc(url=url, table=query, properties=properties)
    .first())
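One edge case worth guarding against (not covered in the original answer): if the table is empty, MIN/MAX come back as NULL, so `first()` yields None values and the `upper_bound + 1` arithmetic below raises a TypeError. A minimal sketch of such a guard; the helper name `validate_bounds` is my own, not part of any Spark API:

```python
def validate_bounds(lower, upper):
    """Reject the bounds when the source table was empty.

    An empty table makes MIN/MAX return NULL, which JDBC surfaces
    as None; arithmetic such as upper + 1 would then fail.
    """
    if lower is None or upper is None:
        raise ValueError("source table is empty; nothing to partition")
    # JDBC drivers may hand back Decimal for Oracle NUMBER columns,
    # so normalise to plain ints before doing bound arithmetic.
    return int(lower), int(upper)
```

Call it as `lower_bound, upper_bound = validate_bounds(lower_bound, upper_bound)` before building the partitioned read.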

and pass them to the main query:

num_partitions = ...

spark.read.jdbc(
    url, table, 
    column=partition_column, 
    # Make upper bound inclusive 
    lowerBound=lower_bound, upperBound=upper_bound + 1, 
    numPartitions=num_partitions, properties=properties
)
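For intuition about what Spark does with these options: it splits the [lowerBound, upperBound) range into numPartitions WHERE clauses, one per concurrent JDBC query. A simplified sketch of that splitting (the real logic lives in Spark's JDBCRelation.columnPartition; this is an approximation for illustration, not the exact implementation):

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Approximate the WHERE clauses Spark generates for a
    partitioned JDBC read. Assumes num_partitions >= 2 and an
    integer partition column."""
    stride = (upper - lower) // num_partitions
    preds = []
    for i in range(num_partitions):
        start = lower + i * stride
        if i == 0:
            # First partition also picks up NULLs in the column.
            preds.append("{0} < {1} OR {0} IS NULL".format(column, start + stride))
        elif i == num_partitions - 1:
            # Last partition is open-ended so no rows are dropped.
            preds.append("{0} >= {1}".format(column, start))
        else:
            preds.append("{0} >= {1} AND {0} < {2}".format(column, start, start + stride))
    return preds
```

For example, with bounds 1 and 101 and 4 partitions this yields clauses covering ID < 26 (plus NULLs), 26–50, 51–75, and ID >= 76, which is why rows outside [lowerBound, upperBound] are still read, just all by the first or last partition.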


来源:https://stackoverflow.com/questions/41445537/spark-sql-get-max-min-dynamically-from-datasource
