What is the equivalent of OFFSET in Spark SQL?

Question

I have a result set of 100 rows from Spark SQL, and I want the final result to contain only rows 6 through 15. In SQL we use OFFSET to skip rows: for example, OFFSET 5 LIMIT 10 returns rows 6 to 15. How can I achieve the same in Spark SQL?

Answer 1:

As far as I know, Spark SQL does not support OFFSET, so I use an id column as the filter condition instead and retrieve only N rows at a time.

The following is my sample code:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv')\
        .options(header='false', inferschema='true')\
        .load('your.csv')
sqlContext.registerDataFrameAsTable(df, "table")

batch_size = 10 ** 5
res = sqlContext.sql("select min(C0), max(C0) from table").collect()
# Read the aggregates by position; the auto-generated column names differ between Spark versions.
index = int(res[0][0]) - 1
N_max = int(res[0][1])
while index < N_max:
    prev = index
    # C0 is assumed to be an integer id, so compare it numerically.
    sql = "select C0, C1, C2, C3 from table \
            where C0 > %d and C0 <= %d \
            order by C0 limit %d" % (index, index+batch_size, batch_size)
    res = sqlContext.sql(sql).collect()
    # do something ...

    # Advance to the next id window.
    index = prev + batch_size
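
The id-range loop above pages by key value, so it only lines up with row positions when the ids are dense and unique. To get exactly rows 6 to 15 as in the question, one common workaround is to number the rows with ROW_NUMBER() and filter on that number. The following is a minimal sketch, not a definitive implementation: the helper name offset_limit_query and the alias rn are hypothetical, the table and column names are taken from the sample code above, and it assumes a Spark version and context in which window functions are available.

```python
def offset_limit_query(table, order_col, offset, limit, columns="*"):
    """Build a Spark SQL query that emulates OFFSET/LIMIT with ROW_NUMBER().

    Hypothetical helper for illustration. Identifiers are interpolated as-is,
    so pass only trusted table and column names.
    """
    if offset < 0 or limit < 0:
        raise ValueError("offset and limit must be non-negative")
    first = offset + 1      # first row number to keep (1-based)
    last = offset + limit   # last row number to keep (inclusive)
    return (
        "SELECT {cols} FROM ("
        "SELECT *, ROW_NUMBER() OVER (ORDER BY {order}) AS rn FROM {table}"
        ") numbered WHERE rn BETWEEN {first} AND {last} ORDER BY rn"
    ).format(cols=columns, order=order_col, table=table, first=first, last=last)


# OFFSET 5 LIMIT 10, i.e. rows 6 to 15 when ordered by C0.
query = offset_limit_query("table", "C0", offset=5, limit=10,
                           columns="C0, C1, C2, C3")
print(query)
# With the sqlContext from the sample code above:
# rows = sqlContext.sql(query).collect()
```

Note that a window with a global ORDER BY and no PARTITION BY moves all rows into a single partition, so this suits small or pre-filtered result sets such as the 100 rows in the question. Newer Spark releases (3.4 and later, as far as I know) also accept an OFFSET clause directly, which would make the workaround unnecessary there.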

Source: https://stackoverflow.com/questions/42591763/what-the-equivalent-of-offset-in-spark-sql

Tags

apache-spark

apache-spark-sql

spark-dataframe
