How to get an Iterator of Rows using Dataframe in SparkSQL

Backend · unresolved · 2 answers · 802 views
自闭症患者 asked 2020-12-19 05:55

I have an application in SparkSQL which returns a large number of rows that are very difficult to fit in memory, so I will not be able to use the collect function on the DataFrame. Is there a way to get an iterator over the rows instead?

2 Answers
  • 2020-12-19 06:31

    Generally speaking, transferring all the data to the driver is a pretty bad idea, and most of the time there is a better solution, but if you really want to go this route you can use the toLocalIterator method on an RDD:

    val df: org.apache.spark.sql.DataFrame = ???
    df.cache() // Optional, to avoid repeated computation; see docs for details
    val iter: Iterator[org.apache.spark.sql.Row] = df.rdd.toLocalIterator
    
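    To make this concrete, here is a minimal sketch of consuming that iterator on the driver in fixed-size batches. The `processLocally` helper, the batch size of 1000, and the `println` processing are illustrative assumptions, not part of the answer above:

    ```scala
    import org.apache.spark.sql.{DataFrame, Row}

    // Sketch: process rows on the driver one partition at a time,
    // without collecting the whole DataFrame into memory.
    // `df` is assumed to be an already-built DataFrame.
    def processLocally(df: DataFrame): Unit = {
      df.cache() // avoid recomputing the lineage for every partition fetch
      val iter: Iterator[Row] = df.rdd.toLocalIterator
      // Group into small batches so per-row work (e.g. external writes)
      // can be amortized; the batch size is an arbitrary choice here.
      iter.grouped(1000).foreach { batch =>
        batch.foreach(row => println(row)) // illustrative processing
      }
      df.unpersist()
    }
    ```

    Only one partition's rows are held in driver memory at a time, which is what makes this workable when the full result does not fit.
    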
  • 2020-12-19 06:42

    Actually you can just use df.toLocalIterator. Here is the reference in the Spark source code:

    /**
     * Return an iterator that contains all of [[Row]]s in this Dataset.
     *
     * The iterator will consume as much memory as the largest partition in this Dataset.
     *
     * Note: this results in multiple Spark jobs, and if the input Dataset is the result
     * of a wide transformation (e.g. join with different partitioners), to avoid
     * recomputing the input Dataset should be cached first.
     *
     * @group action
     * @since 2.0.0
     */
    def toLocalIterator(): java.util.Iterator[T] = withCallback("toLocalIterator", toDF()) { _ =>
      withNewExecutionId {
        queryExecution.executedPlan.executeToIterator().map(boundEnc.fromRow).asJava
      }
    }
    
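    Note that Dataset.toLocalIterator() returns a java.util.Iterator[T], so a Scala caller will usually want to convert it. A minimal sketch, where the `iterateRows` helper and the `println` processing are illustrative assumptions:

    ```scala
    import scala.collection.JavaConverters._
    import org.apache.spark.sql.{DataFrame, Row}

    // Sketch: iterate a Dataset on the driver via the built-in action.
    // Per the docstring above, toLocalIterator runs multiple Spark jobs,
    // so cache first if the input is the result of a wide transformation.
    def iterateRows(df: DataFrame): Unit = {
      df.cache()
      val rows: Iterator[Row] = df.toLocalIterator().asScala
      rows.foreach(row => println(row)) // illustrative processing
      df.unpersist()
    }
    ```

    As the docstring says, this consumes at most as much driver memory as the largest partition of the Dataset.
    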