How to get an Iterator of Rows using Dataframe in SparkSQL

Asked by 自闭症患者 on 2020-12-19 05:55

I have an application in SparkSQL which returns a large number of rows that are very difficult to fit in memory, so I will not be able to use the collect function on the DataFrame. Is there a way to get an iterator over the rows instead?

2 Answers
  •  长情又很酷, answered 2020-12-19 06:42

    Actually, you can just use df.toLocalIterator. Here is the reference in the Spark source code:

    /**
     * Return an iterator that contains all of [[Row]]s in this Dataset.
     *
     * The iterator will consume as much memory as the largest partition in this Dataset.
     *
     * Note: this results in multiple Spark jobs, and if the input Dataset is the result
     * of a wide transformation (e.g. join with different partitioners), to avoid
     * recomputing the input Dataset should be cached first.
     *
     * @group action
     * @since 2.0.0
     */
    def toLocalIterator(): java.util.Iterator[T] = withCallback("toLocalIterator", toDF()) { _ =>
      withNewExecutionId {
        queryExecution.executedPlan.executeToIterator().map(boundEnc.fromRow).asJava
      }
    }
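
    For context, here is a minimal driver-side usage sketch of toLocalIterator. The SparkSession setup, the input path, and the per-row processing are illustrative assumptions, not part of the question:

    import org.apache.spark.sql.{Row, SparkSession}

    val spark = SparkSession.builder().appName("RowIteratorExample").getOrCreate()

    // Hypothetical input; any DataFrame too large to collect() works the same way.
    val df = spark.read.parquet("/path/to/large/table")

    // Cache first if df comes from a wide transformation, so the multiple Spark
    // jobs triggered by toLocalIterator do not recompute the whole lineage.
    df.cache()

    // Only one partition at a time needs to fit in driver memory.
    val rows: java.util.Iterator[Row] = df.toLocalIterator()
    while (rows.hasNext) {
      val row: Row = rows.next()
      // process one Row at a time, e.g. println(row)
    }

    Unlike collect(), which materializes the whole result on the driver, this streams rows partition by partition, at the cost of running several Spark jobs.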
    
