Why is there no map function for DataFrame in PySpark, while the Spark (Scala) equivalent has it?

Asked by 梦如初夏 on 2020-12-20 18:06

Currently working on PySpark. There is no map function on DataFrame, and one has to go to the RDD for a map function. In Scala there is a map method on Dataset.
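For illustration, a minimal sketch of the usual PySpark workaround (schema and names hypothetical): drop to the RDD, map a plain Python function over the rows, and convert back to a DataFrame.

```python
def double_age(row):
    # Plain Python function applied per row; `row` behaves like a
    # tuple (name, age). Hypothetical schema, for illustration only.
    name, age = row
    return (name, age * 2)

# With an active SparkSession `spark` (not created here), the detour is:
# df = spark.createDataFrame([("alice", 21)], ["name", "age"])
# df.rdd.map(double_age).toDF(["name", "age"]).show()
```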

1 Answer
  • 2020-12-20 19:01

    Dataset.map is not part of the DataFrame (Dataset[Row]) API. It transforms a strongly typed Dataset[T] into a strongly typed Dataset[U]:

    def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U] 
    

    and there is simply no place for Python in the strongly typed Dataset world. In general, Datasets are native JVM objects (unlike RDDs, they have no Python-specific implementation) that depend heavily on Scala's rich type system (even the Java API is severely limited). Even if Python implemented some variant of the Encoder API, data would still have to be converted to an RDD for computation.

    In contrast, Python implements its own map-like mechanism with vectorized UDFs, released in Spark 2.3. It is focused on a high-performance serde implementation coupled with the Pandas API.
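    As a sketch of the vectorized (SCALAR) variant: the function body receives a whole column as a pandas Series and returns a Series of the same length. The function below is hypothetical; only the commented registration step requires pyspark.

```python
import pandas as pd

def plus_one(s: pd.Series) -> pd.Series:
    # Body of a SCALAR pandas UDF: one Arrow batch of a column arrives
    # as a pandas Series; return a Series of equal length.
    return s + 1

# Registration with Spark (requires pyspark; shown for illustration):
# from pyspark.sql.functions import pandas_udf
# plus_one_udf = pandas_udf(plus_one, "long")
# df.select(plus_one_udf("age"))
```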

    That includes both typical UDFs (in particular the SCALAR and SCALAR_ITER variants) as well as map-like variants: GROUPED_MAP and MAP_ITER, applied through GroupedData.apply and DataFrame.mapInPandas (Spark >= 3.0.0) respectively.
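    The mapInPandas style can be sketched as follows: the user function takes an iterator of pandas DataFrames (one per Arrow batch) and yields transformed DataFrames. The function and column names below are hypothetical; only the commented call requires a SparkSession.

```python
import pandas as pd

def double_age_batches(batches):
    # mapInPandas-style function: consumes an iterator of pandas
    # DataFrames and yields one transformed DataFrame per batch.
    for pdf in batches:
        yield pdf.assign(age=pdf["age"] * 2)

# With an active SparkSession `spark` (not created here, Spark >= 3.0.0):
# df.mapInPandas(double_age_batches, schema="name string, age long")
```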
