Currently working with PySpark: there is no map function on DataFrame, so one has to drop down to the RDD API to get map, as sketched below.
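For instance, a minimal sketch of that RDD round-trip (the session, DataFrame, and column names here are illustrative, not from the original):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Drop to the RDD API, map over Row objects, and convert back to a DataFrame.
mapped = (
    df.rdd
    .map(lambda row: Row(id=row.id * 2, label=row.label.upper()))
    .toDF()
)
mapped.show()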
function. In Scala there is a
Dataset.map
is not part of the DataFrame
(Dataset[Row]
) API. It transforms strongly typed Dataset[T]
into strongly typed Dataset[U]
:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
and there is simply no place for Python in the strongly typed Dataset world. In general, Datasets are native JVM objects (unlike RDDs, they have no Python-specific implementation) that depend heavily on the rich Scala type system (even the Java API is severely limited). Even if Python implemented some variant of the Encoder API, data would still have to be converted to an RDD for computation.
In contrast, Python implements its own map-like mechanism with vectorized udfs, released in Spark 2.3. It focuses on a high-performance serde implementation (based on Apache Arrow) coupled with the Pandas API. For example:
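Here is a minimal sketch of a scalar vectorized udf, using the Spark >= 3.0 type-hint syntax (the DataFrame and function name are illustrative):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

# A SCALAR-style pandas udf: the function receives a pandas Series per
# Arrow batch and returns a Series of the same length, so the work is
# vectorized rather than executed row-at-a-time.
@pandas_udf("bigint")
def times_two(v: pd.Series) -> pd.Series:
    return v * 2

df.select(times_two("id").alias("doubled")).show()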
The mechanism covers both typical udfs (in particular the SCALAR and SCALAR_ITER variants) as well as map-like variants, GROUPED_MAP and MAP_ITER, applied through GroupedData.apply and DataFrame.mapInPandas (Spark >= 3.0.0) respectively.
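As a rough illustration, here is a sketch of both map-like variants using the Spark >= 3.0 entry points (GroupedData.applyInPandas is the newer equivalent of passing a GROUPED_MAP udf to GroupedData.apply; all names and data below are illustrative):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ["id", "v"])

# Grouped map: each group arrives as a pandas DataFrame, and a pandas
# DataFrame matching the declared schema is returned.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double").show()

# Map over Arrow batches: the function receives an iterator of pandas
# DataFrames and yields transformed ones, much like RDD.mapPartitions.
def double_values(batches):
    for pdf in batches:
        yield pdf.assign(v=pdf.v * 2)

df.mapInPandas(double_values, schema="id long, v double").show()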