Currently working with PySpark: there is no map function on DataFrame, so one has to drop down to the RDD API to get map, as sketched below.
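For instance, a minimal sketch of that RDD round-trip (the session, DataFrame, and column names here are illustrative, not from the original):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Drop to the RDD API, map over Row objects, and convert back to a DataFrame.
mapped = (
    df.rdd
    .map(lambda row: Row(id=row.id * 2, label=row.label.upper()))
    .toDF()
)
mapped.show()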
function. In Scala there is a
Dataset.map
is not part of the DataFrame
(Dataset[Row]
) API. It transforms strongly typed Dataset[T]
into strongly typed Dataset[U]
:
def map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
and there is simply no place for Python in the strongly typed Dataset world. In general, Datasets are native JVM objects (unlike RDDs, they have no Python-specific implementation) that depend heavily on the rich Scala type system (even the Java API is severely limited). Even if Python implemented some variant of the Encoder API, data would still have to be converted to an RDD for computation.
In contrast, Python implements its own map-like mechanism with vectorized udfs, released in Spark 2.3. It focuses on a high-performance serde implementation (based on Apache Arrow) coupled with the Pandas API. For example:
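Here is a minimal sketch of a scalar vectorized udf, using the Spark >= 3.0 type-hint syntax (the DataFrame and function name are illustrative):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

# A SCALAR-style pandas udf: the function receives a pandas Series per
# Arrow batch and returns a Series of the same length, so the work is
# vectorized rather than executed row-at-a-time.
@pandas_udf("bigint")
def times_two(v: pd.Series) -> pd.Series:
    return v * 2

df.select(times_two("id").alias("doubled")).show()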
The mechanism covers both typical udfs (in particular the SCALAR and SCALAR_ITER variants) as well as map-like variants, GROUPED_MAP and MAP_ITER, applied through GroupedData.apply and DataFrame.mapInPandas (Spark >= 3.0.0) respectively.
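As a rough illustration, here is a sketch of both map-like variants using the Spark >= 3.0 entry points (GroupedData.applyInPandas is the newer equivalent of passing a GROUPED_MAP udf to GroupedData.apply; all names and data below are illustrative):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0)], ["id", "v"])

# Grouped map: each group arrives as a pandas DataFrame, and a pandas
# DataFrame matching the declared schema is returned.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double").show()

# Map over Arrow batches: the function receives an iterator of pandas
# DataFrames and yields transformed ones, much like RDD.mapPartitions.
def double_values(batches):
    for pdf in batches:
        yield pdf.assign(v=pdf.v * 2)

df.mapInPandas(double_values, schema="id long, v double").show()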