How to create a DataFrame out of rows while retaining existing schema?

前端未结

关注

 4  1721

太阳男子 2020-12-18 15:34

If I call map or mapPartition and my function receives rows from PySpark what is the natural way to create either a local PySpark or Pandas DataFrame? Something

4条回答

渐次进展 (楼主)

2020-12-18 16:23
Spark >= 2.3.0

Since Spark 2.3.0 it is possible to use Pandas Series or DataFrame by partition or group. See for example:
- Applying UDFs on GroupedData in PySpark (with functioning python example)
- Efficient string suffix detection
Spark < 2.3.0

what is the natural way to create either a local PySpark

There is no such thing. Spark distributed data structures cannot be nested or you prefer another perspective you cannot nest actions or transformations.

or Pandas DataFrame

It is relatively easy but you have to remember at least few things:
- Pandas and Spark DataFrames are not even remotely equivalent. These are different structures, with different properties and in general you cannot replace one with another.
- Partitions can be empty.
- It looks like you're passing dictionaries. Remember that base Python dictionary is unordered (unlike collections.OrderedDict for example). So passing columns may not work as expected.
```
import pandas as pd

rdd = sc.parallelize([
    {"x": 1, "y": -1}, 
    {"x": -3, "y": 0},
    {"x": -0, "y": 4}
])

def combine(iter):
    rows = list(iter)
    return [pd.DataFrame(rows)] if rows else []

rdd.mapPartitions(combine).first()
##    x  y
## 0  1 -1
```
0 讨论(0)

查看其它4个回答
发布评论:

提交评论
- 加载中...