Expand array-of-structs into columns in PySpark

江枫思渺然 提交于 2019-12-04 21:40:16

Joins are very costly operation because it results in data shuffling. If you can, you should avoid it or look to optimize it.

There are two joins in your code. The last join get the columns back can be avoided altogether. The other join with metadata dataframe can be optimized. Since metadata df has only 250 rows and is very, you can use broadcast() hint in the join. This would avoid shuffling of the larger dataframe.

I have made some suggested code changes but its not tested since I don't have your data.

# df columns list
df_columns = df.columns

# Explode customDimensions so that each row now has a {index, value}
cd = df.withColumn('customDimensions', F.explode(cd.customDimensions))

# Put the index and value into their own columns
cd = cd.select(*df_columns, 'customDimensions.index', 'customDimensions.value')

# Join with metadata to obtain the name from the index
metadata = metadata.select('index', 'name')
cd = cd.join(broadcast(metadata), "index", 'left')

# Pivot cd so that each row has the id, and we have columns for each custom dimension
piv = cd.groupBy(df_columns).pivot('name').agg(F.first(F.col('value')))


return piv
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!