Splitting a dictionary in a Pyspark dataframe into individual columns

Submitted by 不想你离开。 on 2020-01-03 03:24:28

Question


I have a dataframe (in Pyspark) in which one column holds a dictionary in each row:

df.show()

And it looks like:

+----+---+-----------------------------+
|name|age|info                         |
+----+---+-----------------------------+
|rob |26 |{color: red, car: volkswagen}|
|evan|25 |{color: blue, car: mazda}    |
+----+---+-----------------------------+

To give more detail, as requested in the comments:

df.printSchema()

All of the columns are strings:

root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- info: string (nullable = true)

Is it possible to take the keys from the dictionary (color and car) and make them columns in the dataframe, and have the values be the rows for those columns?

Expected Result:

+----+---+-----+----------+
|name|age|color|car       |
+----+---+-----+----------+
|rob |26 |red  |volkswagen|
|evan|25 |blue |mazda     |
+----+---+-----+----------+

I wasn't sure whether I have to use df.withColumn() and somehow iterate through the dictionary, picking out each key and making a column out of it. I've looked for answers, but most of them used Pandas rather than Spark, so I'm not sure whether the same logic applies.


Answer 1:


Your strings:

"{color: red, car: volkswagen}"
"{color: blue, car: mazda}"

are not in a Python-friendly format. They can't be parsed using json.loads, nor can they be evaluated using ast.literal_eval.
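To see why, here is a small plain-Python check (no Spark needed). The helper name try_parse is made up for this illustration; it just reports which exception each parser raises. The keys and values are unquoted, so json.loads rejects the string, and ast.literal_eval sees bare names rather than literals.

```python
import ast
import json

raw = "{color: red, car: volkswagen}"

def try_parse(parser, text):
    """Return the parsed value, or the exception's class name if parsing fails."""
    try:
        return parser(text)
    except (ValueError, SyntaxError) as exc:
        return type(exc).__name__

# Both parsers fail on the unquoted form.
json_result = try_parse(json.loads, raw)
literal_result = try_parse(ast.literal_eval, raw)
print(json_result, literal_result)

# For contrast, the same data with quoted keys and values parses fine.
quoted_result = try_parse(json.loads, '{"color": "red", "car": "volkswagen"}')
print(quoted_result)
```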

However, if you know the keys ahead of time and can assume that the strings are always in this format, you should be able to use pyspark.sql.functions.regexp_extract. For example:

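from pyspark.sql.functions import regexp_extract

# Raw strings keep Python from treating \w as a string escape.
# Group index 0 returns the whole match (the look-arounds consume nothing).
df.withColumn("color", regexp_extract("info", r"(?<=color: )\w+(?=(,|}))", 0))\
    .withColumn("car", regexp_extract("info", r"(?<=car: )\w+(?=(,|}))", 0))\
    .show(truncate=False)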
#+----+---+-----------------------------+-----+----------+
#|name|age|info                         |color|car       |
#+----+---+-----------------------------+-----+----------+
#|rob |26 |{color: red, car: volkswagen}|red  |volkswagen|
#|evan|25 |{color: blue, car: mazda}    |blue |mazda     |
#+----+---+-----------------------------+-----+----------+

The pattern is:

  • (?<=color: ): A positive look-behind for the literal string "color: "
  • \w+: One or more word characters
  • (?=(,|})): A positive look-ahead for either a literal comma or close curly brace.
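If you want to try the pattern without a Spark session, a rough check with Python's re module looks like this. Note the assumption: Spark evaluates the pattern with Java's regex engine, not Python's, so this is only a stand-in. For this particular pattern the two should agree.

```python
import re

# Same pattern as above, written as a raw string.
pattern = r"(?<=color: )\w+(?=(,|}))"

rows = [
    "{color: red, car: volkswagen}",
    "{color: blue, car: mazda}",
]

# group(0) is the whole match, mirroring the index 0 passed to regexp_extract.
colors = [re.search(pattern, row).group(0) for row in rows]
print(colors)

# A key that is absent from the string gives no match at all.
missing = re.search(r"(?<=year: )\w+(?=(,|}))", rows[0])
print(missing)
```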

Here is how to generalize this for more than two keys, and handle the case where the key does not exist in the string.

from pyspark.sql.functions import regexp_extract, when, col
from functools import reduce

keys = ["color", "car", "year"]
# Raw string template; %s is filled in with each key name below.
pat = r"(?<=%s: )\w+(?=(,|}))"

df = reduce(
    lambda df, c: df.withColumn(
        c,
        when(
            col("info").rlike(pat%c),
            regexp_extract("info", pat%c, 0)
        )
    ),
    keys,
    df
)

df.drop("info").show(truncate=False)
#+----+---+-----+----------+----+
#|name|age|color|car       |year|
#+----+---+-----+----------+----+
#|rob |26 |red  |volkswagen|null|
#|evan|25 |blue |mazda     |null|
#+----+---+-----+----------+----+

In this case, we use pyspark.sql.functions.when and pyspark.sql.Column.rlike to check whether the string contains the pattern before we try to extract the match. Without this check, regexp_extract returns an empty string rather than null for a missing key.


If you don't know the keys ahead of time, you'll either have to write your own parser or try to modify the data upstream.
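As a rough idea of what such a parser could look like, here is a minimal sketch in plain Python. The function name parse_info is hypothetical, and it assumes that neither keys nor values ever contain a comma or a colon, which may not hold for your data.

```python
def parse_info(text):
    """Parse a string like '{color: red, car: volkswagen}' into a dict of strings.

    Assumption: keys and values never contain ',' or ':' themselves.
    """
    if text is None:
        return None
    body = text.strip()
    # Drop the surrounding curly braces if they are present.
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]
    result = {}
    for pair in body.split(","):
        key, sep, value = pair.partition(":")
        if sep:  # skip fragments that have no "key: value" separator
            result[key.strip()] = value.strip()
    return result

parsed = parse_info("{color: red, car: volkswagen}")
print(parsed)
```

One way to apply something like this in Spark would be to wrap it with pyspark.sql.functions.udf, using MapType(StringType(), StringType()) as the return type, and then pull the keys you find out of the resulting map column. A Python UDF is typically slower than the built-in regexp_extract approach above.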



Source: https://stackoverflow.com/questions/53072138/splitting-a-dictionary-in-a-pyspark-dataframe-into-individual-columns
