PySpark converting a column of type 'map' to multiple columns in a dataframe

走了就别回头了 2020-11-29 09:17

Input

I have a column Parameters of type map of the form:

>>> from pyspark.sql import SQLContext
>>> s         


        
2 answers
  •  一个人的身影
    2020-11-29 10:13

    Performant solution

    One of the question's constraints is to determine the column names dynamically, which is fine, but be warned that this can be really slow. If you already know the map keys, you can skip key discovery, avoid typing out each column by hand, and write code that executes quickly.

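    import pyspark.sql.functions as F

    # df is a DataFrame with a map column named "Parameters".
    # Build one aliased column per known key, then extract them all in one select.
    cols = list(map(
        lambda f: F.col("Parameters").getItem(f).alias(str(f)),
        ["foo", "bar", "baz"]))
    df.select(cols).show()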
    
    +---+---+---+
    |foo|bar|baz|
    +---+---+---+
    |  1|  2|aaa|
    +---+---+---+
    

    Notice that this runs a single select operation. Don't call withColumn once per key: every call adds another projection to the query plan, which is slower.

    The fast solution is only possible if you know all the map keys. You'll need to revert to the slower solution if you don't know all the unique values for the map keys.

    Slower solution

    The accepted answer is good. My solution is a bit more performant because it doesn't call .rdd or flatMap().

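    import pyspark.sql.functions as F

    # `spark` is the SparkSession that the pyspark shell provides.
    d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}]
    df = spark.createDataFrame(d)

    # Find the distinct map keys with DataFrame operations (no .rdd or flatMap()).
    keys_df = df.select(F.explode(F.map_keys(F.col("Parameters")))).distinct()
    # Collect the keys to the driver; this is the potentially slow step.
    keys = list(map(lambda row: row[0], keys_df.collect()))
    # Extract every key as its own column in a single select.
    key_cols = list(map(lambda f: F.col("Parameters").getItem(f).alias(str(f)), keys))
    df.select(key_cols).show()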
    
    +---+---+---+
    |bar|foo|baz|
    +---+---+---+
    |  2|  1|aaa|
    +---+---+---+
    

    Collecting results to the driver node can be a performance bottleneck. Run list(map(lambda row: row[0], keys_df.collect())) as a separate command so you can check that it is not running too slowly.

    This blog post covers this topic in more detail and shows how to estimate the expected performance by inspecting the logical plans.
