Pyspark string array of dynamic length in dataframe column to onehot-encoded

没有蜡笔的小新 2021-01-23 17:36

I would like to convert a column which contains strings like:

 ["ABC","def","ghi"]
 ["Jkl","ABC","def"]
 ["Xyz","ABC"]

Into a

2 answers
  •  天命终不由人
    2021-01-23 18:26

    You can probably use CountVectorizer. Below is an example.

    Update: I removed the step that dropped duplicates from the arrays. Instead, set binary=True when setting up CountVectorizer, so that every nonzero count is reported as 1:

    from pyspark.ml.feature import CountVectorizer
    from pyspark.sql.functions import udf, col
    
    df = spark.createDataFrame([
            (["ABC","def","ghi"],)
          , (["Jkl","ABC","def"],)
          , (["Xyz","ABC"],)
        ], ['arr']
    )
    

    Create the CountVectorizer model:

    cv = CountVectorizer(inputCol='arr', outputCol='c1', binary=True)
    
    model = cv.fit(df)
    
    vocabulary = model.vocabulary
    # [u'ABC', u'def', u'Xyz', u'ghi', u'Jkl']
    # (most frequent term first; the order of equally frequent terms may vary)
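
To make clear what the fitted model encodes, here is a minimal plain-Python sketch of the same idea: build a vocabulary ordered by term frequency, then mark each term as present (1) or absent (0). It is an illustration only, not Spark's implementation. The helper names `fit_vocabulary` and `multi_hot` are hypothetical, and the alphabetical tie-break is an assumption made so the result is deterministic.

```python
from collections import Counter


def fit_vocabulary(rows):
    """Return the distinct tokens, most frequent first."""
    counts = Counter(token for row in rows for token in row)
    # Assumption of this sketch: ties are broken alphabetically so that
    # the order is reproducible.
    return sorted(counts, key=lambda token: (-counts[token], token))


def multi_hot(row, vocabulary):
    """Return 1 for each vocabulary term present in the row, else 0."""
    present = set(row)  # a set ignores duplicates, like binary=True
    return [1 if term in present else 0 for term in vocabulary]


rows = [["ABC", "def", "ghi"], ["Jkl", "ABC", "def"], ["Xyz", "ABC"]]
vocabulary = fit_vocabulary(rows)
encoded = [multi_hot(row, vocabulary) for row in rows]

for row, vector in zip(rows, encoded):
    print(row, vector)
```

Because membership is checked against a set, a repeated token such as ["ABC", "ABC"] still yields a single 1, which is the effect binary=True has on the counts.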
    

    Create a UDF to convert the vector into an array of doubles:

    udf_to_array = udf(lambda v: v.toArray().tolist(), 'array<double>')
    

    Get the vector and check the content:

    df1 = model.transform(df)
    
    df1.withColumn('c2', udf_to_array('c1')) \
       .select('*', *[ col('c2')[i].astype('int').alias(vocabulary[i]) for i in range(len(vocabulary))]) \
       .show(3,0)
    +---------------+-------------------------+-------------------------+---+---+---+---+---+
    |arr            |c1                       |c2                       |ABC|def|Xyz|ghi|Jkl|
    +---------------+-------------------------+-------------------------+---+---+---+---+---+
    |[ABC, def, ghi]|(5,[0,1,3],[1.0,1.0,1.0])|[1.0, 1.0, 0.0, 1.0, 0.0]|1  |1  |0  |1  |0  |
    |[Jkl, ABC, def]|(5,[0,1,4],[1.0,1.0,1.0])|[1.0, 1.0, 0.0, 0.0, 1.0]|1  |1  |0  |0  |1  |
    |[Xyz, ABC]     |(5,[0,2],[1.0,1.0])      |[1.0, 0.0, 1.0, 0.0, 0.0]|1  |0  |1  |0  |0  |
    +---------------+-------------------------+-------------------------+---+---+---+---+---+
    
