PySpark: Replicate rows based on column value


Unfortunately you can't iterate over a Column like that. You can always use a udf, but I do have a non-udf hack solution that should work for you if you're using Spark version 2.1 or higher.

The trick is to take advantage of pyspark.sql.functions.posexplode() to get the index value. We do this by repeating a comma Column B times to build a string, splitting that string on the comma, and using posexplode to get the index. Splitting a string of B commas yields B+1 empty tokens at positions 0 through B, which is why the query below keeps only pos > 0.
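
For intuition, here is what that expression produces on its own for a count of 3 (a quick check, assuming a SparkSession is available as spark):

spark.sql('SELECT posexplode(split(repeat(",", 3), ","))').show()
#+---+---+
#|pos|col|
#+---+---+
#|  0|   |
#|  1|   |
#|  2|   |
#|  3|   |
#+---+---+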

df.createOrReplaceTempView("df")  # first register the DataFrame as a temp table

query = """
    SELECT `Column A`,
           `Column B`,
           pos AS Index
    FROM (
        SELECT DISTINCT
            `Column A`,
            `Column B`,
            posexplode(split(repeat(",", `Column B`), ","))
        FROM df
    ) AS a
    WHERE a.pos > 0
"""
newDF = sqlCtx.sql(query).sort("Column A", "Column B", "Index")
newDF.show()
#+--------+--------+-----+
#|Column A|Column B|Index|
#+--------+--------+-----+
#|      T1|       3|    1|
#|      T1|       3|    2|
#|      T1|       3|    3|
#|      T2|       2|    1|
#|      T2|       2|    2|
#+--------+--------+-----+

Note: You need to wrap the column names in backticks since they contain spaces, as explained in this post: How to express a column which name contains spaces in Spark SQL
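
For reference, the same trick can be written with the DataFrame API instead of a SQL string. This is only a rough sketch (not part of the original answer), assuming Spark 2.1+ and the same column names; expr() is used because the Python repeat() helper only accepts an integer count, and the result is called newDF2 here just for illustration:

from pyspark.sql import functions as F

# Sketch, not from the original answer: build the repeated-comma string with a
# SQL expression, split it on the comma, and posexplode to get the position;
# keep only pos > 0 and rename it to Index.
newDF2 = (
    df.dropDuplicates(["Column A", "Column B"])
      .select(
          "Column A",
          "Column B",
          F.posexplode(F.split(F.expr('repeat(",", `Column B`)'), ","))
      )
      .where("pos > 0")
      .select("Column A", "Column B", F.col("pos").alias("Index"))
      .sort("Column A", "Column B", "Index")
)
newDF2.show()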

You can also try this udf-based approach: add a per-row index with row_number, turn Column B into an array of n copies with a udf, and then explode that array:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import col, explode, row_number, udf
    from pyspark.sql.types import ArrayType, IntegerType

    df = spark.read.csv('/FileStore/tables/stack1.csv', header='True', inferSchema='True')

    # Add an index per original row (note: a Window with no partitionBy
    # pulls all the data into a single partition).
    w = Window.orderBy("Column A")
    df = df.select(row_number().over(w).alias("Index"), col("*"))

    # Turn Column B's value n into an array of n copies, then explode the
    # array so each original row appears n times.
    n_to_array = udf(lambda n: [n] * n, ArrayType(IntegerType()))
    df2 = df.withColumn('Column B', n_to_array('Column B'))
    df3 = df2.withColumn('Column B', explode('Column B'))
    df3.show()
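
With the two sample rows used above (T1 with Column B = 3 and T2 with Column B = 2), df3.show() should print something along these lines (note that this Index numbers the original rows, not the generated copies):

    +-----+--------+--------+
    |Index|Column A|Column B|
    +-----+--------+--------+
    |    1|      T1|       3|
    |    1|      T1|       3|
    |    1|      T1|       3|
    |    2|      T2|       2|
    |    2|      T2|       2|
    +-----+--------+--------+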