How to Split rows to different columns in Spark DataFrame/DataSet?


Question


Suppose I have data set like :

Name | Subject | Y1  | Y2 
A    | math    | 1998| 2000
B    |         | 1996| 1999
     | science | 2004| 2005

I want to expand each row into one row per year in the range from Y1 to Y2, eliminating the Y2 column, like:

Name | Subject | Y1
A    | math    | 1998
A    | math    | 1999
A    | math    | 2000
B    |         | 1996
B    |         | 1997
B    |         | 1998
B    |         | 1999
     | science | 2004
     | science | 2005

Can someone suggest something here? I hope I have made my question clear. Thanks in advance.


Answer 1:


I think you only need to create a UDF that builds the range of years. Then you can use explode to generate the necessary rows:

import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._  // for the $"col" syntax

// Build an inclusive list of years from Y1 to Y2 for each row.
val createRange = udf { (yearFrom: Int, yearTo: Int) =>
  (yearFrom to yearTo).toList
}

df.select($"Name", $"Subject", explode(createRange($"Y1", $"Y2")).as("Y1")).show()

EDIT: The Python version of this code would be something like:

from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, IntegerType

# The UDF returns a list, so the return type must be ArrayType(IntegerType()),
# and range() needs yearTo + 1 to include the final year.
createRange = udf(lambda yearFrom, yearTo: list(range(yearFrom, yearTo + 1)),
                  ArrayType(IntegerType()))

df.select(df["Name"], df["Subject"],
          explode(createRange(df["Y1"], df["Y2"])).alias("Y1")).show()



Answer 2:


I have tested this code in pyspark and it works as expected:

data = sc.parallelize([["A", "math", 1998, 2000],
                       ["B", "", 1996, 1999],
                       ["", "science", 2004, 2005]])

(data.map(lambda reg: ((reg[0], reg[1]), range(reg[2], reg[3] + 1)))
     .flatMapValues(lambda reg: reg)
     .collect())

In more detail, you need to convert the input data to a pair RDD of (key, value) tuples, where the key is composed of the first two fields, since flatMapValues flattens the values while keeping the key intact. The values are built as a range from Y1 to Y2 (inclusive). All of this is done in the first map.

flatMapValues then emits one record per value in the range, each paired with its key.

The output looks like this:

[(('A', 'math'), 1998),
 (('A', 'math'), 1999),
 (('A', 'math'), 2000),
 (('B', ''), 1996),
 (('B', ''), 1997),
 (('B', ''), 1998),
 (('B', ''), 1999),
 (('', 'science'), 2004),
 (('', 'science'), 2005)]
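If you need the result back as a DataFrame with the original column names, one possible follow-up (a sketch; the SparkSession variable spark is an assumption, not part of the original answer) is to flatten the key before calling createDataFrame:

# Flatten ((name, subject), year) pairs into (name, subject, year) rows.
rows = (data.map(lambda reg: ((reg[0], reg[1]), range(reg[2], reg[3] + 1)))
            .flatMapValues(lambda reg: reg)
            .map(lambda kv: (kv[0][0], kv[0][1], kv[1])))

spark.createDataFrame(rows, ["Name", "Subject", "Y1"]).show()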



Answer 3:


Here is a way in which you can implement this:

import spark.implicits._  // for toDF

val resultantDF = df.rdd.flatMap { row =>
  val rangeInitial = row.getInt(2)
  val rangeEnd = row.getInt(3)
  val years = (rangeInitial to rangeEnd).toList
  // Repeat Name and Subject once per year and zip them with the year range.
  (List.fill(years.size)(row.getString(0)),
   List.fill(years.size)(row.getString(1)),
   years).zipped.toList
}.toDF("Name", "Subject", "Y1")

resultantDF.show()



Answer 4:


You can easily use Spark's select to project the columns you want from a DataFrame:

Dataset<Row> sqlDF = spark.sql("SELECT Name,Subject,Y1 FROM tableName");

If you are starting from an already existing DataFrame, say usersDF, you can use something like this:

resultDF = usersDF.select("Name","Subject","Y1");
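Note that a plain projection only drops the Y2 column; it does not produce one row per year. On Spark 2.4+, the built-in sequence function combined with explode can perform the expansion directly in SQL. A sketch in PySpark for consistency with the earlier answers, assuming the data is registered as a temporary view named tableName:

# sequence(Y1, Y2) builds an inclusive array of years; explode emits one row per year.
df.createOrReplaceTempView("tableName")
spark.sql("SELECT Name, Subject, explode(sequence(Y1, Y2)) AS Y1 FROM tableName").show()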


Source: https://stackoverflow.com/questions/40586307/how-to-split-rows-to-different-columns-in-spark-dataframe-dataset
