Question
Suppose I have a data set like:
Name | Subject | Y1   | Y2
A    | math    | 1998 | 2000
B    |         | 1996 | 1999
     | science | 2004 | 2005
I want to expand the rows of this data set so that each row covers a single year from Y1 through Y2, eliminating the Y2 column, like:
Name | Subject | Y1
A    | math    | 1998
A    | math    | 1999
A    | math    | 2000
B    |         | 1996
B    |         | 1997
B    |         | 1998
B    |         | 1999
     | science | 2004
     | science | 2005
Can someone suggest something here? I hope I have made my query clear. Thanks in advance.
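For reference, a minimal sketch of this sample data as a Spark DataFrame (in PySpark, assuming an active SparkSession named spark; the answers below refer to it as df):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; empty strings stand in for the blank cells
df = spark.createDataFrame(
    [("A", "math", 1998, 2000),
     ("B", "", 1996, 1999),
     ("", "science", 2004, 2005)],
    ["Name", "Subject", "Y1", "Y2"])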
Answer 1:
I think you only need to create a UDF that builds the range. Then you can use explode to create the necessary rows:
import org.apache.spark.sql.functions.{explode, udf}

// UDF that expands a (yearFrom, yearTo) pair into the inclusive list of years
val createRange = udf { (yearFrom: Int, yearTo: Int) =>
  (yearFrom to yearTo).toList
}

// explode produces one row per list element; alias the new column back to Y1
df.select($"Name", $"Subject", explode(createRange($"Y1", $"Y2")).as("Y1")).show()
EDIT: The Python version of this code would be something like:
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, IntegerType

# The UDF must declare an array return type, and the range must include yearTo
createRange = udf(lambda yearFrom, yearTo: list(range(yearFrom, yearTo + 1)),
                  ArrayType(IntegerType()))

df.select("Name", "Subject", explode(createRange("Y1", "Y2")).alias("Y1")).show()
Answer 2:
I have tested this code in PySpark and it works as expected:

data = sc.parallelize([["A", "math", 1998, 2000],
                       ["B", "", 1996, 1999],
                       ["", "science", 2004, 2005]])

# Key each record by (Name, Subject), map the value to the inclusive year
# range, then flatten the range while keeping the key intact
(data.map(lambda reg: ((reg[0], reg[1]), range(reg[2], reg[3] + 1)))
     .flatMapValues(lambda reg: reg)
     .collect())
In more detail, you need to convert the input data to a pair RDD of the form (key, value), where the key is composed of the first two fields, since the result will be flattened by flatMapValues while keeping the key intact. The values are constructed as a range from Y1 to Y2, inclusive. All of this is done in the first map.
flatMapValues then returns each value in the range paired with its key.
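As a minimal sketch of flatMapValues in isolation (a toy pair, not the question's data, assuming a SparkContext named sc):

pairs = sc.parallelize([(("A", "math"), [1998, 1999, 2000])])
# Each element of the value list is re-paired with the unchanged key
pairs.flatMapValues(lambda years: years).collect()
# [(('A', 'math'), 1998), (('A', 'math'), 1999), (('A', 'math'), 2000)]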
The output of the full snippet looks like this:
[(('A', 'math'), 1998),
(('A', 'math'), 1999),
(('A', 'math'), 2000),
(('B', ''), 1996),
(('B', ''), 1997),
(('B', ''), 1998),
(('B', ''), 1999),
(('', 'science'), 2004),
(('', 'science'), 2005)]
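If a DataFrame with the original column names is needed rather than key/value pairs, one possible continuation (a sketch, assuming an active SparkSession so that toDF is available on the RDD) is to flatten the pairs before converting:

# Flatten ((Name, Subject), year) pairs into (Name, Subject, year) tuples
flat = (data.map(lambda reg: ((reg[0], reg[1]), range(reg[2], reg[3] + 1)))
            .flatMapValues(lambda reg: reg)
            .map(lambda kv: (kv[0][0], kv[0][1], kv[1])))

flat.toDF(["Name", "Subject", "Y1"]).show()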
Answer 3:
Here is a way in which you can implement this:
import spark.implicits._

val resultantDF = df.rdd.flatMap { row =>
  val rangeInitial = row.getInt(2)
  val rangeEnd = row.getInt(3)
  val years = (rangeInitial to rangeEnd).toList
  // Repeat Name and Subject once per year, then zip the three lists into rows
  (List.fill(years.size)(row.getString(0)),
   List.fill(years.size)(row.getString(1)),
   years).zipped.toList
}.toDF("Name", "Subject", "Y1")

resultantDF.show()
Answer 4:
You can easily use Spark's select to get what you want in a DataFrame, or even in an RDD.
Dataset<Row> sqlDF = spark.sql("SELECT Name,Subject,Y1 FROM tableName");
If you are starting from an already existing DataFrame, say usersDF, you can use something like this:
resultDF = usersDF.select("Name","Subject","Y1");
Source: https://stackoverflow.com/questions/40586307/how-to-split-rows-to-different-columns-in-spark-dataframe-dataset