How to add a new column to a Spark RDD?

Anonymous (unverified), submitted on 2019-12-03 03:10:03

Question:

I have an RDD with many columns (e.g., hundreds). How do I add one more column at the end of this RDD?

For example, if my RDD is like below:

    123, 523, 534, ..., 893
    536, 98, 1623, ..., 98472
    537, 89, 83640, ..., 9265
    7297, 98364, 9, ..., 735
    ......
    29, 94, 956, ..., 758

How can I add a column to it whose value is the sum of the second and third columns?

Thank you very much.

Answer 1:

You do not have to use Tuple* objects at all for adding a new column to an RDD.

It can be done by mapping each row, taking its original contents plus the elements you want to append, for example:

    import org.apache.spark.sql.Row

    val rdd = ... // assumed here to be an RDD[Row]
    val withAppendedColumnsRdd = rdd.map(row => {
      // Copy the existing columns into a List so new values can be appended
      val originalColumns = row.toSeq.toList
      val secondColValue = originalColumns(1).asInstanceOf[Int]
      val thirdColValue = originalColumns(2).asInstanceOf[Int]
      val newColumnValue = secondColValue + thirdColValue

      Row.fromSeq(originalColumns :+ newColumnValue)
      // Row.fromSeq(originalColumns ++ List(newColumnValue1, newColumnValue2, ...)) // or add several new columns
    })
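The per-row logic above does not depend on Spark, so it can be tried out on plain Scala collections first. The following is a minimal sketch, assuming each row is a `Seq[Int]`; the names `AppendColumn` and `withSum` are made up for illustration and are not part of any Spark API.

```scala
object AppendColumn {
  // Append the sum of the second and third columns (indices 1 and 2) to a row.
  // Assumes the row has at least three elements.
  def withSum(row: Seq[Int]): Seq[Int] = row :+ (row(1) + row(2))

  def main(args: Array[String]): Unit = {
    // Hypothetical sample rows, shortened to four columns each
    val rows = Seq(Seq(123, 523, 534, 893), Seq(536, 98, 1623, 98472))
    rows.map(withSum).foreach(r => println(r.mkString(", ")))
  }
}
```

If the RDD holds `Seq[Int]` rows rather than `Row` objects, the same function could be passed directly as `rdd.map(AppendColumn.withSum)`.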


Answer 2:

If you have an RDD of Tuple4, apply map and convert each element to a Tuple5:

    val rddTuple4RDD = ...........
    // Use the lambda parameter r (one tuple), not the RDD itself, to read the fields
    val rddTuple5RDD = rddTuple4RDD.map(r => Tuple5(r._1, r._2, r._3, r._4, r._2 + r._3))
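The tuple conversion can also be checked without a Spark cluster. This is a small sketch on an ordinary `Seq` of 4-tuples, assuming all fields are `Int`; `TupleAppend` and `widen` are hypothetical names chosen for this example.

```scala
object TupleAppend {
  // Turn a 4-tuple into a 5-tuple whose last field is the sum of fields 2 and 3.
  def widen(t: (Int, Int, Int, Int)): (Int, Int, Int, Int, Int) =
    (t._1, t._2, t._3, t._4, t._2 + t._3)

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data with exactly four columns
    val rows = Seq((123, 523, 534, 893), (536, 98, 1623, 98472))
    rows.map(widen).foreach(println)
  }
}
```

Note that the `Tuple1` to `Tuple22` classes in Scala 2 stop at 22 elements, so this style suits narrow rows only; for the hundreds of columns mentioned in the question, a sequence-based row is the more practical choice.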

