Spark Scala 2.10 tuple limit

Question


I have a DataFrame with 66 columns to process (almost every column value needs to be changed in some way), so I'm running the following statement

    val result = data.map(row => (
        modify(row.getString(row.fieldIndex("XX"))),
        (...)
    ))

up to the 66th column. Since Scala 2.10 limits tuples to 22 elements, I cannot do it this way. Is there any workaround? After all the per-row operations I convert the result to a DataFrame with specific column names:

   result.toDf("c1",...,"c66")
   result.storeAsTempTable("someFancyResult")

"modify" function is just an example to show my point


Answer 1:


If all you are doing is modifying values of an existing DataFrame, it is better to use a UDF than to map over an RDD:

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._  // provides the $"col" column syntax

val modifyUdf = udf(modify _)  // assumes modify is a method taking and returning String
data.withColumn("c1", modifyUdf($"c1"))

If for some reason the above doesn't fit your needs, the simplest thing you can do is to recreate the DataFrame from an RDD[Row], for example like this:

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}


val result: RDD[Row] = data.map(row => {
  val buffer = ArrayBuffer.empty[Any]

  // Add value to buffer
  buffer.append(modify(row.getAs[String]("c1")))

  // ... repeat for other values

  // Build row
  Row.fromSeq(buffer)
})

// Create schema
val schema = StructType(Seq(
  StructField("c1", StringType, false),
  // ...  
  StructField("c66", StringType, false)
))

sqlContext.createDataFrame(result, schema)
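
If the same transformation applies to every column, the 66 manual appends can also be generated. A sketch under the same assumption that all columns are strings (anything else is passed through unchanged), reusing the input's schema since the column layout does not change:

val result: RDD[Row] = data.map { row =>
  Row.fromSeq(row.toSeq.map {
    case s: String => modify(s)
    case other     => other
  })
}

sqlContext.createDataFrame(result, data.schema)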



Answer 2:


The workaround is pretty fiddly, but it does work. Try this sample code to get you started; as you can see, more than 22 columns are being accessed:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SimpleApp {
  class Record(val x1: String, val x2: String, val x3: String, ..., val x24: String) extends Product with Serializable {
    def canEqual(that: Any) = that.isInstanceOf[Record]

    def productArity = 24

    def productElement(n: Int) = n match {
      case 0 => x1
      case 1 => x2
      case 2 => x3
      ...
      case 23 => x24
    }
  }

  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("Product Test")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val record = new Record("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x")

    import sqlContext._
    sc.parallelize(record :: Nil).registerAsTable("records")

    sql("SELECT x1 FROM records").collect()
  }
}
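
This works because Spark's schema inference only needs a Product with a TypeTag; case classes are merely the most common Products, and in Scala 2.10 they are capped at 22 fields (a cap lifted in 2.11), which is exactly what the hand-written productArity/productElement sidesteps. On Spark 1.3+, where registerAsTable was renamed, registering the table would presumably look like this sketch:

import sqlContext.implicits._
sc.parallelize(record :: Nil).toDF().registerTempTable("records")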


Source: https://stackoverflow.com/questions/33826495/spark-scala-2-10-tuple-limit
