Spark Scala 2.10 tuple limit


If all you do is modify values in an existing DataFrame, it is better to use a UDF instead of mapping over an RDD:

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._  // provides the $"..." column syntax

// `modify` is assumed to be your existing String => String function
val modifyUdf = udf(modify)
data.withColumn("c1", modifyUdf($"c1"))
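Since the data here has far more than 22 columns, the same UDF can also be applied to many columns without writing each withColumn call by hand. A minimal sketch, assuming the same modify: String => String function and a hypothetical list of target column names:

import org.apache.spark.sql.functions.udf

val modifyUdf = udf(modify)

// Hypothetical list of columns to transform; fold withColumn over it
val targetColumns = Seq("c1", "c2", "c3")
val transformed = targetColumns.foldLeft(data) {
  (df, name) => df.withColumn(name, modifyUdf(df(name)))
}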

If for some reason the above doesn't fit your needs, the simplest thing you can do is to recreate the DataFrame from an RDD[Row], for example like this:

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}


val result: RDD[Row] = data.map(row => {
  val buffer = ArrayBuffer.empty[Any]

  // Add value to buffer
  buffer.append(modify(row.getAs[String]("c1")))

  // ... repeat for other values

  // Build row
  Row.fromSeq(buffer)
})

// Create schema
val schema = StructType(Seq(
  StructField("c1", StringType, false),
  // ...  
  StructField("c66", StringType, false)
))

sqlContext.createDataFrame(result, schema)
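Writing out all 66 getAs calls and StructFields by hand is tedious. A minimal sketch of a more generic variant, assuming every column is a string, the same modify: String => String function, and a hypothetical set of columns to transform; it rebuilds each Row from the existing column names and reuses the original schema:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Capture the field names outside the closure so the DataFrame itself
// is not pulled into the task
val fieldNames = data.schema.fieldNames
val columnsToModify = Set("c1", "c7", "c42")  // hypothetical target columns

val result: RDD[Row] = data.map { row =>
  val values = fieldNames.map { name =>
    val v = row.getAs[String](name)
    if (columnsToModify.contains(name)) modify(v) else v
  }
  Row.fromSeq(values)
}

// Only the values change, so the original schema can be reused as-is
sqlContext.createDataFrame(result, data.schema)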

The way around it is pretty fiddly, but it does work. Try this sample code to get you started; you can see there are more than 22 columns being accessed:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SimpleApp {
  // Fields x4 through x23 are elided for brevity; the real class declares all 24
  class Record(val x1: String, val x2: String, val x3: String, ... val x24: String) extends Product with Serializable {
    def canEqual(that: Any) = that.isInstanceOf[Record]

    def productArity = 24

    def productElement(n: Int) = n match {
      case 0 => x1
      case 1 => x2
      case 2 => x3
      ...
      case 23 => x24
    }
  }

  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("Product Test")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val record = new Record("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x")

    import sqlContext._
    sc.parallelize(record :: Nil).registerAsTable("records")

    sql("SELECT x1 FROM records").collect()
  }
}
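The trick is that the implicit conversion brought in by import sqlContext._ only requires the RDD's element type to be a Product, not a case class or tuple, so a hand-written Product implementation can have arbitrary arity. Once the table is registered, columns beyond the 22nd can be queried like any other, e.g. with another statement inside main:

sql("SELECT x23, x24 FROM records").collect().foreach(println)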