Extracting `Seq[(String, String, String)]` from a Spark DataFrame

Asked by 野性不改 on 2020-12-25 07:48

I have a Spark DataFrame with rows of Seq[(String, String, String)]. I'm trying to do some kind of a flatMap with that, but anything I try ends up throwing an error.

2 Answers
  • 2020-12-25 08:38

    Well, it doesn't claim that it is a tuple. It claims it is a struct which maps to Row:

    import org.apache.spark.sql.Row
    import spark.implicits._  // for .toDF and the $ column syntax; pre-imported in spark-shell

    case class Feature(lemma: String, pos_tag: String, ne_tag: String)
    case class Record(id: Long, content_processed: Seq[Feature])

    val df = Seq(
      Record(1L, Seq(
        Feature("ancient", "jj", "o"),
        Feature("olympia_greece", "nn", "location")
      ))
    ).toDF

    // each element of the array column comes back as a Row, not a tuple
    val content = df.select($"content_processed").rdd.map(_.getSeq[Row](0))
    

    You'll find exact mapping rules in the Spark SQL programming guide.
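
    For illustration, this is roughly what df.printSchema() prints for the DataFrame above (sketched from those mapping rules): the Seq of case classes becomes an array of structs.

    df.printSchema()
    // root
    //  |-- id: long (nullable = false)
    //  |-- content_processed: array (nullable = true)
    //  |    |-- element: struct (containsNull = true)
    //  |    |    |-- lemma: string (nullable = true)
    //  |    |    |-- pos_tag: string (nullable = true)
    //  |    |    |-- ne_tag: string (nullable = true)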

    Since Row is not exactly a pretty structure, you'll probably want to map it to something useful:

    // pattern match each struct Row back into a tuple
    content.map(_.map {
      case Row(lemma: String, pos_tag: String, ne_tag: String) =>
        (lemma, pos_tag, ne_tag)
    })
    

    or:

    // or extract the struct fields by name
    content.map(_.map(row => (
      row.getAs[String]("lemma"),
      row.getAs[String]("pos_tag"),
      row.getAs[String]("ne_tag")
    )))
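
    Either way you end up with an RDD[Seq[(String, String, String)]], so the flatMap from the question is just a flatten away (a sketch; tuples is a hypothetical name for either mapped RDD above):

    val flat = tuples.flatMap(identity)  // RDD[(String, String, String)]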
    

    Finally, a slightly more concise approach with Datasets:

    df.as[Record].rdd.map(_.content_processed)
    

    or

    df.select($"content_processed").as[Seq[(String, String, String)]]
    

    although this seems to be slightly buggy at this moment.

    There is an important difference between the first approach (Row.getAs) and the second one (Dataset.as). The former extracts objects as Any and applies asInstanceOf; the latter uses encoders to transform between internal types and the desired representation.
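
    To make that difference concrete, here is a minimal sketch (the getAs cast itself "succeeds" because of type erasure; the ClassCastException only surfaces once an element is actually used as a tuple):

    // Row.getAs is a plain asInstanceOf, so erasure lets the cast through unchecked
    val cast = df.rdd.map(_.getAs[Seq[(String, String, String)]]("content_processed"))
    // using an element as a tuple then throws something like:
    // ClassCastException: GenericRowWithSchema cannot be cast to scala.Tuple3
    // cast.first().head._1

    // Dataset.as goes through an Encoder, which really converts the internal rows
    val typed = df.as[Record].rdd.map(_.content_processed)  // RDD[Seq[Feature]]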

  • 2020-12-25 08:44
    import org.apache.spark.sql.{Row, SparkSession}

    object ListSerdeTest extends App {

      implicit val spark: SparkSession = SparkSession
        .builder
        .master("local[2]")
        .getOrCreate()

      import spark.implicits._

      val myDS = spark.createDataset(
        Seq(
          MyCaseClass(mylist = Array(("asd", "aa"), ("dd", "ee")))
        )
      )

      // tuples are encoded as nested structs, visible in the schema
      myDS.toDF().printSchema()

      // each nested struct comes back as a Row, so pattern match it
      myDS.toDF().foreach(
        row => {
          row.getSeq[Row](row.fieldIndex("mylist"))
            .foreach {
              case Row(a, b) => println((a, b))
            }
        }
      )
    }

    case class MyCaseClass(
      mylist: Seq[(String, String)]
    )
    

    The code above is yet another way to deal with the nested structure. Spark's default Encoder encodes TupleX as a nested struct, which is why you are seeing this strange behaviour. And, as others said in the comments, you can't just do getAs[T](), since it's just a cast (x.asInstanceOf[T]) and will therefore give you runtime exceptions.
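
    For reference, the printSchema() call in the snippet prints roughly the following; the tuples surface as structs with _1/_2 fields:

    // root
    //  |-- mylist: array (nullable = true)
    //  |    |-- element: struct (containsNull = true)
    //  |    |    |-- _1: string (nullable = true)
    //  |    |    |-- _2: string (nullable = true)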
