Write and read raw byte arrays in Spark using SequenceFile

一生所求 2020-12-17 23:53

How do you write RDD[Array[Byte]] to a file using Apache Spark and read it back again?

2 Answers
  •  青春惊慌失措
    2020-12-18 00:09

    Here is a snippet, with all the required imports, that you can run directly from spark-shell, as requested by @Choix:

    import org.apache.hadoop.io.BytesWritable
    import org.apache.hadoop.io.NullWritable
    
    val path = "/tmp/path"
    
    // A SequenceFile stores key-value records, so wrap each payload in a
    // BytesWritable and pair it with a NullWritable key (which serializes
    // to zero bytes).
    val rdd = sc.parallelize(List("foo"))
    val bytesRdd = rdd.map { str => (NullWritable.get, new BytesWritable(str.getBytes)) }
    bytesRdd.saveAsSequenceFile(path)
    
    // copyBytes() matters here: getBytes returns the backing buffer, which
    // may be padded past the record's real length, and Hadoop reuses the
    // same Writable instance across records.
    val recovered = sc.sequenceFile[NullWritable, BytesWritable](path).map(_._2.copyBytes())
    val recoveredAsString = recovered.map(new String(_))
    recoveredAsString.collect()
    // result is: Array[String] = Array(foo)
    
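    Since the question asks about RDD[Array[Byte]] specifically, here is a minimal sketch of the same round trip starting from raw byte arrays instead of strings. The sample data and the /tmp/bytes-path output directory are illustrative placeholders, not part of the original answer.
    
    import org.apache.hadoop.io.{BytesWritable, NullWritable}
    
    // Hypothetical starting point: an RDD of raw byte arrays.
    val byteArrays = sc.parallelize(Seq(Array[Byte](1, 2, 3), Array[Byte](4, 5)))
    
    // Write: wrap each array in a BytesWritable, keyed by NullWritable.
    byteArrays
      .map(bytes => (NullWritable.get, new BytesWritable(bytes)))
      .saveAsSequenceFile("/tmp/bytes-path")
    
    // Read back: copyBytes() gives a fresh, exactly-sized Array[Byte]
    // for each record.
    val roundTripped = sc.sequenceFile[NullWritable, BytesWritable]("/tmp/bytes-path")
      .map(_._2.copyBytes())
    
    roundTripped.collect().map(_.mkString(",")) // Array("1,2,3", "4,5")
    
    Because NullWritable serializes to nothing, only the byte arrays themselves end up in the file; this keeps the output compact while still being a standard Hadoop SequenceFile.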
