Write and read raw byte arrays in Spark using SequenceFile

Asked 2020-12-17 23:53

How do you write RDD[Array[Byte]] to a file using Apache Spark and read it back again?

2 Answers
  • 2020-12-18 00:09

    Here is a snippet, with all required imports, that you can run from spark-shell, as requested by @Choix:

    import org.apache.hadoop.io.BytesWritable
    import org.apache.hadoop.io.NullWritable
    
    val path = "/tmp/path"
    
    // Write: pair each record with a NullWritable key and save as a SequenceFile
    val rdd = sc.parallelize(List("foo"))
    val bytesRdd = rdd.map { str => (NullWritable.get, new BytesWritable(str.getBytes)) }
    bytesRdd.saveAsSequenceFile(path)
    
    // Read: copyBytes() returns exactly the valid bytes of each value
    val recovered = sc.sequenceFile[NullWritable, BytesWritable](path).map(_._2.copyBytes())
    val recoveredAsString = recovered.map(new String(_))
    recoveredAsString.collect()
    // result is: Array[String] = Array(foo)
    
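    If you are not in spark-shell (where sc is predefined), you first need a SparkContext. A minimal setup sketch; the app name and local master below are illustrative assumptions:
    
    import org.apache.spark.{SparkConf, SparkContext}
    
    val conf = new SparkConf().setAppName("bytes-sequencefile").setMaster("local[*]")
    val sc = new SparkContext(conf)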
  • 2020-12-18 00:13

    A common problem is a confusing cannot-cast exception from BytesWritable to NullWritable. Another common pitfall is that BytesWritable.getBytes does not give you just your bytes: it returns the whole backing array, which is zero-padded beyond the valid data. You have to use copyBytes() instead, as in the read/write snippet below.
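    A small sketch (no Spark needed) showing the padding; setCapacity is used here only to mimic Hadoop growing or reusing the backing buffer:
    
    import org.apache.hadoop.io.BytesWritable
    
    val bw = new BytesWritable("foo".getBytes("UTF-8"))
    bw.setCapacity(10)      // simulate a reused/grown backing buffer
    bw.getLength            // 3  -- number of valid bytes
    bw.getBytes.length      // 10 -- whole backing array, zero-padded past getLength
    bw.copyBytes().length   // 3  -- exactly the valid bytes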

    import org.apache.hadoop.io.{BytesWritable, NullWritable}
    import org.apache.spark.rdd.RDD
    
    val rdd: RDD[Array[Byte]] = ???
    
    // To write: codecOpt is an Option[Class[_ <: CompressionCodec]], None for no compression
    rdd.map(bytesArray => (NullWritable.get(), new BytesWritable(bytesArray)))
      .saveAsSequenceFile("/output/path", codecOpt)
    
    // To read: copyBytes() trims the zero padding that getBytes would keep
    val readBack: RDD[Array[Byte]] = sc.sequenceFile[NullWritable, BytesWritable]("/input/path")
      .map(_._2.copyBytes())
    
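    For reference, codecOpt above is the optional compression-codec argument of saveAsSequenceFile; one possible value (GzipCodec is just an example choice) is:
    
    import org.apache.hadoop.io.compress.GzipCodec
    
    val codecOpt = Some(classOf[GzipCodec])   // or None for an uncompressed file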