Write and read raw byte arrays in Spark using SequenceFile

一生所求 2020-12-17 23:53

How do you write RDD[Array[Byte]] to a file using Apache Spark and read it back again?

2 Answers
  •  一生所求
    2020-12-18 00:13

    A common problem is getting a strange cannot-cast exception from BytesWritable to NullWritable. Another common problem is that BytesWritable.getBytes does not return just your bytes: it returns the whole backing buffer, which is padded with zeros at the end. You have to use copyBytes instead.

    import org.apache.hadoop.io.{BytesWritable, NullWritable}
    import org.apache.spark.rdd.RDD

    val rdd: RDD[Array[Byte]] = ???
    
    // To write: pair each byte array with a NullWritable key
    rdd.map(bytesArray => (NullWritable.get(), new BytesWritable(bytesArray)))
      .saveAsSequenceFile("/output/path", codecOpt) // codecOpt: Option[Class[_ <: CompressionCodec]]
    
    // To read: use copyBytes, not getBytes, to avoid trailing zero padding
    val loaded: RDD[Array[Byte]] = sc.sequenceFile[NullWritable, BytesWritable]("/input/path")
      .map(_._2.copyBytes())
    
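    The padding behaviour can be seen on a single BytesWritable without Spark at all. This is a minimal sketch, assuming only hadoop-common on the classpath; the exact padded length is an implementation detail of how BytesWritable grows its buffer, so only the inequality is relied on here:

    import org.apache.hadoop.io.BytesWritable

    val bw = new BytesWritable()
    bw.set(Array[Byte](1, 2, 3), 0, 3)
    // getBytes exposes the whole backing buffer, which may be longer
    // than the valid data and padded with trailing zeros
    assert(bw.getBytes.length >= 3)
    // copyBytes copies exactly the valid range: length 3, no padding
    assert(bw.copyBytes().length == 3)

    The same thing happens when Spark deserializes records from a sequence file: the reader reuses and grows the writable's buffer, so getBytes on the deserialized value will generally carry that zero padding, while copyBytes always returns exactly the bytes you wrote.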
