Trying to deserialize Avro in Spark with specific type


Question


I have some Avro classes that I generated, and am now trying to use them in Spark. I imported my Avro-generated Java class, "twitter_schema", and refer to it when I deserialize. It seems to work, but I get a cast exception at the end.

My Schema:

$ more twitter.avsc

{
  "type" : "record",
  "name" : "twitter_schema",
  "namespace" : "com.miguno.avro",
  "fields" : [
    { "name" : "username",  "type" : "string", "doc" : "Name of the user account on Twitter.com" },
    { "name" : "tweet",     "type" : "string", "doc" : "The content of the user's Twitter message" },
    { "name" : "timestamp", "type" : "long",   "doc" : "Unix epoch time in seconds" }
  ],
  "doc" : "A basic schema for storing Twitter messages"
}

My code:

import java.nio.ByteBuffer
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapred.AvroInputFormat
import org.apache.avro.mapred.AvroWrapper
import org.apache.avro.file.DataFileReader
import org.apache.avro.file.DataFileWriter
import org.apache.avro.io.DatumReader
import org.apache.avro.io.DatumWriter
import org.apache.avro.specific.SpecificDatumReader
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import com.miguno.avro.twitter_schema

val path = "/app/avro/data/twitter.avro"
val conf = new Configuration
// First attempt, via the new Hadoop API:
var avroRDD = sc.newAPIHadoopFile(path, classOf[AvroKeyInputFormat[twitter_schema]],
  classOf[AvroKey[ByteBuffer]], classOf[NullWritable], conf)
// Second attempt, via the old Hadoop API (in the REPL this redefinition replaces the one above):
var avroRDD = sc.hadoopFile(path, classOf[AvroInputFormat[twitter_schema]],
  classOf[AvroWrapper[twitter_schema]], classOf[NullWritable], 5)

avroRDD.map(l => {
  // transformations here
  new String(l._1.datum.username)
}).first

And I get an error on the last line:

scala> avroRDD.map(l => { 
     |       new String(l._1.datum.username)}).first
<console>:30: error: overloaded method constructor String with alternatives:
  (x$1: StringBuilder)String <and>
  (x$1: StringBuffer)String <and>
  (x$1: Array[Byte])String <and>
  (x$1: Array[Char])String <and>
  (x$1: String)String
 cannot be applied to (CharSequence)
                    new String(l._1.datum.username)}).first

What am I doing wrong? I don't understand the error. Is this the right way to deserialize? I read about Kryo, but it seems to add complexity; I also read that the Spark SQL context accepts Avro as of 1.2, but that sounds like a performance workaround. What are the best practices here?

Thanks, Matt


Answer 1:


I think your problem is that Avro deserialized the string into a CharSequence, but Spark expected a Java String. Avro has three ways to deserialize a string in Java: into a CharSequence, into a String, and into a Utf8 (Avro's class for storing strings, somewhat like Hadoop's Text).

You control that by adding the "avro.java.string" property to your Avro schema. Possible values are (case-sensitive): "String", "CharSequence", "Utf8". There may be a way to control that dynamically through the input format as well, but I don't know the details.
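For example, an untested sketch of that schema change (the rest of the schema stays as above; you would then regenerate the Java class so the compiled record uses java.lang.String): the username field's string type would carry the property like this:

{ "name" : "username",
  "type" : { "type" : "string", "avro.java.string" : "String" },
  "doc"  : "Name of the user account on Twitter.com" }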




Answer 2:


OK: since String implements the CharSequence interface, I can keep my Avro schema the way it was and just turn the Avro string into a String via toString(), i.e.:

scala> avroRDD.map(l => {
     | new String(l._1.datum.get("username").toString())
     | } ).first
res2: String = miguno
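As a side note, the new String(...) wrapper is redundant here, because toString() already returns a java.lang.String. A minimal equivalent (an untested sketch against the same avroRDD as above) is:

scala> avroRDD.map(l => l._1.datum.get("username").toString).first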


Source: https://stackoverflow.com/questions/27827649/trying-to-deserialize-avro-in-spark-with-specific-type
