Trying to deserialize Avro in Spark with specific type


Question


I have some Avro classes that I generated, and am now trying to use them in Spark. I imported my Avro-generated Java class, "twitter_schema", and refer to it when I deserialize. It seems to work, but I get a cast exception at the end.

My Schema:

$ more twitter.avsc

{
  "type" : "record",
  "name" : "twitter_schema",
  "namespace" : "com.miguno.avro",
  "fields" : [
    { "name" : "username",  "type" : "string", "doc" : "Name of the user account on Twitter.com" },
    { "name" : "tweet",     "type" : "string", "doc" : "The content of the user's Twitter message" },
    { "name" : "timestamp", "type" : "long",   "doc" : "Unix epoch time in seconds" }
  ],
  "doc" : "A basic schema for storing Twitter messages"
}

My code:

import java.nio.ByteBuffer
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapred.AvroInputFormat
import org.apache.avro.mapred.AvroWrapper
import org.apache.avro.file.DataFileReader
import org.apache.avro.file.DataFileWriter
import org.apache.avro.io.DatumReader
import org.apache.avro.io.DatumWriter
import org.apache.avro.specific.SpecificDatumReader
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import com.miguno.avro.twitter_schema

val path = "/app/avro/data/twitter.avro"
val conf = new Configuration
// First attempt, via the new Hadoop API:
var avroRDD = sc.newAPIHadoopFile(path, classOf[AvroKeyInputFormat[twitter_schema]],
  classOf[AvroKey[ByteBuffer]], classOf[NullWritable], conf)
// Second attempt, via the old Hadoop API (in the REPL this redefinition replaces the one above):
var avroRDD = sc.hadoopFile(path, classOf[AvroInputFormat[twitter_schema]],
  classOf[AvroWrapper[twitter_schema]], classOf[NullWritable], 5)

avroRDD.map(l => {
  // transformations here
  new String(l._1.datum.username)
}).first

And I get an error on the last line:

scala> avroRDD.map(l => { 
     |       new String(l._1.datum.username)}).first
<console>:30: error: overloaded method constructor String with alternatives:
  (x$1: StringBuilder)String <and>
  (x$1: StringBuffer)String <and>
  (x$1: Array[Byte])String <and>
  (x$1: Array[Char])String <and>
  (x$1: String)String
 cannot be applied to (CharSequence)
                    new String(l._1.datum.username)}).first

What am I doing wrong? I don't understand the error. Is this the right way to deserialize? I read about Kryo, but it seems to add complexity; I also read that the Spark SQL context accepts Avro as of 1.2, but that sounds like a performance workaround. What are the best practices here?

Thanks, Matt


Answer 1:


I think your problem is that Avro deserialized the string into a CharSequence, but Spark expected a Java String. Avro has three ways to deserialize a string in Java: into a CharSequence, into a String, and into a Utf8 (Avro's class for storing strings, somewhat like Hadoop's Text).

You control that by adding the "avro.java.string" property to your Avro schema. Possible values are (case-sensitive): "String", "CharSequence", "Utf8". There may be a way to control that dynamically through the input format as well, but I don't know the details.
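For example, an untested sketch of that schema change (the rest of the schema stays as above; you would then regenerate the Java class so the compiled record uses java.lang.String): the username field's string type would carry the property like this:

{ "name" : "username",
  "type" : { "type" : "string", "avro.java.string" : "String" },
  "doc"  : "Name of the user account on Twitter.com" }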




Answer 2:


OK: since String implements the CharSequence interface, I can keep my Avro schema the way it was and just turn the Avro string into a String via toString(), i.e.:

scala> avroRDD.map(l => {
     | new String(l._1.datum.get("username").toString())
     | } ).first
res2: String = miguno
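As a side note, the new String(...) wrapper is redundant here, because toString() already returns a java.lang.String. A minimal equivalent (an untested sketch against the same avroRDD as above) is:

scala> avroRDD.map(l => l._1.datum.get("username").toString).first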


Source: https://stackoverflow.com/questions/27827649/trying-to-deserialize-avro-in-spark-with-specific-type
