Spark - write Avro file

Submitted by Deadly on 2019-12-07 07:03:33

Question


What are the common practices to write Avro files with Spark (using Scala API) in a flow like this:

  1. parse some logs files from HDFS
  2. for each log file apply some business logic and generate Avro file (or maybe merge multiple files)
  3. write Avro files to HDFS
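
Step 1 of the flow above can be sketched in plain Scala before any Spark machinery is involved. The log format, the `LogEntry` case class, and the `parseLine` helper are all assumptions for illustration; adapt them to your actual log layout:

```scala
// Hypothetical log line layout: "date time level message"
case class LogEntry(date: String, time: String, level: String, message: String)

// Split into at most 4 parts so the free-text message may contain spaces.
def parseLine(line: String): Option[LogEntry] =
  line.split(" ", 4) match {
    case Array(d, t, lvl, msg) => Some(LogEntry(d, t, lvl, msg))
    case _                     => None // malformed line: drop it
  }
```

Mapping such a parser over `sc.textFile(inputPath)` and then converting the result to a DataFrame is the usual way to get from raw logs to something spark-avro can write.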

I tried to use spark-avro, but it doesn't help much.

val someLogs = sc.textFile(inputPath)

val rowRDD = someLogs.map { line =>
  createRow(...)
}

val sqlContext = new SQLContext(sc)
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
dataFrame.write.avro(outputPath)

This fails with error:

org.apache.spark.sql.AnalysisException: 
      Reference 'StringField' is ambiguous, could be: StringField#0, StringField#1, StringField#2, StringField#3, ...
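
The `AnalysisException` suggests the schema passed to `createDataFrame` contains several fields with the same name (`StringField`), so column references cannot be resolved. One hedged fix, before building the schema, is to make field names unique; the `dedupeFieldNames` helper below is a hypothetical sketch in plain Scala:

```scala
// Hypothetical helper: append an index to repeated field names so every
// column name in the schema is unique, e.g. StringField, StringField_1, ...
def dedupeFieldNames(names: Seq[String]): Seq[String] = {
  val counts = scala.collection.mutable.Map.empty[String, Int]
  names.map { n =>
    val seen = counts.getOrElse(n, 0)
    counts(n) = seen + 1
    if (seen == 0) n else s"${n}_$seen"
  }
}
```

Feeding the deduplicated names into your `StructField` definitions should make the ambiguity error go away, regardless of which write path you use afterwards.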

Answer 1:


The Databricks-provided spark-avro library handles reading and writing Avro data:

dataframe.write.format("com.databricks.spark.avro").save(outputPath)



Answer 2:


Spark 2 and Scala 2.11

import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()

// Perform your transformations, producing a DataFrame (here called dataFrame)

dataFrame.write.avro("/tmp/output")

Maven dependency

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>4.0.0</version> 
</dependency>
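
For sbt-based builds, the equivalent of the Maven dependency above would presumably be:

```
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
```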



Answer 3:


You need to start the Spark shell with the spark-avro package included (recommended for older Spark versions):

$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0

Then use the DataFrame to write an Avro file:

dataframe.write.format("com.databricks.spark.avro").save(outputPath)

And to write it as an Avro table in Hive (note the table name must be a string):

dataframe.write.format("com.databricks.spark.avro").saveAsTable("hivedb.hivetable_avro")


Source: https://stackoverflow.com/questions/33878433/spark-write-avro-file
