Spark - write Avro file

Submitted by Deadly on 2019-12-07 07:03:33

Question


What are the common practices to write Avro files with Spark (using Scala API) in a flow like this:

  1. parse some logs files from HDFS
  2. for each log file apply some business logic and generate Avro file (or maybe merge multiple files)
  3. write Avro files to HDFS
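
Step 1 of the flow above can be sketched in plain Scala before any Spark machinery is involved. The log format, the `LogEntry` case class, and the `parseLine` helper are all assumptions for illustration; adapt them to your actual log layout:

```scala
// Hypothetical log line layout: "date time level message"
case class LogEntry(date: String, time: String, level: String, message: String)

// Split into at most 4 parts so the free-text message may contain spaces.
def parseLine(line: String): Option[LogEntry] =
  line.split(" ", 4) match {
    case Array(d, t, lvl, msg) => Some(LogEntry(d, t, lvl, msg))
    case _                     => None // malformed line: drop it
  }
```

Mapping such a parser over `sc.textFile(inputPath)` and then converting the result to a DataFrame is the usual way to get from raw logs to something spark-avro can write.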

I tried to use spark-avro, but it doesn't help much.

val someLogs = sc.textFile(inputPath)

val rowRDD = someLogs.map { line =>
  createRow(...)
}

val sqlContext = new SQLContext(sc)
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
dataFrame.write.avro(outputPath)

This fails with error:

org.apache.spark.sql.AnalysisException: 
      Reference 'StringField' is ambiguous, could be: StringField#0, StringField#1, StringField#2, StringField#3, ...
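
The `AnalysisException` suggests the schema passed to `createDataFrame` contains several fields with the same name (`StringField`), so column references cannot be resolved. One hedged fix, before building the schema, is to make field names unique; the `dedupeFieldNames` helper below is a hypothetical sketch in plain Scala:

```scala
// Hypothetical helper: append an index to repeated field names so every
// column name in the schema is unique, e.g. StringField, StringField_1, ...
def dedupeFieldNames(names: Seq[String]): Seq[String] = {
  val counts = scala.collection.mutable.Map.empty[String, Int]
  names.map { n =>
    val seen = counts.getOrElse(n, 0)
    counts(n) = seen + 1
    if (seen == 0) n else s"${n}_$seen"
  }
}
```

Feeding the deduplicated names into your `StructField` definitions should make the ambiguity error go away, regardless of which write path you use afterwards.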

Answer 1:


The Databricks-provided spark-avro library handles reading and writing Avro data:

dataframe.write.format("com.databricks.spark.avro").save(outputPath)



Answer 2:


Spark 2 and Scala 2.11

import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()

// Perform your transformations, producing a DataFrame (here called dataFrame)

dataFrame.write.avro("/tmp/output")

Maven dependency

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>4.0.0</version> 
</dependency>
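
For sbt-based builds, the equivalent of the Maven dependency above would presumably be:

```
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
```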



Answer 3:


You need to start the Spark shell with the spark-avro package included (recommended for older Spark versions):

$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0

Then use the DataFrame to write an Avro file:

dataframe.write.format("com.databricks.spark.avro").save(outputPath)

And to write it as an Avro table in Hive (note the table name must be a string):

dataframe.write.format("com.databricks.spark.avro").saveAsTable("hivedb.hivetable_avro")


Source: https://stackoverflow.com/questions/33878433/spark-write-avro-file
