Avro Schema to Spark StructType

浪子不回头ぞ submitted on 2019-11-29 07:55:45

Disclaimer: It's kind of a dirty hack. It depends on a few things:

  • Python provides a lightweight Avro processing library and, because it is dynamically typed, it doesn't require typed writers
  • an empty Avro file is still a valid document
  • Spark schema can be converted to and from JSON
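As a quick illustration of the last point: the JSON produced by `StructType.json()` is ordinary JSON and can be inspected with the standard library alone. The sketch below hand-writes a schema document in that shape (the field names are hypothetical, not from the original post):

```python
import json

# A hand-written example of the JSON shape that StructType.json() emits
# (a "struct" with a list of named, typed, nullable fields).
# The field names "id" and "label" are hypothetical.
spark_schema_json = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "id", "type": "long", "nullable": False, "metadata": {}},
        {"name": "label", "type": "string", "nullable": True, "metadata": {}},
    ],
})

# Round-trip: parse it back and list the fields.
schema = json.loads(spark_schema_json)
for field in schema["fields"]:
    print(field["name"], field["type"])
```

Because the format is plain JSON, the file written by the script below can be consumed by any tool, not just Spark.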

The following code reads an Avro schema file, creates an empty Avro file with the given schema, reads it with spark-avro, and writes the resulting Spark schema out as a JSON file.

import argparse
import tempfile

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

from pyspark import SparkContext
from pyspark.sql import SQLContext

def parse_schema(schema):
    with open(schema) as fr:
        return avro.schema.parse(fr.read())

def write_dummy(schema):
    tmp = tempfile.mktemp(suffix='.avro')
    # Avro is a binary format, so the file must be opened in binary mode.
    with open(tmp, "wb") as fw:
        writer = DataFileWriter(fw, DatumWriter(), schema)
        writer.close()
    return tmp

def write_spark_schema(path, schema):
    with open(path, 'w') as fw:
        fw.write(schema.json())


def main():
    parser = argparse.ArgumentParser(description='Avro schema converter')
    parser.add_argument('--schema')
    parser.add_argument('--output')
    args = parser.parse_args()

    sc = SparkContext('local[1]', 'Avro schema converter')
    sqlContext = SQLContext(sc)

    df = (sqlContext.read.format('com.databricks.spark.avro')
            .load(write_dummy(parse_schema(args.schema))))

    write_spark_schema(args.output, df.schema)
    sc.stop()


if __name__ == '__main__':
    main()

Usage:

bin/spark-submit --packages com.databricks:spark-avro_2.10:2.0.1 \
   avro_to_spark_schema.py \
   --schema path_to_avro_schema.avsc \
   --output path_to_spark_schema.json
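For reference, the `.avsc` file passed to `--schema` is itself plain JSON, so it can be sanity-checked without Spark. A minimal sketch (the record name and fields here are hypothetical):

```python
import json

# A hypothetical minimal Avro schema; .avsc files are plain JSON
# containing a record declaration with a list of fields.
avsc_text = """
{
  "type": "record",
  "name": "Example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "label", "type": ["null", "string"], "default": null}
  ]
}
"""

avsc = json.loads(avsc_text)
print(avsc["name"], [f["name"] for f in avsc["fields"]])
```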

Read schema:

import scala.io.Source
import org.apache.spark.sql.types.{DataType, StructType}

val json: String = Source.fromFile("schema.json").getLines.toList.head
val schema: StructType = DataType.fromJson(json).asInstanceOf[StructType]
hadooper

Please see if this helps, although it's a little late. I struggled with this for my current work and used SchemaConverters from Databricks. I assume you were trying to read the Avro file with the given schema.

import java.io.File

import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType
import com.databricks.spark.avro.SchemaConverters

// Parse the .avsc file and convert each top-level Avro field to a Spark type.
val schemaObj = new Schema.Parser().parse(new File(avscfilepath))
var sparkSchema: StructType = new StructType
import scala.collection.JavaConversions._
for (field <- schemaObj.getFields()) {
  sparkSchema = sparkSchema.add(field.name, SchemaConverters.toSqlType(field.schema).dataType)
}
sparkSchema