I am writing a Spark job using Python. However, I need to read in a whole bunch of Avro files.
This is the closest solution that I have found in Spark's example fo
Spark >= 2.4.0
You can use built-in Avro support. The API is backwards compatible with the spark-avro package, with a few additions (most notably the from_avro / to_avro functions).
Please note that the module is not bundled with the standard Spark binaries and has to be included using spark.jars.packages or an equivalent mechanism.
See also Pyspark 2.4.0, read avro from kafka with read stream - Python
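For Spark >= 2.4.0, the setup described above might look roughly like the sketch below. The helper names avro_package and read_avro, the default Scala version 2.11, the application name, and the default Spark version are assumptions for illustration; the coordinate has to match the Spark and Scala build you actually run.

```python
def avro_package(spark_version, scala_version="2.11"):
    """Build the Maven coordinate of the external spark-avro module.

    Assumption: scala_version must match the Scala build of your Spark
    distribution (2.11 is only a placeholder default).
    """
    return "org.apache.spark:spark-avro_{}:{}".format(scala_version, spark_version)


def read_avro(path, spark_version="2.4.0"):
    """Start a session with the Avro module attached and load `path`."""
    # Imported lazily so avro_package() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("read-avro")  # hypothetical application name
        .config("spark.jars.packages", avro_package(spark_version))
        .getOrCreate()
    )
    # With the module on the classpath, the short format name "avro" resolves.
    return spark.read.format("avro").load(path)
```

Note that spark.jars.packages only takes effect if it is set before the session (and its JVM) is started; if a session is already running, pass the same coordinate through --packages on spark-submit or pyspark instead.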
Spark < 2.4.0
You can use the spark-avro library. First, let's create an example dataset:
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumWriter  # needed for DataFileWriter below

schema_string = '''{"namespace": "example.avro",
 "type": "record",
 "name": "KeyValue",
 "fields": [
     {"name": "key", "type": "string"},
     {"name": "value", "type": ["int", "null"]}
 ]
}'''

# Note: the avro-python3 package spells this avro.schema.Parse
schema = avro.schema.parse(schema_string)

# Avro container files are binary, so open the file in "wb" mode
with open("kv.avro", "wb") as f, DataFileWriter(f, DatumWriter(), schema) as wrt:
    wrt.append({"key": "foo", "value": -1})
    wrt.append({"key": "bar", "value": 1})
Reading it using spark-avro is as simple as this:
df = sqlContext.read.format("com.databricks.spark.avro").load("kv.avro")
df.show()
## +---+-----+
## |key|value|
## +---+-----+
## |foo|   -1|
## |bar|    1|
## +---+-----+
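Since the original goal was to read a whole bunch of Avro files, here is a hedged sketch that picks the data source name from the Spark version and loads several files in one call. It relies on DataFrameReader.load accepting either a single path (glob patterns included) or a list of paths. The helper names avro_format and read_avro_files are hypothetical, not part of Spark or spark-avro.

```python
def avro_format(spark_version):
    """Pick the data source name for a Spark version string such as "2.3.1"."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    # Built-in module from 2.4.0 onwards, Databricks package before that.
    return "avro" if (major, minor) >= (2, 4) else "com.databricks.spark.avro"


def read_avro_files(reader, spark_version, paths):
    """Load one or more Avro files into a single DataFrame.

    `reader` is a DataFrameReader, e.g. sqlContext.read or spark.read.
    `paths` may be a single path, a glob such as "data/*.avro", or a list.
    """
    return reader.format(avro_format(spark_version)).load(paths)
```

With the sqlContext from the example above, a call could look like read_avro_files(sqlContext.read, sc.version, ["kv.avro", "more/*.avro"]), where sc is the active SparkContext and the second path is only a placeholder.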