PySpark 2.4.0: read Avro from Kafka with readStream - Python

北荒 2020-12-03 19:35

I am trying to read avro messages from Kafka, using PySpark 2.4.0.

The external spark-avro module provides this for reading Avro files, but I need to do the same for messages read from Kafka with readStream.

1 answer
  •  醉梦人生
    2020-12-03 20:06

    You can include the spark-avro package, for example with --packages (adjust the versions to match your Spark installation):

    bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.0
    

    and provide your own wrappers, since PySpark 2.4 does not expose from_avro and to_avro in Python:

    from pyspark import SparkContext
    from pyspark.sql.column import Column, _to_java_column

    def from_avro(col, jsonFormatSchema):
        """Decode an Avro binary column using an Avro schema given as a JSON string."""
        sc = SparkContext._active_spark_context
        avro = sc._jvm.org.apache.spark.sql.avro
        # The functions live in the Scala package object, reached via package$.MODULE$
        f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
        return Column(f(_to_java_column(col), jsonFormatSchema))


    def to_avro(col):
        """Encode a column as Avro binary."""
        sc = SparkContext._active_spark_context
        avro = sc._jvm.org.apache.spark.sql.avro
        f = getattr(getattr(avro, "package$"), "MODULE$").to_avro
        return Column(f(_to_java_column(col)))
    

    Example usage (adapted from the official test suite):

    from pyspark.sql.functions import col, struct
    
    
    avro_type_struct = """
    {
      "type": "record",
      "name": "struct",
      "fields": [
        {"name": "col1", "type": "long"},
        {"name": "col2", "type": "string"}
      ]
    }"""
    
    
    df = spark.range(10).select(struct(
        col("id"),
        col("id").cast("string").alias("id2")
    ).alias("struct"))
    avro_struct_df = df.select(to_avro(col("struct")).alias("avro"))
    avro_struct_df.show(3)
    
    +----------+
    |      avro|
    +----------+
    |[00 02 30]|
    |[02 02 31]|
    |[04 02 32]|
    +----------+
    only showing top 3 rows
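
The three bytes in each row follow from Avro's binary encoding: a record is just its fields in order, a long is a zigzag varint, and a string is its byte length (as a long) followed by the UTF-8 data. As a sanity check, here is a minimal pure-Python sketch that should reproduce the rows above without Spark. The helper names `zigzag_varint` and `encode_record` are made up for illustration and are not part of spark-avro.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer the way Avro encodes a long."""
    # Zigzag maps small magnitudes (positive or negative) to small unsigned values.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_record(col1: int, col2: str) -> bytes:
    """Encode the {col1: long, col2: string} record from the example schema."""
    data = col2.encode("utf-8")
    # Record = fields concatenated in schema order; string = length + bytes.
    return zigzag_varint(col1) + zigzag_varint(len(data)) + data


for i in range(3):
    print(" ".join(f"{b:02x}" for b in encode_record(i, str(i))))
# prints:
# 00 02 30
# 02 02 31
# 04 02 32
```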
    
    avro_struct_df.select(from_avro("avro", avro_type_struct)).show(3)
    
    +------------------------------------------------+
    |from_avro(avro, struct<col1:bigint,col2:string>)|
    +------------------------------------------------+
    |                                          [0, 0]|
    |                                          [1, 1]|
    |                                          [2, 2]|
    +------------------------------------------------+
    only showing top 3 rows
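
For the Kafka case in the question, the same wrapper can be applied to the `value` column of a streaming DataFrame. The following is only a sketch under some assumptions: `read_avro_stream`, `kafka_source_options` and `spark_packages` are hypothetical helper names, the broker address and topic are placeholders, and the Kafka source needs the spark-sql-kafka-0-10 package on the classpath in addition to spark-avro. The wrapper is passed in as an argument so the helpers themselves do not import pyspark.

```python
def spark_packages(scala="2.11", spark="2.4.0"):
    """Build a --packages value covering both spark-avro and the Kafka source."""
    return ",".join([
        f"org.apache.spark:spark-avro_{scala}:{spark}",
        f"org.apache.spark:spark-sql-kafka-0-10_{scala}:{spark}",
    ])


def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Options understood by Spark's Kafka source (values here are placeholders)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": starting_offsets,
    }


def read_avro_stream(spark, from_avro, schema_json, bootstrap_servers, topic):
    """Return a streaming DataFrame with the Kafka value decoded into column 'data'.

    `from_avro` is the wrapper defined above; `schema_json` is the Avro schema
    as a JSON string, as in the batch example.
    """
    raw = (
        spark.readStream
        .format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers the payload as binary in the 'value' column.
    return raw.select(from_avro(raw["value"], schema_json).alias("data"))
```

With the wrappers and schema from above, a call might look like `read_avro_stream(spark, from_avro, avro_type_struct, "localhost:9092", "events")`, after starting pyspark with `--packages` set to the string returned by `spark_packages()`. Note that this assumes the messages contain plain Avro binary matching the given schema.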
    
