Read XML in Spark

Asked by 花落未央 on 2020-12-19 18:46 · 3 answers · 1051 views

I am trying to read XML / nested XML in PySpark using the spark-xml jar.

df = sqlContext.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", \


        
3 Answers
  • 2020-12-19 19:01

hierarchy should be the rootTag and att should be the rowTag:

    df = spark.read \
        .format("com.databricks.spark.xml") \
        .option("rootTag", "hierarchy") \
        .option("rowTag", "att") \
        .load("test.xml")
    

    and you should get

    +-----+------+----------------------------+
    |Order|attval|children                    |
    +-----+------+----------------------------+
    |1    |Data  |[[[1, Studyval], [2, Site]]]|
    |2    |Info  |[[[1, age], [2, gender]]]   |
    +-----+------+----------------------------+
    

    and schema

    root
     |-- Order: long (nullable = true)
     |-- attval: string (nullable = true)
     |-- children: struct (nullable = true)
     |    |-- att: array (nullable = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- Order: long (nullable = true)
     |    |    |    |-- attval: string (nullable = true)
    

You can find more information in the Databricks spark-xml documentation.
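To see why those two options matter, here is a minimal plain-Python sketch of the same mapping using only the standard library. The sample XML below is an assumption reconstructed from the output table above (the question's actual test.xml is not shown): the document root corresponds to rootTag, and each att element directly under it becomes one row.

```python
import xml.etree.ElementTree as ET

# Assumed test.xml contents, reconstructed from the answer's output table.
sample = """<hierarchy>
  <att>
    <Order>1</Order><attval>Data</attval>
    <children>
      <att><Order>1</Order><attval>Studyval</attval></att>
      <att><Order>2</Order><attval>Site</attval></att>
    </children>
  </att>
  <att>
    <Order>2</Order><attval>Info</attval>
    <children>
      <att><Order>1</Order><attval>age</attval></att>
      <att><Order>2</Order><attval>gender</attval></att>
    </children>
  </att>
</hierarchy>"""

root = ET.fromstring(sample)       # rootTag = "hierarchy": the document root
rows = []
for att in root.findall("att"):    # rowTag = "att": each one becomes a row
    children = [(int(c.findtext("Order")), c.findtext("attval"))
                for c in att.find("children").findall("att")]
    rows.append((int(att.findtext("Order")), att.findtext("attval"), children))

for row in rows:
    print(row)
```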

  • 2020-12-19 19:15

Databricks has released a new version of spark-xml for reading XML into a Spark DataFrame:

    <dependency>
         <groupId>com.databricks</groupId>
         <artifactId>spark-xml_2.12</artifactId>
         <version>0.6.0</version>
     </dependency>
    

The input XML file used in this example is available in the GitHub repository.

    val df = spark.read
          .format("com.databricks.spark.xml")
          .option("rowTag", "person")
          .xml("persons.xml")
    

    Schema

    root
     |-- _id: long (nullable = true)
     |-- dob_month: long (nullable = true)
     |-- dob_year: long (nullable = true)
     |-- firstname: string (nullable = true)
     |-- gender: string (nullable = true)
     |-- lastname: string (nullable = true)
     |-- middlename: string (nullable = true)
     |-- salary: struct (nullable = true)
     |    |-- _VALUE: long (nullable = true)
     |    |-- _currency: string (nullable = true)
    

    Outputs:

    +---+---------+--------+---------+------+--------+----------+---------------+
    |_id|dob_month|dob_year|firstname|gender|lastname|middlename|         salary|
    +---+---------+--------+---------+------+--------+----------+---------------+
    |  1|        1|    1980|    James|     M|   Smith|      null|  [10000, Euro]|
    |  2|        6|    1990|  Michael|     M|    null|      Rose|[10000, Dollor]|
    +---+---------+--------+---------+------+--------+----------+---------------+
    

Note that the Spark XML API has some limitations, which are discussed here: Spark-XML API Limitations.
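One detail worth noting in the schema above: spark-xml maps XML attributes to fields prefixed with an underscore (hence _id and _currency), and when an element carries both text and attributes, the text lands in _VALUE. A plain-Python sketch of that convention follows; the sample <person> record is an assumption reconstructed from the schema and output table, not the real file from the repository.

```python
import xml.etree.ElementTree as ET

# Assumed shape of one record from persons.xml, reconstructed from the
# schema and output above.
sample = """<person id="1">
  <firstname>James</firstname>
  <lastname>Smith</lastname>
  <dob_year>1980</dob_year>
  <dob_month>1</dob_month>
  <gender>M</gender>
  <salary currency="Euro">10000</salary>
</person>"""

p = ET.fromstring(sample)
salary = p.find("salary")
record = {
    "_id": int(p.get("id")),                  # attribute -> _-prefixed field
    "firstname": p.findtext("firstname"),
    "salary": {
        "_VALUE": int(salary.text),           # element text -> _VALUE
        "_currency": salary.get("currency"),  # attribute -> _currency
    },
}
print(record)
```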

Hope this helps!

  • 2020-12-19 19:25

You can use the Databricks spark-xml jar to parse the XML into a DataFrame. You can pull in the dependency with Maven or sbt, or pass the jar directly when launching Spark:

    pyspark --jars /home/sandipan/Downloads/spark_jars/spark-xml_2.11-0.6.0.jar
    
    df = spark.read \
        .format("com.databricks.spark.xml") \
        .option("rootTag", "SmsRecords") \
        .option("rowTag", "sms") \
        .load("/home/sandipan/Downloads/mySMS/Sms/backupinfo.xml")
    
    Schema:

    >>> df.printSchema()
    root
     |-- address: string (nullable = true)
     |-- body: string (nullable = true)
     |-- date: long (nullable = true)
     |-- type: long (nullable = true)
    
    >>> df.select("address").distinct().count()
    530 
    

    Follow this http://www.thehadoopguy.com/2019/09/how-to-parse-xml-data-to-saprk-dataframe.html
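As an alternative to pointing --jars at a locally downloaded file, the package can be pulled by its Maven coordinates at launch (a sketch; the artifact's Scala version, 2.11 here, must match the Scala version of your Spark build):

```shell
# Fetches spark-xml from Maven Central instead of using a local jar.
# Version 0.6.0 matches the answers above; newer releases exist.
pyspark --packages com.databricks:spark-xml_2.11:0.6.0
```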
