Is there a best practice for parsing all information contained within one parent XML node?

Submitted by 点点圈 on 2021-02-05 07:49:46

Question


I'm writing a VB.NET application to parse a large XML file containing a Japanese dictionary. I'm completely new to XML parsing and don't really know what I'm doing. The whole dictionary sits between the two XML tags <jmdict> and </jmdict>. The next level down is <entry>; these elements hold all the information for the 1 million entries, including the form, pronunciation and meaning of each word, and so on.

A typical entry might look like this:

<entry>
<ent_seq>1486440</ent_seq>
<k_ele>
<keb>美術</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news1</ke_pri>
<ke_pri>nf02</ke_pri>
</k_ele>
<r_ele>
<reb>びじゅつ</reb>
<re_pri>ichi1</re_pri>
<re_pri>news1</re_pri>
<re_pri>nf02</re_pri>
</r_ele>
<sense>
<pos>&n;</pos>
<pos>&adj-no;</pos>
<gloss>art</gloss>
<gloss>fine arts</gloss>
</sense>
<sense>
<gloss xml:lang="dut">kunst</gloss>
<gloss xml:lang="dut">schone kunsten</gloss>
</sense>
<sense>
<gloss xml:lang="fre">art</gloss>
<gloss xml:lang="fre">beaux-arts</gloss>
</sense>
<sense>
<gloss xml:lang="ger">Kunst</gloss>
<gloss xml:lang="ger">die schönen Künste</gloss>
<gloss xml:lang="ger">bildende Kunst</gloss>
</sense>
<sense>
<gloss xml:lang="ger">Produktionsdesign</gloss>
<gloss xml:lang="ger">Szenographie</gloss>
</sense>
<sense>
<gloss xml:lang="hun">művészet</gloss>
<gloss xml:lang="hun">művészeti</gloss>
<gloss xml:lang="hun">művészi</gloss>
<gloss xml:lang="hun">rajzóra</gloss>
<gloss xml:lang="hun">szépművészet</gloss>
</sense>
<sense>
<gloss xml:lang="rus">изящные искусства; искусство</gloss>
<gloss xml:lang="rus">{~{的}} художественный, артистический</gloss>
</sense>
<sense>
<gloss xml:lang="slv">umetnost</gloss>
<gloss xml:lang="slv">likovna umetnost</gloss>
</sense>
<sense>
<gloss xml:lang="spa">bellas artes</gloss>
</sense>
</entry>

I have a class, Entry, which stores all of the information contained in an entry like the one above. I know what all the tags mean and have no trouble interpreting the data semantically; I'm just not sure what tools I need to actually parse all of this information.

For example, how should I extract the contents of the <ent_seq> tag at the beginning? And is the method used to extract information from an XML tag the same even if it's contained within a parent tag, as with the <keb> and <ke_pri> tags inside the <k_ele> tag? Or should I use a different method?

I know this reads like homework help - I'm not asking for someone to provide the complete solution and build the parser. I just don't know where to start and what tools to use. I'd really appreciate some guidance on what methods I need to start parsing the XML file, and then I'll work on building the solution myself once I know what I'm doing.

-

Edit

I've come across the following code on a website; it uses XmlReader to go through one node at a time:

Dim readXML As XmlReader = XmlReader.Create(New StringReader(xmlNode))
While readXML.Read()
    Select Case readXML.NodeType
        Case XmlNodeType.Element
            ListBox1.Items.Add("<" + readXML.Name & ">")
            Exit Select
        Case XmlNodeType.Text
            ListBox1.Items.Add(readXML.Value)
            Exit Select
        Case XmlNodeType.EndElement
            ListBox1.Items.Add("")
            Exit Select
    End Select
End While

But I get this error on the first line:

'XmlNode' is a class type and cannot be used as an expression

I'm not exactly sure what to do about this error - any ideas?


Answer 1:


You can use these classes to deserialize your XML quickly:

Imports System.IO
Imports System.Xml.Serialization
<XmlRoot>
Public Class jmdict
    <XmlElement("entry")>
    Public Property entries As List(Of entry)
End Class
Public Class entry
    Public Property ent_seq As Integer
    Public Property k_ele As k_ele
    Public Property r_ele As r_ele
    <XmlElement("sense")>
    Public Property senses As List(Of sense)
End Class
Public Class sense
    <XmlElement("pos")>
    Public Property posses As List(Of String)
    <XmlElement("gloss")>
    Public Property glosses As List(Of gloss)
End Class
Public Class k_ele
    Public Property keb As String
    <XmlElement("ke_pri")>
    Public Property ke_pris As List(Of String)
End Class
Public Class r_ele
    Public Property reb As String
    <XmlElement("re_pri")>
    Public Property re_pris As List(Of String)
End Class
Public Class gloss
    <XmlAttribute("xml:lang")>
    Public Property lang As String
    <XmlText>
    Public Property Text As String
    Public Overrides Function ToString() As String
        Return Text
    End Function
End Class

The code to deserialize is:

Dim serializer As New XmlSerializer(GetType(jmdict))
Dim d As jmdict
Using sr As New StreamReader("filename.xml")
    d = CType(serializer.Deserialize(sr), jmdict)
End Using

Now you can iterate over each entry, each entry's senses, and each sense's glosses:

For Each e In d.entries
    Console.WriteLine($"seq: {e.ent_seq}")
    For Each s In e.senses
        For Each g In s.glosses
            Console.WriteLine($"Text: {g.Text}, Lang: {g.lang}")
        Next
    Next
Next

The reasons your code takes so long are:

  1. You are parsing the XML as a string
  2. You are inserting lines into a ListBox as you parse them
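
As a rough sketch of reading the file directly instead of going through a string, the loop below streams the document with XmlReader and hands each <entry> element to an XmlSerializer, so only one entry is materialized at a time. It reuses the entry class from the listing above. The names StreamingSketch and ReadEntries are hypothetical, and turning on DtdProcessing.Parse is an assumption that the file declares entities such as &n; in its DOCTYPE. Iterator functions need Visual Basic 11 or later.

```vbnet
Imports System.Collections.Generic
Imports System.Xml
Imports System.Xml.Serialization

Module StreamingSketch
    ' Hypothetical helper: yields one entry at a time from the file at "path".
    Public Iterator Function ReadEntries(path As String) As IEnumerable(Of entry)
        Dim settings As New XmlReaderSettings()
        ' Assumption: the file defines &n;, &adj-no; etc. in its DOCTYPE,
        ' so the DTD has to be processed for those references to resolve.
        settings.DtdProcessing = DtdProcessing.Parse
        settings.IgnoreWhitespace = True

        Dim serializer As New XmlSerializer(GetType(entry))

        Using reader As XmlReader = XmlReader.Create(path, settings)
            reader.MoveToContent()   ' positions the reader on the root element
            reader.Read()            ' steps into the root's children
            While Not reader.EOF
                If reader.NodeType = XmlNodeType.Element AndAlso reader.Name = "entry" Then
                    ' Deserialize consumes the whole <entry> element and leaves
                    ' the reader on the node that follows it, so no Read() here.
                    Yield CType(serializer.Deserialize(reader), entry)
                Else
                    reader.Read()
                End If
            End While
        End Using
    End Function
End Module
```

A loop such as For Each e In ReadEntries("filename.xml") would then visit the entries one by one without first loading the whole dictionary into a jmdict object.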

What do you want to put in the ListBox? If you have deserialized as I show, you can databind a specific list from the data, or a queried result of multiple lists.
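
A minimal sketch of such a queried result, assuming the goal is to show the English glosses of one entry. QuerySketch and EnglishGlosses are hypothetical names, and treating a missing xml:lang as English is an assumption based on the sample entry, where the English glosses carry no attribute.

```vbnet
Imports System.Collections.Generic
Imports System.Linq

Module QuerySketch
    ' Hypothetical helper: flattens the glosses of every sense of one entry
    ' and keeps only those that look English.
    Public Function EnglishGlosses(e As entry) As List(Of String)
        Return (From s In e.senses
                From g In s.glosses
                Where g.lang Is Nothing OrElse g.lang = "eng"
                Select g.Text).ToList()
    End Function
End Module
```

The resulting list can be bound in one step, for example ListBox1.DataSource = EnglishGlosses(d.entries(0)), rather than adding items one at a time while parsing.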



Source: https://stackoverflow.com/questions/60205990/is-there-a-best-practice-for-parsing-all-information-contained-within-one-parent
