Parsing non-standard XML (CDATA tag)

无人共我 2020-12-19 10:20

When I tried to parse an XML document in Python using the BeautifulSoup library, I ran into some problems. The XML document that I want to parse:




        
2 Answers
  • 2020-12-19 11:08

    You don't need BeautifulStoneSoup or lxml. Python's included batteries do the job just fine, and there doesn't seem to be anything non-compliant about your XML.

    >>> content='''\
    ... <item>
    ... <title><![CDATA[Title Sample]]></title>
    ... <link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
    ... <time_start>2011-10-10 09:00:00</time_start>
    ... <time_end>2011-10-17 09:00:00</time_end>
    ... <price_original>35000</price_original>
    ... <price_now>20000</price_now>
    ... </item>'''
    >>> import xml.etree.ElementTree as et
    >>> foo = et.XML(content)
    >>> for e in foo:
    ...     print(e.tag, e.text, repr(e.tail))
    ...
    title Title Sample '\n'
    link None 'http://banhada.kr/?cateCode=09&viewCode=S0941580\n'
    time_start 2011-10-10 09:00:00 '\n'
    time_end 2011-10-17 09:00:00 '\n'
    price_original 35000 '\n'
    price_now 20000 '\n'
    >>>
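
The session above shows that the value lands in `.text` for a normal element but in `.tail` for the empty `<link />`, because the CDATA section follows the tag instead of sitting inside it. A minimal sketch of a helper that handles both cases is given below. It is not part of the original answer: `item_to_dict` is a hypothetical name, and the sketch assumes each child keeps its value in exactly one of the two places.

```python
import xml.etree.ElementTree as ET

CONTENT = """\
<item>
<title><![CDATA[Title Sample]]></title>
<link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
<time_start>2011-10-10 09:00:00</time_start>
<time_end>2011-10-17 09:00:00</time_end>
<price_original>35000</price_original>
<price_now>20000</price_now>
</item>"""


def item_to_dict(xml_text):
    """Map each child tag to its text, falling back to the tail text.

    Hypothetical helper: assumes a flat <item> whose children carry
    their value either inside the element or directly after it.
    """
    record = {}
    for child in ET.fromstring(xml_text):
        text = (child.text or "").strip()   # value inside the element
        tail = (child.tail or "").strip()   # value after the closing tag
        record[child.tag] = text or tail
    return record


record = item_to_dict(CONTENT)
print(record["title"])
print(record["link"])
```

Stripping whitespace first means the newline that follows every element does not count as a value, so only `<link />` actually falls back to its tail.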
    
  • 2020-12-19 11:18

    You could use BeautifulSoup to parse XML:

    import bs4 as bs
    content='''\
    <item>
    <title><![CDATA[Title Sample]]></title>
    <link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
    <time_start>2011-10-10 09:00:00</time_start>
    <time_end>2011-10-17 09:00:00</time_end>
    <price_original>35000</price_original>
    <price_now>20000</price_now>
    </item>'''    
    
    soup = bs.BeautifulSoup(content, 'xml')
    
    title = soup.title
    print(title.string)
    # Title Sample
    
    link = soup.link.next_sibling
    print(link)
    # http://banhada.kr/?cateCode=09&viewCode=S0941580
    

    Under the hood, BeautifulSoup uses lxml for parsing XML (the 'xml' argument selects it, so lxml must be installed). Although it's not needed here, you might want to use lxml directly, since XPath gives you more succinct ways to navigate through XML:

    import lxml.etree as ET
    
    content='''\
    <item>
    <title><![CDATA[Title Sample]]></title>
    <link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
    <time_start>2011-10-10 09:00:00</time_start>
    <time_end>2011-10-17 09:00:00</time_end>
    <price_original>35000</price_original>
    <price_now>20000</price_now>
    </item>'''    
    
    doc = ET.fromstring(content)
    
    title = doc.find('title')
    print(title.text)
    # Title Sample
    
    link = doc.find('link')
    print(link.tail)
    # http://banhada.kr/?cateCode=09&viewCode=S0941580
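
The snippet above uses `find`, so here is a minimal sketch of what the XPath route might look like on the same sample. It is not from the original answer; it assumes lxml is installed, and the variable names are only illustrative.

```python
import lxml.etree as ET

content = """\
<item>
<title><![CDATA[Title Sample]]></title>
<link /><![CDATA[http://banhada.kr/?cateCode=09&viewCode=S0941580]]>
<time_start>2011-10-10 09:00:00</time_start>
<time_end>2011-10-17 09:00:00</time_end>
<price_original>35000</price_original>
<price_now>20000</price_now>
</item>"""

doc = ET.fromstring(content)

# string() yields the text content of the first matching node.
title = doc.xpath("string(title)")

# The URL is not inside <link/>; it is the text node right after it.
# normalize-space() also trims the trailing newline.
link = doc.xpath("normalize-space(link/following-sibling::text()[1])")

# A node-set query returns a list of elements; here, every child
# whose tag name starts with "price_".
price_tags = [e.tag for e in doc.xpath("*[starts-with(name(), 'price_')]")]

print(title)
print(link)
print(price_tags)
```

The `following-sibling` axis expresses "the text right after `<link />`" in the query itself, which is the kind of navigation the standard library's limited path syntax does not offer.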
    