Extracting text from XML using Python

野的像风 2020-12-09 11:48

I have this example XML file:

    <page>
      <title>Chapter 1</title>
      <content>Welcome to Chapter 1</content>
    </page>



        
6 answers
  • 2020-12-09 12:09

    Code:

    import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
    
    tree = ET.parse("test.xml")
    root = tree.getroot()
    
    for page in root.findall('page'):
        print("Title: ", page.find('title').text)
        print("Content: ", page.find('content').text)
    

    Output:

    Title:  Chapter 1
    Content:  Welcome to Chapter 1
    Title:  Chapter 2
    Content:  Welcome to Chapter 2
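
Note that `ET.parse` expects a single root element in `test.xml`. If the input is only a run of sibling `<page>` elements, as in the snippets quoted elsewhere in this thread, one possible workaround is to wrap the text in a synthetic root before parsing. This is a minimal sketch: the helper name `parse_fragment` and the wrapper tag `root` are made up for illustration, and it assumes the fragment carries no XML declaration of its own.

```python
import xml.etree.ElementTree as ET

# Root-less input: two sibling <page> elements, which is not well-formed XML
FRAGMENT = """<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page>"""


def parse_fragment(text, wrapper="root"):
    """Wrap root-less XML in a synthetic element (hypothetical helper)."""
    return ET.fromstring("<{0}>{1}</{0}>".format(wrapper, text))


root = parse_fragment(FRAGMENT)
pairs = [(page.find("title").text, page.find("content").text)
         for page in root.findall("page")]
print(pairs)
```

Parsing `FRAGMENT` directly with `ET.fromstring` would fail, since the parser stops at the second top-level element.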
    
  • 2020-12-09 12:15

    You can also extract the text with Beautiful Soup:

    from bs4 import BeautifulSoup
    
    data ="""<page>
      <title>Chapter 1</title>
      <content>Welcome to Chapter 1</content>
    </page>
    <page>
     <title>Chapter 2</title>
     <content>Welcome to Chapter 2</content>
    </page>"""
    
    soup = BeautifulSoup(data, "html.parser")
    
    # Collect the text of every <title> element
    titles = [tag.get_text() for tag in soup.find_all("title")]
    
    # Collect the text of every <content> element
    contents = [tag.get_text() for tag in soup.find_all("content")]
    
    # Pair each title with the content at the same position
    for pair in zip(titles, contents):
        print(pair)
    

    Output:

    ('Chapter 1', 'Welcome to Chapter 1')
    ('Chapter 2', 'Welcome to Chapter 2')
    
  • 2020-12-09 12:17

    I'd recommend a simple library, simplified_scrapy. There are examples here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

    from simplified_scrapy.simplified_doc import SimplifiedDoc
    html ='''
    <page>
      <title>Chapter 1</title>
      <content>Welcome to Chapter 1</content>
    </page>
    <page>
     <title>Chapter 2</title>
     <content>Welcome to Chapter 2</content>
    </page>'''
    doc = SimplifiedDoc(html)
    pages = doc.pages
    print([(page.title.text, page.content.text) for page in pages])
    

    Result:

    [('Chapter 1', 'Welcome to Chapter 1'), ('Chapter 2', 'Welcome to Chapter 2')]
    
  • 2020-12-09 12:24

    Python's standard library already includes XML parsers, notably ElementTree. For example:

    >>> import xml.etree.ElementTree as ET
    >>> xmlstr = """
    ... <root>
    ... <page>
    ...   <title>Chapter 1</title>
    ...   <content>Welcome to Chapter 1</content>
    ... </page>
    ... <page>
    ...  <title>Chapter 2</title>
    ...  <content>Welcome to Chapter 2</content>
    ... </page>
    ... </root>
    ... """
    >>> root = ET.fromstring(xmlstr)
    >>> for page in list(root):
    ...     title = page.find('title').text
    ...     content = page.find('content').text
    ...     print('title: %s; content: %s' % (title, content))
    ...
    title: Chapter 1; content: Welcome to Chapter 1
    title: Chapter 2; content: Welcome to Chapter 2
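
One caveat with `page.find('title').text`: `find` returns `None` when the child is missing, so the `.text` lookup raises `AttributeError`. If that can happen in your data, `findtext` with a default is one way around it. A small sketch, where the second page deliberately omits `<content>`; the sample data and the placeholder strings are invented for illustration:

```python
import xml.etree.ElementTree as ET

xmlstr = """
<root>
<page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
</page>
</root>
"""

root = ET.fromstring(xmlstr)
rows = []
# iter() walks the whole tree, so nested <page> elements would be found too
for page in root.iter("page"):
    # findtext returns the default instead of failing when the child is absent
    title = page.findtext("title", default="(no title)")
    content = page.findtext("content", default="(no content)")
    rows.append((title, content))

for title, content in rows:
    print("title: %s; content: %s" % (title, content))
```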
    
  • 2020-12-09 12:31

    I personally prefer parsing using xml.dom.minidom like so:

    In [18]: import xml.dom.minidom
    
    In [19]: x = """\
    <root><page>
      <title>Chapter 1</title>
      <content>Welcome to Chapter 1</content>
    </page>
    <page>
     <title>Chapter 2</title>
     <content>Welcome to Chapter 2</content>
    </page></root>"""
    
    In [28]: doc = xml.dom.minidom.parseString(x)
    In [29]: doc.getElementsByTagName("page")
    Out[29]: [<DOM Element: page at 0x94d5acc>, <DOM Element: page at 0x94d5c8c>]
    
    In [32]: [p.firstChild.wholeText for p in doc.getElementsByTagName("title") if p.firstChild.nodeType == p.TEXT_NODE]
    Out[32]: ['Chapter 1', 'Chapter 2']
    
    In [34]: [p.firstChild.wholeText for p in doc.getElementsByTagName("content") if p.firstChild.nodeType == p.TEXT_NODE]
    Out[34]: ['Welcome to Chapter 1', 'Welcome to Chapter 2']
    
    In [36]: for page in doc.getElementsByTagName("page"):
                 for child in page.childNodes:        # <title>, <content>, and whitespace text
                     for node in child.childNodes:    # text nodes have no children
                         if node.nodeType == node.TEXT_NODE:
                             print(node.wholeText)
    Chapter 1
    Welcome to Chapter 1
    Chapter 2
    Welcome to Chapter 2
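
The `firstChild` lookups above assume every element has at least one child; an empty element such as `<title/>` has `firstChild` set to `None`. A small helper that joins the direct text children sidesteps that. This is only a sketch, and `element_text` is a hypothetical name, not part of minidom:

```python
import xml.dom.minidom

x = """<root><page>
  <title>Chapter 1</title>
  <content>Welcome to Chapter 1</content>
</page>
<page>
  <title>Chapter 2</title>
  <content>Welcome to Chapter 2</content>
</page></root>"""


def element_text(element):
    """Join the data of the element's direct text children (hypothetical helper)."""
    return "".join(child.data for child in element.childNodes
                   if child.nodeType == child.TEXT_NODE)


doc = xml.dom.minidom.parseString(x)
chapters = []
for page in doc.getElementsByTagName("page"):
    # Look up the children per page so each title stays paired with its content
    title = element_text(page.getElementsByTagName("title")[0])
    content = element_text(page.getElementsByTagName("content")[0])
    chapters.append((title, content))

print(chapters)
```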
    
  • 2020-12-09 12:33

    For navigating, searching, and modifying XML or HTML data, I have found the Beautiful Soup library very useful. For installation problems or more detail, see the Beautiful Soup documentation.

    To read a tag's text or its attribute values:

    from bs4 import BeautifulSoup
    data = """<?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
    
    <pdf2xml producer="poppler" version="0.48.0">
    <page number="1" position="absolute" top="0" left="0" height="1188" width="918">
    <text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF CANADA</text>
    <text top="261" width="86" height="16" font="1">13479 77 AVE</text>
    </page>
    </pdf2xml>"""
    
    soup = BeautifulSoup(data, "lxml")
    pages = soup.find_all("page")
    # Iterate over the <text> elements of the first page
    for text_tag in pages[0].find_all("text"):
        print("Text : ", text_tag.text)
        # .get() returns None when the attribute is missing
        print("Left tag : ", text_tag.get("left"))
    

    Output:

    Text :  PALS SOCIETY OF CANADA
    Left tag :  135
    Text :  13479 77 AVE
    Left tag :  None
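
If installing lxml is not an option, roughly the same lookup can be done with the standard library's ElementTree, where `Element.get` plays the role of the tag's `.get`. A minimal sketch under that assumption; the XML declaration and DOCTYPE are left out of the sample to keep it short:

```python
import xml.etree.ElementTree as ET

data = """<pdf2xml producer="poppler" version="0.48.0">
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="246" left="135" width="178" height="16" font="1">PALS SOCIETY OF CANADA</text>
<text top="261" width="86" height="16" font="1">13479 77 AVE</text>
</page>
</pdf2xml>"""

root = ET.fromstring(data)
results = []
for text in root.iter("text"):
    # Element.get returns None when the attribute is absent
    results.append((text.text, text.get("left")))

for value, left in results:
    print("Text : ", value)
    print("Left tag : ", left)
```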
    