Extract data from XML file if arguments are of certain values

Question

I want to loop through a Wikipedia dump in XML format and, for each revision, save the timestamp and the comment if the revision was made by a certain username. Is this possible? I'm trying to get familiar with lxml.

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
    <siteinfo>
        <sitename>Wikipedia</sitename>
        <dbname>enwiki</dbname>
        <base>https://en.wikipedia.org/wiki/Main_Page</base>
        <generator>MediaWiki 1.27.0-wmf.18</generator>
        <case>first-letter</case>
        <namespaces>...</namespaces>
    </siteinfo>
    <page>
        <title>Zhuangzi</title>
        <ns>0</ns>
        <id>42870472</id>
        <revision>
            <id>610251969</id>
            <timestamp>2014-05-26T20:08:14Z</timestamp>
            <contributor>
                <username>White whirlwind</username>
                <id>8761551</id>
            </contributor>
            <comment>...</comment>
            <model>wikitext</model>
            <format>text/x-wiki</format>
            <text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
            <sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
        </revision>
        <revision>...</revision>
        <revision>...</revision>
        <revision>...</revision>
        <revision>...</revision>
        <revision>...</revision>

    </page>
    <page>...</page>
</mediawiki>

Answer 1:

import xmltodict 


xml_input = """
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.27.0-wmf.18</generator>
    <case>first-letter</case>
    <namespaces>...</namespaces>
</siteinfo>
<page>
    <title>Zhuangzi</title>
    <ns>0</ns>
    <id>42870472</id>
    <revision>
        <id>610251969</id>
        <timestamp>2014-05-25T20:08:14Z</timestamp>
        <contributor>
            <username>Patric</username>
            <id>8761551</id>
        </contributor>
    </revision>
    <revision>
        <id>610251969</id>
        <timestamp>2014-05-26T20:08:14Z</timestamp>
        <contributor>
            <username>Don</username>
            <id>8761551</id>
        </contributor>
    </revision>
    <revision>
        <id>610251969</id>
        <timestamp>2014-05-27T20:08:14Z</timestamp>
        <contributor>
            <username>Patric</username>
            <id>8761551</id>
        </contributor>
    </revision>                
</page>
</mediawiki>
"""


dic_xml = xmltodict.parse(xml_input)

# Print the id and timestamp of every revision made by 'Patric'.
for rev in dic_xml['mediawiki']['page']['revision']:
    if rev['contributor']['username'] == 'Patric':
        print(rev['id'])
        print(rev['timestamp'])

With your file:

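import xmltodict

with open('/home/jurkij/Downloads/testarticles.xml') as xml_file:
    dic_xml = xmltodict.parse(xml_file.read())
    # xmltodict yields a list only when an element repeats, so this loop
    # assumes several pages and several revisions per page.
    for page in dic_xml['mediawiki']['page']:
        for rev in page['revision']:
            # Not every contributor has a 'username' key, so check first.
            if 'username' in rev['contributor'] and rev['contributor']['username'] == 'Aristophanes68':
                print(rev['timestamp'])
                print(rev['id'])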

Answer 2:

Yes, this is possible using lxml.

You know which nodes you are looking for (start with the revision's username), so write code that selects that node and compares its value against the name you are looking for.

Once you have done that part, saving the timestamp and comment should be simple.

You will find what you need in the lxml documentation (http://lxml.de/). Look into the sections on "XPath" to figure out how to select the nodes you want; they include snippets that load the XML into your script.

You may also wish to consult the ElementTree tutorial that lxml links to (http://effbot.org/zone/element.htm) to understand how to work with the XML elements you find via XPath or other methods. This will be useful for getting the values out of the elements.
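The approach described above can be sketched with the ElementTree-style calls that lxml shares with the standard library. This is only an illustration under stated assumptions: the sample XML is a trimmed stand-in for the dump, the comment texts are invented, and revisions_by_user is a hypothetical helper name, not part of lxml.

```python
try:
    from lxml import etree  # the library the question asks about
except ImportError:
    # The standard library offers the same calls used in this sketch.
    import xml.etree.ElementTree as etree

NS = {"wiki": "http://www.mediawiki.org/xml/export-0.10/"}

# Trimmed stand-in for the dump; the comment texts are invented.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Zhuangzi</title>
    <revision>
      <timestamp>2014-05-26T20:08:14Z</timestamp>
      <contributor><username>White whirlwind</username></contributor>
      <comment>made-up comment A</comment>
    </revision>
    <revision>
      <timestamp>2014-07-26T20:08:14Z</timestamp>
      <contributor><username>foobar</username></contributor>
      <comment>made-up comment B</comment>
    </revision>
    <revision>
      <timestamp>2014-08-01T10:00:00Z</timestamp>
      <contributor><ip>127.0.0.1</ip></contributor>
    </revision>
  </page>
</mediawiki>"""


def revisions_by_user(root, username):
    """Return (timestamp, comment) pairs for revisions made by `username`."""
    results = []
    # Visit every <revision> element, whatever page it belongs to.
    for rev in root.iter("{%s}revision" % NS["wiki"]):
        # findtext returns None when the contributor has no <username>.
        name = rev.findtext("wiki:contributor/wiki:username", namespaces=NS)
        if name == username:
            results.append((
                rev.findtext("wiki:timestamp", namespaces=NS),
                rev.findtext("wiki:comment", namespaces=NS),
            ))
    return results


root = etree.fromstring(SAMPLE)
matches = revisions_by_user(root, "White whirlwind")
print(matches)
```

For a real file, replacing etree.fromstring(SAMPLE) with etree.parse(path).getroot() should give the same kind of root element.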

Answer 3:

Continuing on from your last question, you can easily do it with lxml and an xpath expression:

from lxml.etree import parse

tree = parse("test.xml")

ns = {"wiki": "http://www.mediawiki.org/xml/export-0.10/"}
revs = tree.xpath("//wiki:revision[.//wiki:username='White whirlwind']",namespaces=ns)

# Pair each matching revision's timestamp with its username.
print([
    (rev.xpath(".//wiki:timestamp//text()", namespaces=ns)[0],
     rev.xpath(".//wiki:username//text()", namespaces=ns)[0])
    for rev in revs
])

For the following xml:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
    <siteinfo>
        <sitename>Wikipedia</sitename>
        <dbname>enwiki</dbname>
        <base>https://en.wikipedia.org/wiki/Main_Page</base>
        <generator>MediaWiki 1.27.0-wmf.18</generator>
        <case>first-letter</case>
        <namespaces>...</namespaces>
    </siteinfo>
    <page>
        <title>Zhuangzi</title>
        <ns>0</ns>
        <id>42870472</id>
        <revision>
            <id>610251969</id>
            <timestamp>2014-05-26T20:08:14Z</timestamp>
            <contributor>
                <username>White whirlwind</username>
                <id>8761551</id>
            </contributor>
            <comment>...</comment>
            <model>wikitext</model>
            <format>text/x-wiki</format>
            <text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
            <sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
        </revision>
        <revision>
                 <id>610251969</id>
            <timestamp>2014-06-26T20:08:14Z</timestamp>
            <contributor>
                <username>White whirlwind</username>
                <id>8761551</id>
            </contributor>
            <comment>...</comment>
            <model>wikitext</model>
            <format>text/x-wiki</format>
            <text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
            <sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1>
        </revision>
        <revision>     <id>610251969</id>
            <timestamp>2014-07-26T20:08:14Z</timestamp>
            <contributor>
                <username>foobar</username>
                <id>8761551</id>
            </contributor>
            <comment>...</comment>
            <model>wikitext</model>
            <format>text/x-wiki</format>
            <text xml:space="preserve" bytes="41">#REDIRECT [[Zhuang Zhou]] {{R from move}}</text>
            <sha1>9l31fcd4fp0cfxgearifr7jrs3240xl</sha1></revision>
        <revision>...</revision>
        <revision>...</revision>
        <revision>...</revision>

    </page>
</mediawiki>

Outputs:

[('2014-05-26T20:08:14Z', 'White whirlwind'), ('2014-06-26T20:08:14Z', 'White whirlwind')]

//wiki:revision[.//wiki:username='White whirlwind'] finds every revision element that contains a username whose value is White whirlwind. It returns two revisions here because foobar does not match. From there you just need to extract the timestamp and username values from each of the filtered revisions in revs.

For your file on Google Drive, it returns:

[('2014-05-26T20:08:14Z', 'White whirlwind'),
 ('2014-05-26T20:12:49Z', 'White whirlwind'),
 ('2014-05-26T20:13:04Z', 'White whirlwind'),
 ('2014-05-31T21:14:15Z', 'White whirlwind'),
 ('2015-10-11T19:24:46Z', 'White whirlwind'),
 ('2015-10-11T19:26:31Z', 'White whirlwind')]

If you check your file, you will see that this is correct.
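The XPath version parses the whole file into a tree before filtering. If the dump turns out to be too large for that, one possible alternative (not taken from the answers here) is to stream it with iterparse and collect the timestamp and comment as each revision closes. The sketch below assumes the same export-0.10 namespace; stream_revisions is a hypothetical helper name, and the demo data and comment text are invented.

```python
import io

try:
    from lxml import etree
except ImportError:
    # The standard library's iterparse accepts the same arguments used here.
    import xml.etree.ElementTree as etree

WIKI = "{http://www.mediawiki.org/xml/export-0.10/}"


def stream_revisions(source, username):
    """Yield (timestamp, comment) for each revision made by `username`."""
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != WIKI + "revision":
            continue
        name = elem.findtext(WIKI + "contributor/" + WIKI + "username")
        if name == username:
            # comment is None when the revision has no <comment> element.
            yield elem.findtext(WIKI + "timestamp"), elem.findtext(WIKI + "comment")
        # Drop the revision's children once it has been handled.
        elem.clear()


# Invented demo data standing in for a real dump file.
demo = io.BytesIO(b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <revision>
      <timestamp>2014-05-26T20:08:14Z</timestamp>
      <contributor><username>White whirlwind</username></contributor>
      <comment>made-up comment</comment>
    </revision>
    <revision>
      <timestamp>2014-06-26T20:08:14Z</timestamp>
      <contributor><username>White whirlwind</username></contributor>
    </revision>
    <revision>
      <timestamp>2014-07-26T20:08:14Z</timestamp>
      <contributor><username>foobar</username></contributor>
      <comment>another made-up comment</comment>
    </revision>
  </page>
</mediawiki>""")

found = list(stream_revisions(demo, "White whirlwind"))
print(found)
```

Passing a file path or an open binary file instead of the BytesIO object should work the same way.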

Source: https://stackoverflow.com/questions/36333763/extract-data-from-xml-file-if-arguments-are-of-certain-values

Tags

python

xml

lxml