How to get value of specified tag attribute from XML using regexp + Python?

Submitted by 六月ゝ 毕业季﹏ on 2020-06-09 07:13:09

Question


I have a script that parses some XML. The XML contains:

<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<ADDR STREET="Pmb#400, San Pablo Ave" CITY="Berkeley" STATE="CA" COUNTRY="US"/>
<CREATED DATE="13-Oct-1990" DAY="13" MONTH="10" YEAR="1990"/>
<OWNER NAME="9511.Org Domain Name Proxy Agents"/>
<EMAIL ADDR="proxy@9511.org"/><LANG LEX="en" CODE="us-ascii"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
<CHILD SRATING="0"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>

How can I get the value of the 'TEXT' attribute of the POPULARITY tag (in my case 1417678)? I'm using regexp + Python. Regexp string:

my_value = re.findall("POPULARITY[^\d]*(\d+)", xml)

It returns '9511', but I need '1417678'.


Answer 1:


Your pattern simply matches the first sequence of decimal digits that occurs after the element's name. The first run of digits '(\d+)' after an arbitrary number of non-digits '[^\d]*' is 9511, which comes from the URL attribute.
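A minimal sketch that reproduces this behaviour. It assumes a shortened copy of the question's XML held in a variable named xml; the trimming is only for brevity.

```python
import re

# Shortened copy of the XML from the question (trimmed for brevity).
xml = '''<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>'''

# [^\d]* consumes ' URL="' and stops at the first digit it meets,
# which belongs to the URL attribute rather than to TEXT.
first_digits = re.findall(r"POPULARITY[^\d]*(\d+)", xml)
print(first_digits)  # ['9511']
```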

To find all values of @TEXT attributes, something like this would work:

my_values = re.findall(r'<POPULARITY(?:\D+="\S*")*\s+TEXT="(\d*)"', xml)  # findall returns a list

Or, if no attribute other than @TEXT has a digits-only value:

my_values = re.findall(r'<POPULARITY\s+(?:\S+\s+)*\w+="(\d+)"', xml)

Here (?:...) matches the enclosed expression but, unlike (...), does not create a capturing group. The special sequences \S and \D are the inverses of their lowercase counterparts: they match anything but whitespace and anything but digits, respectively.
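As a rough illustration, the two patterns can be tried on just the POPULARITY part of the sample. The variable names here are made up for the example, and the last call swaps (?:...) for a plain (...) only to show how the result shape changes.

```python
import re

# Only the second <SD> block of the sample, enough to exercise the patterns.
xml = '<SD>\n<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>\n</SD>'

# Pattern 1: skip the attributes in front of TEXT, then capture TEXT's digits.
by_name = re.findall(r'<POPULARITY(?:\D+="\S*")*\s+TEXT="(\d*)"', xml)

# Pattern 2: capture an attribute value that consists of digits only.
digits_only = re.findall(r'<POPULARITY\s+(?:\S+\s+)*\w+="(\d+)"', xml)

# Same as pattern 1, but with a capturing group instead of (?:...).
# With two groups, findall returns one tuple per match instead of a string.
capturing = re.findall(r'<POPULARITY(\D+="\S*")*\s+TEXT="(\d*)"', xml)

print(by_name)       # ['1417678']
print(digits_only)   # ['1417678']
print(capturing[0][1])
```

With the non-capturing form, findall yields plain strings, which is usually what you want when only one value is of interest.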

However, as already mentioned, regular expressions are not meant to be used on XML, because XML is not a regular language.
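For comparison, here is a sketch of a parser-based approach using the standard library's xml.etree.ElementTree. This is not part of the original answer. Because the sample has two top-level SD elements, it is wrapped in a hypothetical ROOT element so that it parses as one well-formed document. The variable name sample and the shortened XML are also assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the XML from the question (trimmed for brevity).
sample = '''<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>'''

# Two top-level <SD> elements are not a well-formed document on their own,
# so wrap them in a hypothetical <ROOT> element before parsing.
root = ET.fromstring("<ROOT>" + sample + "</ROOT>")

# Collect the TEXT attribute of every POPULARITY element, at any depth.
popularity_texts = [el.get("TEXT") for el in root.iter("POPULARITY")]
print(popularity_texts)  # ['1417678']
```

Unlike the regular expressions above, this does not depend on attribute order or on which other attributes happen to contain digits.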




Answer 2:


You can use BeautifulSoup (shown here with the bs4 package and Python's built-in html.parser):

from bs4 import BeautifulSoup

xml = '''<SD TITLE="A" FLAGS="" HOST="9511.com">
<TITLE TEXT="9511 domain"/>
<ADDR STREET="Pmb#400, San Pablo Ave" CITY="Berkeley" STATE="CA" COUNTRY="US"/>
<CREATED DATE="13-Oct-1990" DAY="13" MONTH="10" YEAR="1990"/>
<OWNER NAME="9511.Org Domain Name Proxy Agents"/>
<EMAIL ADDR="proxy@9511.org"/><LANG LEX="en" CODE="us-ascii"/>
<LINKSIN NUM="75"/><SPEED TEXT="3158" PCT="17"/>
<CHILD SRATING="0"/>
</SD>
<SD>
<POPULARITY URL="9511.com/" TEXT="1417678" SOURCE="panel"/>
</SD>'''

# html.parser lowercases tag and attribute names, hence 'popularity' and 'text'.
soup = BeautifulSoup(xml, 'html.parser')

print(soup.find('popularity')['text'])

Output

1417678


Source: https://stackoverflow.com/questions/15129986/how-to-get-value-of-specified-tag-attribute-from-xml-using-regexp-python
