Python Regex doesn't work as expected

Submitted by 自闭症网瘾萝莉.ら on 2019-12-31 04:55:06

Question


I've crafted this regular expression:

<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>

to parse the following RSS Feed:

<?xml version="1.0" encoding="UTF-8"?>\n<feed version="0.3" xmlns="http://purl.org/atom/ns#">\n<title>Gmail - Inbox for g.bargelli@gmail.com</title>\n<tagline>New messages in your Gmail Inbox</tagline>\n<fullcount>2</fullcount>\n<link rel="alternate" href="http://mail.google.com/mail" type="text/html" />\n<modified>2011-03-15T11:07:48Z</modified>\n<entry>\n<title>con due mail...</title>\n<summary>Gianluca Bargelli http://about.me/proudlygeek/bio</summary>\n<link rel="alternate" href="http://mail.google.com/mail?account_id=g.bargelli@gmail.com&amp;message_id=12eb9332c2c1fa27&amp;view=conv&amp;extsrc=atom" type="text/html" />\n<modified>2011-03-15T11:07:42Z</modified>\n<issued>2011-03-15T11:07:42Z</issued>\n<id>tag:gmail.google.com,2004:1363345158434847271</id>\n<author>\n<name>me</name>\n<email>g.bargelli@gmail.com</email>\n</author>\n</entry>\n<entry>\n<title>test nuova mail</title>\n<summary>Gianluca Bargelli sono tornato!?& http://about.me/proudlygeek/bio</summary>\n<link rel="alternate" href="http://mail.google.com/mail?account_id=g.bargelli@gmail.com&amp;message_id=12eb93140d9f7627&amp;view=conv&amp;extsrc=atom" type="text/html" />\n<modified>2011-03-15T11:05:36Z</modified>\n<issued>2011-03-15T11:05:36Z</issued>\n<id>tag:gmail.google.com,2004:1363345026546890279</id>\n<author>\n<name>me</name>\n<email>g.bargelli@gmail.com</email>\n</author>\n</entry>\n</feed>\n

The problem is that I am not getting any matches when using Python's re module:

import re

regex = re.compile("""<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>""")
regex.findall(rss_string) # Returns an empty list

Testing the same expression in an online regex tester works as expected, so I don't think it is a problem with the regex itself.

Edit

I am well aware that using regular expressions to parse a context-free grammar is BAD, but in my case the regular expression is only meant to work for that one RSS feed (it is a Gmail inbox feed, by the way), and I know I can use an external library or XML parser for this task. It is only an exercise, not a habit.

The question should be: why doesn't this regular expression work as expected in Python?


Answer 1:


Before the regex compiler sees the string, Python has already processed the backslash escapes, so you would have to escape each backslash twice (e.g. \\\\n for \\n). However, Python has a handy notation for exactly this sort of thing, the raw string: just put an r before the opening quote:

regex = re.compile(r"""<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>""")

By the way, I agree with the others here: do not use regexes to parse XML. Still, I hope you will find this string notation helpful in future regular expressions.
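To see what the r prefix changes, the sketch below prints the pattern text that actually reaches the regex engine in both cases and runs each against two short samples. The samples are hypothetical stand-ins for the feed: one assumes the text holds a literal backslash followed by n (as the feed dump in the question suggests), the other holds real newline characters.

```python
import re

# Hypothetical samples, not the real feed.
# literal_sample contains the two characters backslash + n between tags.
literal_sample = r"<entry>\n<title>test nuova mail</title>\n</entry>"
# newline_sample contains real newline characters between tags.
newline_sample = "<entry>\n<title>test nuova mail</title>\n</entry>"

# Non-raw string: Python collapses each "\\" to a single backslash first,
# so the regex engine receives \n and looks for a newline character.
plain = "<entry>\\n<(\\w+)>(.+?)</\\w+>\\n</entry>"

# Raw string: the regex engine receives \\n and looks for a literal
# backslash followed by the letter n.
raw = r"<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>"

print(plain)  # pattern text as the regex engine sees it (non-raw)
print(raw)    # pattern text as the regex engine sees it (raw)

plain_on_literal = re.findall(plain, literal_sample)
raw_on_literal = re.findall(raw, literal_sample)
plain_on_newline = re.findall(plain, newline_sample)
raw_on_newline = re.findall(raw, newline_sample)

print(plain_on_literal, raw_on_literal)
print(plain_on_newline, raw_on_newline)
```

Which of the two patterns is the right one therefore depends on what the feed string really contains; printing repr(rss_string) is a quick way to check.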




Answer 2:


You shouldn't parse XML with regex; use the Universal Feed Parser for Python instead. Using this library rather than a regex will make your life easier, and it has been battle-tested for correctness.

I have personally used this library many times, and it works like a charm.




Answer 3:


DON'T PARSE XML/HTML WITH REGEX!

Use one of the following:

  • BeautifulSoup
  • lxml
  • pyquery

Enjoy!

EDIT: Oh yeah it's RSS. What the other people said... I'll be here all week.
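As a rough illustration of the parser-based approach, here is a minimal sketch that uses the standard library's xml.etree.ElementTree rather than any of the libraries listed above (a deliberate substitution, chosen only so the example runs without extra installs). The feed text is a hypothetical, shortened version of the one in the question, and the variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened feed modeled on the one in the question.
feed_xml = """<?xml version="1.0" encoding="UTF-8"?>
<feed version="0.3" xmlns="http://purl.org/atom/ns#">
<title>Gmail - Inbox</title>
<entry>
<title>con due mail...</title>
<summary>first summary</summary>
</entry>
<entry>
<title>test nuova mail</title>
<summary>second summary</summary>
</entry>
</feed>"""

# The feed declares a default namespace, so element lookups must use it.
ATOM = {"atom": "http://purl.org/atom/ns#"}

# Parse from bytes so the encoding declaration in the prolog is honored.
root = ET.fromstring(feed_xml.encode("utf-8"))

# Collect (title, summary) for every <entry> element.
entries = [
    (
        entry.findtext("atom:title", namespaces=ATOM),
        entry.findtext("atom:summary", namespaces=ATOM),
    )
    for entry in root.findall("atom:entry", ATOM)
]

for title, summary in entries:
    print(title, "|", summary)
```

A strict XML parser like this one raises a ParseError on input that is not well-formed, for instance a bare & inside a summary, so a more forgiving feed library may still be the better fit for real-world feeds.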




Answer 4:


Do not try to reinvent the wheel or play the smart RSS parser guy. Reuse existing modules: http://www.feedparser.org/



Source: https://stackoverflow.com/questions/5319571/python-regex-doesnt-work-as-expected
