Question

I've crafted this regular expression:

<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>

to parse the following RSS Feed:

<?xml version="1.0" encoding="UTF-8"?>\n<feed version="0.3" xmlns="http://purl.org/atom/ns#">\n<title>Gmail - Inbox for g.bargelli@gmail.com</title>\n<tagline>New messages in your Gmail Inbox</tagline>\n<fullcount>2</fullcount>\n<link rel="alternate" href="http://mail.google.com/mail" type="text/html" />\n<modified>2011-03-15T11:07:48Z</modified>\n<entry>\n<title>con due mail...</title>\n<summary>Gianluca Bargelli http://about.me/proudlygeek/bio</summary>\n<link rel="alternate" href="http://mail.google.com/mail?account_id=g.bargelli@gmail.com&amp;message_id=12eb9332c2c1fa27&amp;view=conv&amp;extsrc=atom" type="text/html" />\n<modified>2011-03-15T11:07:42Z</modified>\n<issued>2011-03-15T11:07:42Z</issued>\n<id>tag:gmail.google.com,2004:1363345158434847271</id>\n<author>\n<name>me</name>\n<email>g.bargelli@gmail.com</email>\n</author>\n</entry>\n<entry>\n<title>test nuova mail</title>\n<summary>Gianluca Bargelli sono tornato!?& http://about.me/proudlygeek/bio</summary>\n<link rel="alternate" href="http://mail.google.com/mail?account_id=g.bargelli@gmail.com&amp;message_id=12eb93140d9f7627&amp;view=conv&amp;extsrc=atom" type="text/html" />\n<modified>2011-03-15T11:05:36Z</modified>\n<issued>2011-03-15T11:05:36Z</issued>\n<id>tag:gmail.google.com,2004:1363345026546890279</id>\n<author>\n<name>me</name>\n<email>g.bargelli@gmail.com</email>\n</author>\n</entry>\n</feed>\n

The problem is that I am not getting any matches when using Python's re module:

import re

regex = re.compile("""<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>""")
regex.findall(rss_string) # Returns an empty list

In an online regex tester (such as this one) the pattern works as expected, so I don't think the regex itself is the problem.

Edit

I am well aware that using regular expressions to parse a context-free grammar is BAD, but in my case the regular expression only has to work for this one RSS feed (a Gmail inbox feed, by the way), and I know I could use an external library or XML parser for this task. It is only an exercise, not a habit.

The question should be: why doesn't this regular expression work as expected in Python?

Answer 1:

Before the regex compiler sees the string, Python has already processed the backslash escapes, so you would have to escape everything twice (e.g. \\\\n for \\n). However, Python has a handy notation for exactly this sort of thing: just put an r before the string to make it a raw string literal:

regex = re.compile(r"""<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>""")
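To make the difference concrete, here is a minimal sketch. It assumes, as the pattern in the question implies, that the feed text contains a literal backslash followed by the letter n (two characters) rather than real newlines. The `sample` string is a made-up, shortened stand-in for the real feed.

```python
import re

# Hypothetical stand-in for the feed: it contains a literal backslash
# followed by "n" (two characters), not real newline characters.
sample = r"<entry>\n<title>hello</title>\n</entry>"

# Non-raw literal: Python collapses "\\n" to the two characters \n before
# the re module sees it, and the regex engine reads \n as "match a newline".
plain = re.compile("<entry>\\n<(\\w+)>(.+?)</\\w+>\\n</entry>")

# Raw literal: the regex engine receives \\n, i.e. "match a backslash, then n".
raw = re.compile(r"<entry>\\n<(\w+)>(.+?)</\w+>\\n</entry>")

plain_matches = plain.findall(sample)
raw_matches = raw.findall(sample)

print(plain_matches)  # []
print(raw_matches)    # [('title', 'hello')]
```

If the feed actually contained real newline characters, the situation would be reversed, so it is worth checking `repr(rss_string)` first.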

By the way, I agree with the others here: do not use regexes to parse XML. Still, hopefully you will find this string notation helpful in future regular expressions.

Answer 2:

You shouldn't parse XML with regex; use the Universal Feed Parser for Python instead. Using this library rather than a regex will make your life easier, and it has been battle-tested.

I have personally used this library many times; it works like a charm.
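For a dependency-free illustration of the same idea (a real parser instead of a regex), here is a rough sketch using the standard library's xml.etree.ElementTree rather than the Universal Feed Parser itself. The `feed_xml` string is a made-up, trimmed-down feed with real newlines, modeled on the one in the question. The `atom` prefix is just a local alias chosen for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened feed modeled on the Gmail Atom feed in the question.
feed_xml = """<feed version="0.3" xmlns="http://purl.org/atom/ns#">
<title>Inbox</title>
<entry>
<title>first message</title>
<summary>hello</summary>
</entry>
<entry>
<title>second message</title>
<summary>world</summary>
</entry>
</feed>
"""

# The feed declares a default namespace, so element lookups must use it.
# "atom" is an arbitrary local prefix, not something the feed defines.
ns = {"atom": "http://purl.org/atom/ns#"}

root = ET.fromstring(feed_xml)

# Collect (title, summary) pairs for every <entry> element.
entries = [
    (
        entry.findtext("atom:title", namespaces=ns),
        entry.findtext("atom:summary", namespaces=ns),
    )
    for entry in root.findall("atom:entry", ns)
]

print(entries)  # [('first message', 'hello'), ('second message', 'world')]
```

A strict XML parser like this one rejects input that is not well formed, which is one reason a dedicated feed library can be the more forgiving choice for real-world feeds.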

Answer 3:

DON'T PARSE XML/HTML WITH REGEX!

Use one of the following:

BeautifulSoup
lxml
pyquery

Enjoy!

EDIT: Oh yeah it's RSS. What the other people said... I'll be here all week.

Answer 4:

Do not try to reinvent the wheel or play the smart RSS parser guy. Reuse existing modules: http://www.feedparser.org/

Source: https://stackoverflow.com/questions/5319571/python-regex-doesnt-work-as-expected

Tags

python

regex

rss
