RegEx for extracting HTML Image properties

不想你离开。 Submitted on 2019-12-02 08:56:41

As we all know, there is a lot of malformed HTML out there, so the pattern has to cover those possibilities.

It won't. Use an HTML parser if you have to parse "evil" HTML (from an unknown source).

If performance is not a big concern, I'd go with an HTML parser (like BeautifulSoup in Python) if you are doing this server-side, or jQuery or just plain JavaScript if you are doing it client-side. Granted, it is overkill, but it is a lot quicker to write, less likely to have bugs (since the parser authors have already thought of the corner cases), and it will handle the potential malformedness.
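As a rough sketch of the parser route in Python: the example below uses the standard library's html.parser rather than BeautifulSoup, purely so it runs without installing anything. The class name ImgAttributeCollector, the helper extract_img_attributes and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class ImgAttributeCollector(HTMLParser):
    """Collect the attributes of every <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names;
        # attrs is a list of (name, value) pairs with quotes already removed.
        if tag == "img":
            self.images.append(dict(attrs))


def extract_img_attributes(html):
    """Hypothetical helper: return one attribute dict per <img> tag."""
    parser = ImgAttributeCollector()
    parser.feed(html)
    parser.close()
    return parser.images


# Mixed case, mixed quoting, an unquoted value and a self-closing tag.
sample = '<p><IMG SRC="a.png" alt=\'A "quoted" title\' width=100><img src=b.jpg /></p>'
images = extract_img_attributes(sample)
for attributes in images:
    print(attributes)
```

The parser copes with the uppercase tag, the single-quoted value containing double quotes, and the unquoted values without any special handling on your side, which is the main argument for it over a pattern.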

Your best bet is to use something like HTML Agility Pack instead of regex. It's designed to handle a lot of cases and can save you more than a few headaches from hammering out edge cases.

If you want all attribute values, might I suggest using the DOM? Something like element.attributes will work well.

If you insist on a regex, /\b\w+="[^"]+"/ should get every attribute whose value is double-quoted and non-empty.
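To see what that pattern does and does not pick up, here is a small Python check. The sample tag is invented for the purpose; only the pattern itself comes from the answer.

```python
import re

# The attribute pattern from the answer, as a Python raw string.
ATTR_PATTERN = re.compile(r'\b\w+="[^"]+"')

# Hypothetical tag mixing the common attribute styles.
tag = '<img src="a.png" alt="" data-id="7" title=\'x\' width=100>'
matches = ATTR_PATTERN.findall(tag)
print(matches)  # ['src="a.png"', 'id="7"']
```

The empty alt, the single-quoted title and the unquoted width are all skipped, and data-id is truncated to id because \w does not match a hyphen. Whether that is acceptable depends on how much control you have over the input.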

ProfK

Before committing yourself to regex, see what it can do: "RegEx match open tags except XHTML self-contained tags"

/<img(\s+([a-z]{3,})=(["']([^"']*)["']|[\S]))+\s*\/?>/i

A match_all on this will return (the format depends on your library, but the key indexes are):

0 -> image tag
1 -> attribute
2 -> attribute name
3 -> attribute value (with enclosing quotes, if present)
4 -> attribute value (without enclosing quotes; empty if the value is unquoted, in which case use 3)
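For what it's worth, here is how those indexes come out in Python's re module, as a sketch rather than a drop-in solution. One thing to be aware of: in Python a repeated group only keeps its last repetition, so groups 1 to 4 of a whole-tag match describe the final attribute only. The second pattern (ATTRIBUTE), the helper img_attributes and the sample HTML are assumptions added for illustration; ATTRIBUTE is simply the repeated group on its own, re-applied to each matched tag.

```python
import re

# The <img> pattern from the answer; a Python raw string needs no / delimiters.
IMG_TAG = re.compile(
    r'''<img(\s+([a-z]{3,})=(["']([^"']*)["']|[\S]))+\s*/?>''', re.IGNORECASE
)
# The repeated group by itself, so group numbers 1-4 keep the same meaning.
ATTRIBUTE = re.compile(
    r'''(\s+([a-z]{3,})=(["']([^"']*)["']|[\S]))''', re.IGNORECASE
)

html = '''<p><img src="a.png" alt='Logo'> text <IMG SRC="b.jpg" width=5></p>'''

# Groups of a whole-tag match: only the last attribute survives.
first = IMG_TAG.search(html)
last_attribute = (first.group(1), first.group(2), first.group(3), first.group(4))
print(last_attribute)  # (" alt='Logo'", 'alt', "'Logo'", 'Logo')


def img_attributes(text):
    """Return one {name: value} dict per <img> tag matched by IMG_TAG."""
    result = []
    for tag in IMG_TAG.finditer(text):
        attrs = {}
        for m in ATTRIBUTE.finditer(tag.group(0)):
            # Group 4 is the value without quotes; it is None for an
            # unquoted value, so fall back to group 3.
            value = m.group(4) if m.group(4) is not None else m.group(3)
            attrs[m.group(2).lower()] = value
        result.append(attrs)
    return result


parsed = img_attributes(html)
print(parsed)
```

Note that [\S] matches a single character, so a tag with a longer unquoted value such as width=100 is not matched at all by this pattern, and attribute names shorter than three letters or containing a hyphen fail the same way.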