regex to match html tags with specific attributes

Submitted by 安稳与你 on 2019-12-20 17:34:13

Question


I am trying to match all HTML tags that do not have the attribute "term" or "range".

Here is a sample of the HTML:

<span class="inline prewrap strong">DATE:</span>    12/01/10
<span class="inline prewrap strong">MR:</span>  1234567
<span class="inline prewrap strong">DOB:</span> 12/01/65
<span class="inline prewrap strong">HISTORY OF PRESENT ILLNESS:</span>  Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum

<span class="inline prewrap strong">MEDICATIONS:</span>  <span term="Advil" range="true">Advil </span>and Ibuprofen.

My regex is: <(.*?)((?!\bterm\b).)>

Unfortunately this matches all the tags. It would be nice if the inner text weren't matched, as I need to filter out all the tags except the ones with that specific attribute.
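
A quick way to see why the pattern above matches every tag: the negative lookahead only guards the single character immediately before the closing >, so it never inspects the attribute list. The sketch below is a Python check of that behaviour; the variable names are made up for illustration and the sample is the MEDICATIONS line from the question.

```python
import re

# The pattern from the question, unchanged.
question_pattern = re.compile(r'<(.*?)((?!\bterm\b).)>')

sample = (
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)

# group(0) is the whole match; findall would return the capture groups instead.
matched_tags = [m.group(0) for m in question_pattern.finditer(sample)]
for tag in matched_tags:
    print(tag)
```

All four tags in the sample are printed, including the one that carries term and range.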


Answer 1:


If regex is your thing for this, this works for me. (Note: filtering out comments, doctype and other entities is not included.
Other warnings: tags could be embedded in scripts, comments and other things.)

span tag (w/ attr) no term|range attrs

'<span
  (?=\s)
  (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
  \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
>'

any tag (w/ attr) no term|range attrs

'<[A-Za-z_:][\w:.-]*
  (?=\s)
  (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
  \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
>'

any tag (w/ or w/o attr) no term|range attrs

'<
  (?:
    [A-Za-z_:][\w:.-]*
    (?=\s)
    (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
    \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
  |
    /?[A-Za-z_:][\w:.-]*\s*/?
  )
>'

Update

Alternative to using the (?>) construct
The regexes below are for no-'term|range'-attributes
Flags = (g) global and (s) dotall

span tag w/attr
link: http://regexr.com?2vrjr
regex: <span(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>

any tag w/attr
link: http://regexr.com?2vrju
regex: <[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>

any tag w/attr or wo/attr
link: http://regexr.com?2vrk1
regex: <(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>

'to match every tag except the ones that have term="occasionally"'

link: http://regexr.com?2vrka
regex: <(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)term\s*=\s*(["'])\s*occasionally\s*\1)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>
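
For readers who want to try these outside regexr, here is a minimal Python sketch that applies the "any tag w/attr or wo/attr" pattern from the Update to the MEDICATIONS line of the question. The pattern text is copied from the answer; the names TAG_WITHOUT_TERM_OR_RANGE, sample and matches are assumptions made for this example, and re.DOTALL with findall stands in for the (s) and (g) flags.

```python
import re

# "any tag w/attr or wo/attr" pattern from the Update, split across
# raw strings only for readability; the regex itself is unchanged.
TAG_WITHOUT_TERM_OR_RANGE = re.compile(
    r"""<(?:[A-Za-z_:][\w:.-]*(?=\s)"""
    r"""(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)"""
    r"""(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+"""
    r"""|/?[A-Za-z_:][\w:.-]*\s*/?)>""",
    re.DOTALL,  # the (s) flag; findall below plays the role of (g)
)

sample = (
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)

# The pattern has no capturing groups, so findall returns whole matches.
matches = TAG_WITHOUT_TERM_OR_RANGE.findall(sample)
for tag in matches:
    print(tag)
```

The opening tag that carries term and range should be absent from the result, while the class-only opening tag and both closing tags are kept.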




Answer 2:


I think you should use an HTML parser to solve this problem. Writing your own regular expression is possible but almost certainly error-prone. Imagine that your markup contains an expression such as:

< span      class = "a"              >b< / span         >

Whitespace around = and before > is legal in HTML, and accounting for every possible space and tab character in your regular expression would not be easy; it would need testing before you could be sure it works as expected.
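
As a rough illustration of the parser approach, the sketch below uses Python's standard html.parser module to collect start tags that carry neither term nor range. The class name TagCollector and the helper tags_without are hypothetical names chosen for this example, not part of any library.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects start tags that carry none of the excluded attributes."""

    def __init__(self, excluded=("term", "range")):
        super().__init__()
        self.excluded = set(excluded)
        self.kept = []  # list of (tag_name, attribute_dict)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased.
        names = {name for name, _ in attrs}
        if not names & self.excluded:
            self.kept.append((tag, dict(attrs)))


def tags_without(html, excluded=("term", "range")):
    """Return the start tags in html that have none of the excluded attributes."""
    collector = TagCollector(excluded)
    collector.feed(html)
    collector.close()
    return collector.kept


sample = (
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)
print(tags_without(sample))
```

Because the parser tokenizes attributes itself, spacing such as class = "a" or trailing whitespace before > needs no special handling in the filter.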




Answer 3:


This will do what you want. It is written for a Perl program, and the format may differ depending on what language you are using.

/(?! [^>]+ \b(?:term|range)= ) (<[a-z]+.*?>) /igx

The code below demonstrates this pattern in a Perl program:

use strict;
use warnings;

my $pattern = qr/ (?! [^>]+ \b(?:term|range)= ) (<[a-z]+.*?>) /ix;

my $str = <<'END';

<span class="inline prewrap strong">DATE:</span>    12/01/10
<span class="inline prewrap strong">MR:</span>  1234567
<span class="inline prewrap strong">DOB:</span> 12/01/65
<span class="inline prewrap strong">HISTORY OF PRESENT ILLNESS:</span>  Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum

<span class="inline prewrap strong">MEDICATIONS:</span>  <span term="Advil" range="true">Advil </span>and Ibuprofen.

END

print "$_\n" foreach $str =~ /$pattern/g;

OUTPUT

<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
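
Since the answer notes that the format may differ by language, here is a rough Python equivalent, assuming the attribute names from the question (term and range). The names opening_tag_pattern, sample and found are made up for this sketch; re.VERBOSE and re.IGNORECASE correspond to Perl's /x and /i, and findall replaces the /g match in list context.

```python
import re

# Same structure as the Perl pattern: a negative lookahead that scans up to
# the next '>' for term= or range=, then a capture of the opening tag.
opening_tag_pattern = re.compile(
    r"(?! [^>]+ \b(?:term|range)= ) (<[a-z]+.*?>)",
    re.VERBOSE | re.IGNORECASE,
)

sample = (
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.'
)

# With one capturing group, findall returns the captured tag text.
found = opening_tag_pattern.findall(sample)
for tag in found:
    print(tag)
```

Closing tags are skipped because the pattern requires a letter right after <, which matches the Perl output above.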



Answer 4:


<\w+\s+(?!term).*?>(.*?)</.*?>



Answer 5:


I think this regex will work properly.

It matches the opening tag of any HTML element that has a style attribute directly after the tag name.

<\s*\w*\s*style.*?>

You can check this on https://regex101.com



Source: https://stackoverflow.com/questions/9008430/regex-to-match-html-tags-with-specific-attributes
