Question
What is the best way to select all the text between two tags, for example the text between all the 'pre' tags on a page?
Answer 1:
You can use "<pre>(.*?)</pre>" (replacing pre with whatever tag you want) and extract the first group. For more specific instructions, specify a language. Note that this assumes very simple and valid HTML.
As other commenters have suggested, if you're doing something complex, use an HTML parser.
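A minimal Python sketch of where the simple pattern works and where it stops working. The sample strings are made up for illustration; only the pattern comes from the answer.

```python
import re

PRE = re.compile(r"<pre>(.*?)</pre>")

# Simple, single-line, attribute-free markup: works as expected.
simple = "a <pre>one</pre> b <pre>two</pre>"
simple_matches = PRE.findall(simple)

# An attribute on the opening tag: the literal "<pre>" no longer matches.
with_attr = '<pre class="code">three</pre>'
attr_matches = PRE.findall(with_attr)

# Content spanning lines: "." does not match "\n" by default.
multiline = "<pre>four\nfive</pre>"
multiline_matches = PRE.findall(multiline)

print(simple_matches)     # ['one', 'two']
print(attr_matches)       # []
print(multiline_matches)  # []
```

The last two cases are the kind of input the "very simple HTML" caveat is about.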
Answer 2:
A tag can be closed on a later line, so \n needs to be added:
<PRE>(.|\n)*?<\/PRE>
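One thing to watch when using this pattern from code, sketched here in Python with a made-up sample string: the group (.|\n) is repeated, so after a match it only holds the last character it consumed. Wrapping a non-capturing alternation in a single capturing group is one way to keep the whole content.

```python
import re

html = "<PRE>line one\nline two</PRE>"

# As written in the answer: the match succeeds, but group 1 is a repeated
# group, so it only holds the last character it consumed.
as_written = re.search(r"<PRE>(.|\n)*?</PRE>", html)
last_char = as_written.group(1)

# One capturing group around a non-capturing alternation keeps everything.
wrapped = re.search(r"<PRE>((?:.|\n)*?)</PRE>", html)
content = wrapped.group(1)

print(repr(last_char))  # 'o'
print(repr(content))    # 'line one\nline two'
```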
Answer 3:
This is what I would use.
(?<=(<pre>))(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )+?(?=(</pre>))
Basically what it does is:
(?<=(<pre>))
The selection has to be preceded by the <pre> tag.
(\w|\d|\n|[().,\-:;@#$%^&*\[\]"'+–/\/®°⁰!?{}|`~]| )
This is the expression applied to the content itself. In this case it matches a letter, a digit, a newline, a space, or one of the special characters listed in the square brackets. The pipe character | simply means "OR".
+?
The plus sign matches one or more of the above, in any order. The question mark changes the default behavior from 'greedy' to 'ungreedy' (lazy).
(?=(</pre>))
The selection has to be followed by the </pre> tag.
Depending on your use case you might need to add some modifiers, such as i or m:
- i - case-insensitive matching
- m - multi-line mode (^ and $ match at every line break)
Here I performed this search in Sublime Text so I did not have to use modifiers in my regex.
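For comparison, a small Python sketch of what the two modifiers change. Python spells them re.IGNORECASE and re.MULTILINE; the simplified pattern and the sample text are assumptions for illustration, not the pattern from this answer.

```python
import re

text = "<PRE>Upper</PRE>\n<pre>lower</pre>"

# Without flags only the lowercase tag matches.
case_sensitive = re.findall(r"<pre>(.*?)</pre>", text)

# re.IGNORECASE (the "i" modifier) also matches <PRE>.
case_insensitive = re.findall(r"<pre>(.*?)</pre>", text, re.IGNORECASE)

# re.MULTILINE (the "m" modifier) lets ^ and $ match at every line,
# not only at the start and end of the whole string.
anchored_with_m = re.findall(
    r"^<pre>(.*?)</pre>$", text, re.IGNORECASE | re.MULTILINE
)
anchored_without_m = re.findall(r"^<pre>(.*?)</pre>$", text, re.IGNORECASE)

print(case_sensitive)      # ['lower']
print(case_insensitive)    # ['Upper', 'lower']
print(anchored_with_m)     # ['Upper', 'lower']
print(anchored_without_m)  # []
```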
JavaScript and lookbehind
The above example should work fine with languages such as PHP, Perl, Java ...
Older JavaScript engines, however, do not support lookbehind (it was added to the language in ES2018), so there we have to forget about using (?<=(<pre>)) and look for some kind of workaround. Perhaps simply match the opening tag as well and strip it (the first five characters, for <pre>) from each result, as in the question "Regex match text between tags".
Also look at the JavaScript regex documentation for non-capturing parentheses.
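Another workaround that avoids lookbehind entirely is to match the tags and read the content from a capturing group. A minimal Python sketch with a simplified pattern and made-up sample text, showing that both approaches return the same content:

```python
import re

text = "Lorem <pre>text 1</pre> ipsum <pre>text 2</pre>"

# Lookbehind and lookahead: the tags are asserted but not part of the match.
lookaround = [m.group(0) for m in re.finditer(r"(?<=<pre>).*?(?=</pre>)", text)]

# Capturing group: the tags are part of the match, group 1 holds the content.
captured = [m.group(1) for m in re.finditer(r"<pre>(.*?)</pre>", text)]

print(lookaround)  # ['text 1', 'text 2']
print(captured)    # ['text 1', 'text 2']
```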
Answer 4:
Use the pattern below to get the content of an element. Replace [tag] with the actual element you wish to extract the content from.
<[tag]>(.+?)</[tag]>
Sometimes tags have attributes, like an anchor tag with href. In that case use the pattern below.
<[tag][^>]*>(.+?)</[tag]>
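A sketch of how the second pattern could be built from a tag name in Python. The helper name between_tags and the sample markup are hypothetical. The \b after the tag name is an addition to the pattern above, so that "a" does not also match tags such as <abbr>.

```python
import re
from typing import List


def between_tags(html: str, tag: str) -> List[str]:
    """Return the text between each <tag ...> and </tag> (hypothetical helper)."""
    name = re.escape(tag)
    # \b stops "a" from also matching "<abbr>"; [^>]* skips any attributes.
    pattern = re.compile(
        rf"<{name}\b[^>]*>(.+?)</{name}>", re.DOTALL | re.IGNORECASE
    )
    return pattern.findall(html)


html = '<a href="https://example.com">Example</a> <abbr>HTML</abbr> <a>plain</a>'
links = between_tags(html, "a")
print(links)  # ['Example', 'plain']
```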
Answer 5:
You shouldn't be trying to parse HTML with regexes; see this question and how it turned out.
In the simplest terms, HTML is not a regular language, so you can't fully parse it with regular expressions.
Having said that, you can parse subsets of HTML when there are no similar tags nested. So as long as anything between <tag> and </tag> is not that tag itself, this will work:
preg_match("/<([\w]+)[^>]*>(.*?)<\/\1>/", $subject, $matches);
$matches = array ( [0] => full matched string [1] => tag name [2] => tag content )
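The same backreference pattern, sketched in Python with made-up input, including a nested case that shows the limitation described above:

```python
import re

# Group 1 is the tag name, \1 requires the closing tag to repeat it,
# group 2 is the content (same layout as $matches in the PHP call).
pattern = re.compile(r"<(\w+)[^>]*>(.*?)</\1>")

flat = "<b>bold</b> and <i class='x'>italic</i>"
flat_pairs = pattern.findall(flat)

# With the same tag nested inside itself, the lazy match stops at the
# first closing tag and returns a truncated result.
nested = "<div>a<div>b</div>c</div>"
nested_pairs = pattern.findall(nested)

print(flat_pairs)    # [('b', 'bold'), ('i', 'italic')]
print(nested_pairs)  # [('div', 'a<div>b')]
```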
A better idea is to use a parser, like the native DOMDocument, to load your HTML, then select your tag and read its content, which might look something like this:
$doc = new DOMDocument();
$doc->loadHTML($html);                      // parse an HTML string (load() expects a file path)
$nodes = $doc->getElementsByTagName('el');  // DOMNodeList of all <el> elements
$value = $nodes->item(0)->nodeValue;        // text content of the first match
And since this is a proper parser, it will be able to handle nested tags etc.
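For readers not using PHP, a comparable sketch with Python's standard-library html.parser. The class name TagTextCollector and the sample markup are hypothetical. It collects the text of each outermost matching element and tracks depth, so nested tags of the same name do not cut the result short.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every outermost <tag>...</tag> (hypothetical helper)."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many target tags are currently open
        self.buffer = []    # text pieces of the element being read
        self.results = []   # finished text, one entry per outermost element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.buffer))
                self.buffer = []

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


collector = TagTextCollector("div")
collector.feed("<div class='outer'>a<div>b</div>c</div><div>d</div>")
collector.close()
print(collector.results)  # ['abc', 'd']
```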
Answer 6:
To exclude the delimiting tags:
"(?<=<pre>)(.*?)(?=</pre>)"
Answer 7:
Try this:
(?<=\<any_tag\>)(\s*.*\s*)(?=\<\/any_tag\>)
Answer 8:
This seems to be the simplest regular expression of all that I found:
(?:<TAG>)([\s\S]*)(?:<\/TAG>)
- (?:<TAG>) matches the opening tag without capturing it
- ([\s\S]*) captures any whitespace or non-whitespace characters, including newlines
- (?:<\/TAG>) matches the closing tag without capturing it
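Note that [\s\S]* is greedy. When the input holds more than one <TAG> element, it runs from the first opening tag to the last closing tag. A short Python sketch with made-up input compares it with the lazy form [\s\S]*?:

```python
import re

text = "<TAG>one</TAG> middle <TAG>two\nlines</TAG>"

# Greedy: a single match that swallows everything up to the last </TAG>.
greedy = re.findall(r"(?:<TAG>)([\s\S]*)(?:</TAG>)", text)

# Lazy: one match per element.
lazy = re.findall(r"(?:<TAG>)([\s\S]*?)(?:</TAG>)", text)

print(greedy)  # ['one</TAG> middle <TAG>two\nlines']
print(lazy)    # ['one', 'two\nlines']
```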
Answer 9:
Since the accepted answer has no JavaScript code, adding that:
var str = "Lorem ipsum <pre>text 1</pre> Lorem ipsum <pre>text 2</pre>";
str.replace(/<pre>(.*?)<\/pre>/g, function(match, g1) { console.log(g1); });
Answer 10:
preg_match_all('/<pre>([^>]*?)<\/pre>/', $content, $matches);
This regex selects everything between the tags, even when the content spans several lines, as long as the content contains no > character.
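A quick Python check of how the negated class [^>] behaves, using made-up inputs: it matches newlines without any extra flag, but any > inside the content (a comparison operator, a nested tag) prevents a match.

```python
import re

pattern = re.compile(r"<pre>([^>]*?)</pre>")

# A negated character class matches "\n", so multi-line content is fine.
multiline = pattern.findall("<pre>line one\nline two</pre>")

# A ">" anywhere in the content prevents the match.
with_gt = pattern.findall("<pre>if a > b</pre>")
with_tag = pattern.findall("<pre>some <b>bold</b> text</pre>")

print(multiline)  # ['line one\nline two']
print(with_gt)    # []
print(with_tag)   # []
```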
Answer 11:
For multiple lines:
<htmltag>(.+)((\s)+(.+))+</htmltag>
Answer 12:
You can use Pattern pattern = Pattern.compile("<tagname>(.*?)</tagname>", Pattern.DOTALL);
Answer 13:
I use this solution:
preg_match_all( '/<((?!<)(.|\n))*?\>/si', $content, $new);
var_dump($new);
Answer 14:
In Python, setting the DOTALL flag will capture everything, including newlines.
"If the DOTALL flag has been specified, this matches any character including a newline." (docs.python.org)
# example.py using Python 3.7.4
import re
# "text" rather than "str", so the built-in str type is not shadowed
text = """Everything is awesome! <pre>Hello,
World!
</pre>
"""
# Normally (.*) will not capture newlines, but here re.DOTALL is set
pattern = re.compile(r"<pre>(.*)</pre>", re.DOTALL)
match = pattern.search(text)
print(match.group(1))
python example.py
Hello,
World!
Capturing text between all opening and closing tags in a document
To capture text between all opening and closing tags in a document, finditer
is useful. In the example below, three opening and closing <pre>
tags are present in the string.
# example2.py using Python 3.7.4
import re
# text contains three <pre>...</pre> tags
text = """In two different ex-
periments, the authors had subjects chat and solve the <pre>Desert Survival Problem</pre> with a
humorous or non-humorous computer. In both experiments the computer made pre-
programmed comments, but in study 1 subjects were led to believe they were interact-
ing with another person. In the <pre>humor conditions</pre> subjects received a number of funny
comments, for instance: “The mirror is probably too small to be used as a signaling
device to alert rescue teams to your location. Rank it lower. (On the other hand, it
offers <pre>endless opportunity for self-reflection</pre>)”."""
# Normally (.*) will not capture newlines, but here re.DOTALL is set
# The question mark in (.*?) indicates non-greedy matching.
pattern = re.compile(r"<pre>(.*?)</pre>", re.DOTALL)
matches = pattern.finditer(text)
for i, match in enumerate(matches):
    print(f"tag {i}: {match.group(1)}")
python example2.py
tag 0: Desert Survival Problem
tag 1: humor conditions
tag 2: endless opportunity for self-reflection
Answer 15:
<pre>([\r\n\s]*(?!<\w+.*[\/]*>).*[\r\n\s]*|\s*[\r\n\s]*)<code\s+(?:class="(\w+|\w+\s*.+)")>(((?!<\/code>)[\s\S])*)<\/code>[\r\n\s]*((?!<\w+.*[\/]*>).*|\s*)[\r\n\s]*<\/pre>
Source: https://stackoverflow.com/questions/7167279/regex-select-all-text-between-tags