REGEX for only data and end tag

Submitted by 不问归期 on 2019-12-13 04:56:29

Question


I am looking for a regex that will give me the data along with the end tag.

e.g.

input:
-----------------
<p>ABC<p>
-----------------
Output would be
-----------------
ABC<p>

-----------------

It should remove only the first paragraph tag, not the second one, and all the text in between should stay the same.

I want to point out that I am looking for

<p>ABC<p>

and not

<p>ABC</p>

This is for a specific text file that has irregular tags.

Example:

I have a big XHTML file like this:

<p>scet</p>
<p>sunny </p>
<p>             <!--this tag is to be removed -->
<p>              <!--this tag is to be removed -->
<p>mark</p>
<p>Thomas </p>

It's a complete XHTML file with head, body, and the other usual tags. The only problem is the extra tags. I am expecting output like this:

<p>scet</p>
<p>sunny </p>

<p>mark</p>
<p>Thomas </p>

Answer 1:


One possibility: Use an xhtml parser that fixes malformed xhtml. One such library is libxml2. Then use the library to locate and remove empty p tags.




Answer 2:


Inspired by this excellent post:

<(?<open>.+?)>
(?>
    <(?<open>.+?)> (?<DEPTH>)
  |
    </\k<open>> (?<-DEPTH>)
  |
    .?
)*
(?(DEPTH)(?!))
</\k<open>>

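The pattern relies on .NET balancing groups and is written with free spacing, so it needs the .NET regex engine with RegexOptions.IgnorePatternWhitespace. It will extract only the correctly matching tags, but not the self-closed ones; it will also do a basic nesting check, but not much else: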
input:

<p>scet</p>
<p>sunny </p>
incorrect
<p>
<p>
<pre>mark</pre>
<p>Thomas </s>
<a>asd</a>
<p/>
<p><a>this should match</a></p>
<p><a>should not match</p></a>

output:

<p>scet</p>
<p>sunny </p>
<a>asd</a>
<p><a>this should match</a></p>

Each line of output is one match. However, tags containing attributes will of course not be included. A regular expression that would handle more cases correctly would be truly horrifying to look at, even with the nice formatting showcased in the blog I linked to :)

In these cases (especially since I gather you need valid XHTML output) I would always recommend running the input through a specialized parser, preferably one that reports parsing errors nicely, and handling those errors instead of hacking regular expressions. I don't know any good (X)HTML parsers though; I haven't needed to do something like that in a very long time.
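As a rough illustration of the "let a parser report the errors" idea, here is a minimal sketch that uses the XmlReader class from System.Xml as a stand-in for a specialized parser. The class and method names are made up for this example. It assumes the file is meant to be well-formed XML and that the DOCTYPE can simply be skipped; with the DTD ignored, named entities such as &nbsp; would likely be reported as errors too.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper, not part of any answer above.
public static class XhtmlWellFormednessCheck
{
    // Returns true and fills 'error' with the first problem the parser
    // reports; returns false when the input reads as well-formed XML.
    public static bool TryFindError(string xhtml, out string error)
    {
        var settings = new XmlReaderSettings
        {
            // Assumption: skip the DOCTYPE rather than reject or resolve it.
            DtdProcessing = DtdProcessing.Ignore
        };

        try
        {
            using (var reader = XmlReader.Create(new StringReader(xhtml), settings))
            {
                // Reading to the end forces the whole document to be checked.
                while (reader.Read()) { }
            }
            error = string.Empty;
            return false;
        }
        catch (XmlException ex)
        {
            // LineNumber and LinePosition point at where the parser gave up.
            error = "line " + ex.LineNumber + ", position " + ex.LinePosition + ": " + ex.Message;
            return true;
        }
    }
}
```

Note that the reported position is where the parser notices the mismatch, which can be well after the stray <p> itself, so this only narrows down where to look.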




Answer 3:


This will work. It takes the HTML document in the string xhtml and overwrites each stray <p> with spaces:

using System;
using System.Text;

public static class XHTMLCleanerUpperThingy
{
    // Only the bare, attribute-free forms of the tags are recognised.
    private const string p = "<p>";
    private const string closingp = "</p>";

    public static string CleanUpXHTML(string xhtml)
    {
        StringBuilder builder = new StringBuilder(xhtml);
        int current = xhtml.IndexOf(p, StringComparison.Ordinal);

        // Visit each <p> once instead of rescanning from every character.
        while (current != -1)
        {
            int idxofnext = xhtml.IndexOf(p, current + p.Length, StringComparison.Ordinal);
            int idxofclose = xhtml.IndexOf(closingp, current, StringComparison.Ordinal);

            // The <p> at 'current' is stray if no </p> follows it at all,
            // or if another <p> opens before the next </p>.
            bool noClose = idxofclose < 0;
            bool reopenedFirst = idxofnext >= 0 && idxofnext < idxofclose;

            if (noClose || reopenedFirst)
            {
                // Overwrite the tag with spaces so all other offsets stay valid.
                for (int j = 0; j < p.Length; j++)
                {
                    builder[current + j] = ' ';
                }
            }

            current = idxofnext;
        }

        return builder.ToString();
    }
}
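
For the narrow case in the question, a similar rule can be sketched as a single regex with a lookahead. This is an illustration rather than an equivalent of the code above: the class name is made up, it only handles bare <p> tags without attributes, and it treats a <p> as stray only when the very next tag is another <p> (only text or whitespace in between). It also deletes the tag instead of overwriting it with spaces.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class StrayParagraphTagRemover
{
    // A bare <p> whose next tag is another bare <p>.
    // [^<]* skips text and whitespace but stops at the next tag.
    private static readonly Regex StrayOpenP = new Regex(@"<p>(?=[^<]*<p>)");

    public static string Remove(string xhtml)
    {
        // The lookahead does not consume the following <p>,
        // so consecutive stray tags are each matched in turn.
        return StrayOpenP.Replace(xhtml, string.Empty);
    }
}
```

With this sketch, StrayParagraphTagRemover.Remove("<p>ABC<p>") returns "ABC<p>", and the two empty <p> lines from the example file are removed while <p>mark</p> is left alone. An HTML comment after a stray <p> would stop the lookahead from matching, since the comment starts with <.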


Source: https://stackoverflow.com/questions/3666901/regex-for-only-data-and-end-tag
