Regexp to remove all html tags except <br>

自闭症网瘾萝莉.ら submitted on 2021-02-05 06:12:12

Question


I'm trying to make a regexp in javascript to remove ALL the html tags from an input string, except <br>.

I use /(<([^>]+)>)/ig for the tags and have tried a few things like adding [^(br)] to it, but I'm just getting confused now.

Could anyone help? I'm sure it's going to be a speed contest between SO gurus, so if the answer explains the logic of the expression, I'll choose it over the others.

Edit :

To all the 'don't do it' people, let me quote the following from Stack Overflow

While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML.

In this particular case, it's a bunch of text in a div that stays consistent across many pages. I just want to get rid of a few cases (1% at most) where the users have included spans, strongs and a few other formatting tags. It isn't worth more effort than a quick regexp, as it barely happens over the thousands of pages I process. If you have a better, faster to implement idea, feel free to post it as an answer ;)

Edit 2

So many comments, I feel like adding a disclaimer: using regexp to parse HTML is bad. It will not work consistently and there are much better ways. DOMParser has been mentioned; there's Cheerio or jsdom on Node.js, and a lot more libraries that will parse an HTML document correctly (in 99% of cases). In this case, it is more like a string that happens to contain a few <...> that I needed to remove.


Answer 1:


Try this:

/(<((?!br)[^>]+)>)/ig
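Since the question asks for the logic: `(?!br)` is a negative lookahead placed right after the opening `<`, so the pattern only matches a tag whose first two characters are not "br". By contrast, `[^(br)]` is a character class that matches any single character other than `(`, `b`, `r` or `)`, which is why it does not help here. The sketch below wraps the pattern in a helper (`stripTagsKeepBr` is a name made up for illustration) and shows two side effects that may or may not matter for a given input: any tag that merely starts with "br" is kept, and a closing `</br>` is removed.

```javascript
// Hypothetical helper around Answer 1's pattern.
function stripTagsKeepBr(input) {
  // "<", then (only if the next characters are not "br") one or more
  // characters that are not ">", then ">". The i flag makes <BR> count too.
  return input.replace(/(<((?!br)[^>]+)>)/ig, '');
}

const sample = 'a <span>b</span><br>c <strong>d</strong><br />e';
const cleaned = stripTagsKeepBr(sample);
console.log(cleaned); // a b<br>c d<br />e

// Side effects of the simple lookahead:
console.log(stripTagsKeepBr('<BR><broken></br>x')); // <BR><broken>x
```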



Answer 2:


Use a DOMParser to parse your string, then traverse it (I used the code in this question), extracting the parts that you are interested in:

var str = "<div>some text <span>some more</span><br /><a href='#'>a link</a>";
var parser = new DOMParser();
var dom = parser.parseFromString(str, "text/html");
var text = "";
var walkDOM = function (node, func) {
    func(node);
    node = node.firstChild;
    while (node) {
        walkDOM(node,func);
        node = node.nextSibling;
    }
};

// Collect text nodes and <br> elements; every other tag is dropped.
walkDOM(dom, function (node) {
    if (node.tagName === 'BR') {
        text += node.outerHTML;
    }
    else if (node.nodeType === Node.TEXT_NODE) { // nodeType 3
        text += node.nodeValue;
    }
});

alert(text);



Answer 3:


This might work. But no matter the regex, it will not truly parse HTML.

 /(?!<\/?br\s*\/?>)<[^>]+>/g

Expanded:

 (?! < /? br \s* /? > )
 < [^>]+ >
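To make the difference from a bare `(?!br)` check concrete, here is a small sketch of how this pattern behaves. The lookahead only spares a complete `<br>`, `<br/>`, `<br />` or `</br>`, so a tag such as `<broken>` is still removed. Two things to be aware of, both following from the pattern as written: there is no `i` flag, so an uppercase `<BR>` is stripped, and a `<br>` carrying attributes is stripped as well. The helper name `stripTagsKeepPlainBr` is made up for this example.

```javascript
// Answer 3's pattern: a tag that is not exactly <br>, <br/>, <br /> or </br>.
const tagButNotBr = /(?!<\/?br\s*\/?>)<[^>]+>/g;

// Hypothetical helper for illustration.
function stripTagsKeepPlainBr(input) {
  return input.replace(tagButNotBr, '');
}

console.log(stripTagsKeepPlainBr('<p>x</p><br><br/><br /></br><broken>y'));
// x<br><br/><br /></br>y

// Case-sensitive, and attributes on <br> are not tolerated:
console.log(stripTagsKeepPlainBr('<BR><br class="c">z')); // z
```

Adding the `i` flag would keep `<BR>` too, if that is wanted.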



Answer 4:


I ended up using :

.replace('<br>','%br%').replace(/(<([^>]+)>)/g,'')

then I split on the '%br%' instead of the regular br tag. It is not an HTML parser, and I am sure it would fail on most of the World Wide Web, but it solves my particular problem 100% of the time (just tried and tested).
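One caveat worth checking against your own data: when the first argument to `replace` is a string, only the first occurrence is replaced, so any later `<br>` would be stripped along with the other tags. A minimal sketch of the same placeholder idea with a global regex follows. It assumes the literal text `%br%` never appears in the input, and `splitOnBr` is a name invented for this example.

```javascript
// Hypothetical helper: protect <br>, strip every other tag, split on the placeholder.
function splitOnBr(html) {
  return html
    .replace(/<br>/g, '%br%')      // g flag: every <br>, not only the first
    .replace(/(<([^>]+)>)/g, '')   // remove all remaining tags
    .split('%br%');
}

const parts = splitOnBr('one<br><span>two</span><br>three');
console.log(parts); // [ 'one', 'two', 'three' ]

// For comparison, the string-pattern version protects only the first <br>:
const firstOnly = 'one<br>two<br>three'
  .replace('<br>', '%br%')
  .replace(/(<([^>]+)>)/g, '');
console.log(firstOnly); // one%br%twothree
```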



Source: https://stackoverflow.com/questions/25877030/regexp-to-remove-all-html-tags-except-br
