How to remove text between <script></script> tags

Question

I want to remove the content between the <script> and </script> tags. I'm checking for the pattern manually and iterating with a while loop, but I get a StringIndexOutOfBoundsException at this line:

String script = source.substring(startIndex,endIndex-startIndex);

Below is the complete method:

public static String getHtmlWithoutScript(String source) {
    String START_PATTERN = "<script>";
    String END_PATTERN = " </script>";
    while (source.contains(START_PATTERN)) {
        int startIndex=source.lastIndexOf(START_PATTERN);
        int endIndex=source.indexOf(END_PATTERN,startIndex);

        String script=source.substring(startIndex,endIndex);
        source.replace(script,"");
    }
    return source;
}

Am I doing anything wrong here? I'm also getting endIndex = -1. Can anyone help me identify why my code is breaking?

Answer 1:

    String text = "<script>This is dummy text to remove </script> dont remove this";
    StringBuilder sb = new StringBuilder(text);
    String startTag = "<script>";
    String endTag = "</script>";

    // Remove the text between the script tags (the tags themselves stay)
    sb.replace(text.indexOf(startTag) + startTag.length(), text.indexOf(endTag), "");

    System.out.println(sb.toString());

If you want to remove the script tags too, add the following line:

    String result = sb.toString().replace(startTag, "").replace(endTag, "");

UPDATE:

If you don't want to use StringBuilder, you can do this:

    String text = "<script>This is dummy text to remove </script> dont remove this";
    String startTag = "<script>";
    String endTag = "</script>";

    //removing the text between script
    String textToRemove = text.substring(text.indexOf(startTag) + startTag.length(), text.indexOf(endTag));
    text = text.replace(textToRemove, "");

    System.out.println(text);
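
Both snippets above handle only the first <script> block. As for the method in the question, two things stand out: END_PATTERN starts with a space, so indexOf returns -1 whenever the closing tag is not preceded by a space, and the result of source.replace(...) is discarded (Java strings are immutable), so the loop never makes progress. A minimal sketch of a corrected loop, assuming the tags appear exactly as <script> and </script> (lowercase, no attributes); the class name ScriptRemover is made up for illustration:

```java
class ScriptRemover {

    // Removes every <script>...</script> block, tags included.
    static String getHtmlWithoutScript(String source) {
        final String startPattern = "<script>";
        final String endPattern = "</script>"; // no leading space

        int startIndex = source.indexOf(startPattern);
        while (startIndex != -1) {
            int endIndex = source.indexOf(endPattern, startIndex + startPattern.length());
            if (endIndex == -1) {
                break; // unclosed tag: leave the remainder untouched
            }
            // Strings are immutable, so the result must be assigned back
            source = source.substring(0, startIndex)
                    + source.substring(endIndex + endPattern.length());
            // Continue searching from where the removed block started
            startIndex = source.indexOf(startPattern, startIndex);
        }
        return source;
    }

    public static void main(String[] args) {
        String html = "a<script>x</script>b<script>y</script>c";
        System.out.println(getHtmlWithoutScript(html)); // prints "abc"
    }
}
```

Checking endIndex for -1 before calling substring is what avoids the exception from the question.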

Answer 2:

You can use a regex to remove the script tag content:

public String removeScriptContent(String html) {
    if (html != null) {
        String re = "<script>(.*)</script>";

        Pattern pattern = Pattern.compile(re);
        Matcher matcher = pattern.matcher(html);
        if (matcher.find()) {
            // group(1) is the text captured between the tags
            return html.replace(matcher.group(1), "");
        }
    }
    // null input or no script tag found: return the input unchanged
    return html;
}

You have to add these two imports:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Answer 3:

I know I'm probably late to the party, but I would like to give you a regex-based solution (one I have actually tested).

What you have to note here is that regular-expression quantifiers are greedy by default. So a pattern such as <script>(.*)</script> matches from the first <script> up to the last </script> on the line (or in the whole input, depending on the regex options used), swallowing everything in between, including any other script blocks and the text that separates them.

To match each block accurately, you can use a "lazy" (reluctant) quantifier instead.

Search with lazy matching: <script>(.*?)<\/script>

Now with that, you will get accurate results.
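
To see the difference in Java, here is a small sketch that runs both patterns through Matcher.replaceAll. The class and method names are hypothetical, and the escaped slash (\/) is left out because a Java pattern does not need it:

```java
import java.util.regex.Pattern;

class GreedyVsLazy {

    // Greedy: .* grabs as much as possible, up to the LAST </script>
    static final Pattern GREEDY = Pattern.compile("<script>(.*)</script>");

    // Lazy: .*? grabs as little as possible, up to the NEXT </script>
    static final Pattern LAZY = Pattern.compile("<script>(.*?)</script>");

    // Removes every match of the given pattern, tags included.
    static String strip(Pattern pattern, String html) {
        return pattern.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "a<script>x</script>b<script>y</script>c";
        System.out.println(strip(GREEDY, html)); // prints "ac" (the "b" is lost)
        System.out.println(strip(LAZY, html));   // prints "abc"
    }
}
```

By default . does not match line terminators, so for scripts that span several lines you would likely need Pattern.DOTALL (or the (?s) flag) as well.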

You can read more about lazy and greedy regex matching in this answer.

Source: https://stackoverflow.com/questions/32843295/how-to-remove-text-between-script-script-tags

Tags

java

html

html-parsing
