Java regex performance

一世执手 提交于 2019-11-29 11:42:50

Hint: Don't use regexes for link extraction or other HTML "parsing" tasks!

Your regex has 6 (SIX) repeating groups in it. Executing it will entail a lot of backtracking. In the worst case, it could even approach O(N^6) where N is the number of input characters. You could ease this a bit by replacing eager matching with lazy matching, but it is almost impossible to avoid pathological cases; e.g. when the input data is sufficiently malformed that the regex does not match.

A far, far better solution is to use some existing strict or permissive HTML parser. Even writing an ad-hoc parser by hand is going to be better than using gnarly regexes.

This page that lists various HTML parsers for Java. I've heard good things about TagSoup and HtmlCleaner.

All your time, all of it, is being spent here:

 html+=line;

Use a StringBuffer. Better still, if you can, run the match on every line and don't accumulate them at all.

Jonny

I have written simple test for comparing 10 million operation RegExp performance against String.indexof() with the following result:

0.447 seconds
6.174 seconds
13.812080536912752 times regexp longer.

import java.util.regex.Pattern;

public class TestRegExpSpeed {
    public static void main(String[] args) {
        String match = "FeedUserMain_231_Holiday_Feed_MakePresent-1_";
        String unMatch = "FeedUserMain_231_Holiday_Feed_Make2Present-1_";

        long start = System.currentTimeMillis();
        for (int i = 0; i <= 10000000; i++) {
            if (i % 2 == 0) {
                match.indexOf("MakePresent");
            } else {
                unMatch.indexOf("MakePresent");
            }
        }

        double indexOf = (System.currentTimeMillis() - start) / 1000.;
        System.out.println(indexOf + " seconds");

        start = System.currentTimeMillis();
        Pattern compile = Pattern.compile(".*?MakePresent.*?");
        for (int i = 0; i <= 10000000; i++) {
            if (i % 2 == 0) {
                compile.matcher(match).matches();
            } else {
                compile.matcher(unMatch).matches();
            }
        }
        double reaexp = (System.currentTimeMillis() - start) / 1000.;
        System.out.println(reaexp + " seconds");

        System.out.println(reaexp / indexOf + " times regexp longer. ");
    }
}

Try Jaunt instead. Please don't use regex for this.

Regex use vs. Regex abuse

Regular expressions are not Parsers. Although you can do some amazing things with regular expressions, they are weak at balanced tag matching. Some regex variants have balanced matching, but it is clearly a hack -- and a nasty one. You can often make it kinda-sorta work, as I have in the sanitize routine. But no matter how clever your regex, don't delude yourself: it is in no way, shape or form a substitute for a real live parser.

Source

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!