Why are most string manipulations in Java based on regexp?

久未见 提交于 2019-12-02 15:47:22

Note that the regex need not be recompiled each time. From the Javadoc:

An invocation of this method of the form str.split(regex, n) yields the same result as the expression

Pattern.compile(regex).split(str, n) 

That is, if you are worried about performance, you may precompile the pattern and then reuse it:

Pattern p = Pattern.compile(regex);
...
String[] tokens1 = p.split(str1); 
String[] tokens2 = p.split(str2); 
...

instead of

String[] tokens1 = str1.split(regex);
String[] tokens2 = str2.split(regex);
...

I believe that the main reason for this API design is convenience. Since regular expressions include all "fixed" strings/chars too, it simplifies the API to have one method instead of several. And if someone is worried about performance, the regex can still be precompiled as shown above.

My feeling (which I can't back with any statistical evidence) is that most of the cases String.split() is used in a context where performance is not an issue. E.g. it is a one-off action, or the performance difference is negligible compared to other factors. IMO rare are the cases where you split strings using the same regex thousands of times in a tight loop, where performance optimization indeed makes sense.

It would be interesting to see a performance comparison of a regex matcher implementation with fixed strings/chars compared to that of a matcher specialized to these. The difference might not be big enough to justify the separate implementation.

I wouldn't say most string manipulations are regex-based in Java. Really we are only talking about split and replaceAll/replaceFirst. But I agree, it's a big mistake.

Apart from the ugliness of having a low-level language feature (strings) becoming dependent on a higher-level feature (regex), it's also a nasty trap for new users who might naturally assume that a method with the signature String.replaceAll(String, String) would be a string-replace function. Code written under that assumption will look like it's working, until a regex-special character creeps in, at which point you've got confusing, hard-to-debug (and maybe even security-significant) bugs.

It's amusing that a language that can be so pedantically strict about typing made the sloppy mistake of treating a string and a regex as the same thing. It's less amusing that there's still no builtin method to do a plain string replace or split. You have to use a regex replace with a Pattern.quoted string. And you only even get that from Java 5 onwards. Hopeless.

@Tim Pietzcker:

Are there other languages that do the same?

JavaScript's Strings are partly modelled on Java's and are also messy in the case of replace(). By passing in a string, you get a plain string replace, but it only replaces the first match, which is rarely what's wanted. To get a replace-all you have to pass in a RegExp object with the /g flag, which again has problems if you want to create it dynamically from a string (there is no built-in RegExp.quote method in JS). Luckily, split() is purely string-based, so you can use the idiom:

s.split(findstr).join(replacestr)

Plus of course Perl does absolutely everything with regexen, because it's just perverse like that.

(This is a comment more than an answer, but is too big for one. Why did Java do this? Dunno, they made a lot of mistakes in the early days. Some of them have since been fixed. I suspect if they'd thought to put regex functionality in the box marked Pattern back in 1.0, the design of String would be cleaner to match.)

I imagine a good reason is that they can simply pass the buck on to the regex method, which does all the real heavy lifting for all of the string methods. Im guessing they thought if they already had a working solution it was less efficient, from a development and maintenance standpoint, to reinvent the wheel for each string manipulation method.

Interesting discussion!

Java was not originally intended as a batch programming language. As such the API out of the box are more tuned towards doing one "replace" , one "parse" etc. except on Application initialization when the app may be expected to be parsing a bunch of configuration files.

Hence optimization of these APIs was sacrificed in the altar of simplicity IMO. But the question brings up an important point. Python's desire to keep the regex distinct from the non regex in its API, stems from the fact that Python can be used as an excellent scripting language as well. In UNIX too, the original versions of fgrep did not support regex.

I was engaged in a project where we had to do some amount of ETL work in java. At that time, I remember coming up with the kind of optimizations that you have alluded to, in your question.

I suspect that the reason why things like String#split(String) use regexp under the hood is because it involves less extraneous code in the Java Class Library. The state machine resulting from a split on something like , or space is so simple that it is unlikely to be significantly slower to execute than a statically implemented equivalent using a StringCharacterIterator.

Beyond that the statically implemented solution would complicate runtime optimization with the JIT because it would be a different block of code that also requires hot code analysis. Using the existing Pattern algorithms regularly across the library means that they are more likely candidates for JIT compilation.

Very good question..

I suppose when the designers sat down to look at this (and not for very long, it seems), they came at it from a point of view that it should be designed to suit as many different possibilities as possible. Regular Expressions offered that flexibility.

They didn't think in terms of efficiencies. There is the Java Community Process available to raise this.

Have you looked at using the java.util.regex.Pattern class, where you compile the expression once and then use on different strings.

Pattern exp = Pattern.compile(":");
String[] array = exp.split(sourceString1);
String[] array2 = exp.split(sourceString2);

In looking at the Java String class, the uses of regex seem reasonable, and there are alternatives if regex is not desired:

http://java.sun.com/javase/6/docs/api/java/lang/String.html

boolean matches(String regex) - A regex seems appropriate, otherwise you could just use equals

String replaceAll/replaceFirst(String regex, String replacement) - There are equivalents that take CharSequence instead, preventing regex.

String[] split(String regex, int limit) - A powerful but expensive split, you can use StringTokenizer to split by tokens.

These are the only functions I saw that took regex.

Edit: After seeing that StringTokenizer is legacy, I would defer to Péter Török's answer to precompile the regex for split instead of using the tokenizer.

The answer to your question is that the Java core API did it wrong. For day to day work you can consider using Guava libraries' CharMatcher which fills the gap beautifully.

...why was the Java API chosen the way it is now?

Short answer: it wasn't. Nobody ever decided to favor regex methods over non-regex methods in the String API, it just worked out that way.

I always understood that Java's designers deliberately kept the string-manipulation methods to a minimum, in order to avoid API bloat. But when regex support came along in JDK 1.4, of course they had to add some convenience methods to String's API.

So now users are faced with a choice between the immensely powerful and flexible regex methods, and the bone-basic methods that Java always offered.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!