How to tokenize a String like a lexer in Java?

Question

These are the lines of code I want to tokenize according to lexer rules:

  String input1 = input.replaceAll("\\s+"," ");
  List<String> uncleanList = Arrays.asList(input1.split(" "));

I put this code in a String and replaced every run of whitespace with a single space:

  String s = codeString.replaceAll("\\s+"," ");

Then I called

  String[] t = s.split(" ");

on that string, which gave me an array split on single spaces. This is the result (console output of System.out.println(Arrays.toString(t));):

[String, input1, =, input.replaceAll("\\s+",", ");, List<String>, uncleanList, =, Arrays.asList(input1.split(", "));]

But there are a lot of parentheses ( ), angle brackets < >, dots, quotes "" and so on that are not separated by spaces. Now I am stuck here. How can I separate the symbols from the letters and digits so that each one ends up at its own index?

Desired array output when printed to the console:

 [String, input1, =, input,.,replaceAll,(,"\\s+"," ",),;, List,<,String,>, uncleanList, =, Arrays,.,asList,(,input1,.,split,(," ",),),;,]

Answer 1:

When you don't have a delimiter to use, split stops being an effective way to do tokenization. Instead of using split to find the parts you don't want, use find to find the parts you do want, like this:

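import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Words first, then a loose number form, then any single non-whitespace character
Pattern pattern = Pattern.compile("\\w+|[+-]?[0-9\\._Ee]+|\\S");
Matcher matcher = pattern.matcher(input);

// Find all matches; each call to find() advances to the next token
while (matcher.find()) {
  String token = matcher.group();
  // use the token here, e.g. add it to a list
}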

The example regex I provide here is simpler than what you really want. The important thing is that you provide the default pattern (\S) to match any non-whitespace character that isn't included in a longer match. That will take care of all the single-character tokens.
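To get the tokens into a list like the one asked for in the question, the find loop can be wrapped in a small helper. This is only a sketch: the class name SimpleTokenizer and the method name tokenize are made up for illustration, and the pattern is the same simplified one shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from the question.
class SimpleTokenizer {
    // Same simplified pattern as above: words, a loose number form,
    // then any single non-whitespace character as the fallback.
    private static final Pattern TOKEN =
            Pattern.compile("\\w+|[+-]?[0-9\\._Ee]+|\\S");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(input);
        // find() skips whitespace on its own because no alternative matches it
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("List<String> names = Arrays.asList(parts);"));
        // prints [List, <, String, >, names, =, Arrays, ., asList, (, parts, ), ;]
    }
}
```

Note that there is no need to collapse whitespace first; the matcher simply never produces a token for it.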

Some of the longer tokens you have to match, like strings and comments, are pretty complicated, so it will take some work to get this right.
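As a rough idea of what "some work" might look like, here is a sketch that adds alternatives for comments and double-quoted strings and tightens the number form. The simple pattern's `[0-9\._Ee]+` would, for instance, take `.e` out of `x.equals` as one token, which is why the number alternative here requires a leading digit. Everything in it is an assumption made for illustration: the class name SketchLexer, the choice to drop comments, and the token forms themselves, which cover only a small part of real Java syntax (no char literals, hex numbers, or multi-character operators).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not a complete Java lexer.
class SketchLexer {
    // Alternatives are tried left to right, so the longer, more specific
    // forms must come before the single-character fallback.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<comment>//[^\\n]*|/\\*[\\s\\S]*?\\*/)"   // line or block comment
        + "|\"(?:\\\\.|[^\"\\\\])*\""                   // double-quoted string with escapes
        + "|[A-Za-z_]\\w*"                              // identifier or keyword
        + "|\\d+(?:\\.\\d+)?(?:[Ee][+-]?\\d+)?"         // simple decimal number
        + "|\\S");                                      // any other single character

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        while (m.find()) {
            // The named group is null unless the comment alternative matched
            if (m.group("comment") == null) {
                tokens.add(m.group());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "String s = input.replaceAll(\"\\\\s+\", \" \"); // collapse spaces";
        System.out.println(tokenize(line));
    }
}
```

Because string literals are matched as a whole, the space inside `" "` stays part of its token instead of splitting it, which is the problem the split-based attempt in the question ran into.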

Source: https://stackoverflow.com/questions/58920961/how-to-tokenize-an-string-like-in-lexer-in-java

Tags

regex

string

algorithm

split