Multiple matches in single java regexp

为君一笑 posted on 2021-01-04 02:44:34

Question


Is it possible to match the following in a single regular expression to get the first word, and then a list of the numbers?

this 10 12 3 44 5 66 7 8    # should return "this", "10", "12", ...
another 1 2 3               # should return "another", "1", "2", "3"

EDIT1: My actual data is not this simple, the digits are actually more complex patterns, but for illustration purposes, I've reduced the problem to simple digits, so I do require a regex answer.

The numbers are unknown in length on each line, but all match a simple pattern.

The following only matches "this" and "10":

([\p{Alpha}]+ )(\d+ ?)+?

Dropping the final ? matches "this" and "8".

I had thought that the final group (\d+ ?)+ would do the digit matching multiple times, but it doesn't and I can't find the syntax to do it, if possible.

I can do it in multiple passes, by only searching for the name and latter numbers separately, but was wondering if it's possible in a single expression? (And if not, is there a reason?)


EDIT2: As I mentioned in some of the comments, this was a question in Advent of Code (Day 7, 2020). I was looking for the cleanest solution (who doesn't love a bit of polishing?).

Here's the solution (Kotlin) I ultimately used. I spent too long trying to do it in one regex, so I posted this question.

val bagExtractor = Regex("""^([\p{Alpha} ]+) bags contain""")
val rulesExtractor = Regex("""([\d]+) ([\p{Alpha} ]+) bag""")

// bagRule is a line from the input
val bag = bagExtractor.find(bagRule)?.destructured!!.let { (n) -> Bag(name = n) }
val contains = rulesExtractor.findAll(bagRule).map { it.destructured.let { (num, bagName) -> Contain(num = num.toInt(), bag = Bag(bagName)) } }.toList()
Rule(bag = bag, contains = contains)

Despite now knowing it can be done in 1 line, I haven't implemented it, as I think it's cleaner in 2.
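
For illustration, here is one way a single pattern might be written in Java. This is a sketch, not necessarily the one-line version referred to above: it assumes the simplified word-then-numbers input from the top of the question, and the class and method names (SinglePatternDemo, tokens) are made up for the example. It relies on \G, which matches where the previous match ended, so the single pattern is still applied repeatedly with find().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SinglePatternDemo {
    // Alternative 1: the leading word, only at the start of the line.
    // Alternative 2: a number, only where the previous match ended (\G);
    // (?!^) stops \G from also matching at the very start of the input.
    static final Pattern TOKEN =
            Pattern.compile("^(\\p{Alpha}+)|\\G(?!^)\\s+(\\d+)");

    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            // Exactly one of the two groups participates in each match.
            out.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("this 10 12 3 44 5 66 7 8"));
        System.out.println(tokens("another 1 2 3"));
    }
}
```

Because every number must start where the previous match ended, the chain stops at the first token that does not fit: for "Bad Data 5 5 5" the sketch returns only [Bad] instead of picking up the later digits.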


Answer 1:


I think what you are looking for can be achieved by splitting the string on \s+ unless I am missing something.

import java.util.Arrays;

public class Main {
    public static void main(String[] args) {
        String str = "this 10 12 3 44 5 66 7 8";
        String[] parts = str.split("\\s+");
        System.out.println(Arrays.toString(parts));
    }
}

Output:

[this, 10, 12, 3, 44, 5, 66, 7, 8]

If you want to select just the alphabetical text and the integer text from the string, you can do it as

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "this 10 12 3 44 5 66 7 8";
        Matcher matcher = Pattern.compile("(\\b\\p{Alpha}+\\b)|(\\b\\d+\\b)").matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group());
        }
    }
}

Output:

this
10
12
3
44
5
66
7
8

or as

import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class Main {
    public static void main(String[] args) {
        String str = "this 10 12 3 44 5 66 7 8";

        List<String> list = Pattern.compile("(\\b\\p{Alpha}+\\b)|(\\b\\d+\\b)")
                            .matcher(str)
                            .results()
                            .map(MatchResult::group)
                            .collect(Collectors.toList());

        System.out.println(list);
    }
}

Output:

[this, 10, 12, 3, 44, 5, 66, 7, 8]



Answer 2:


No. A capturing group inside a repetition does not produce one group per iteration: the number of groups is fixed by the pattern, and a repeated group keeps only the text from its last iteration, which is why you see "8". The notion of "find me all of a certain regexp" is handled by applying the regexp repeatedly (for example with Matcher.find()), not with incrementing groups. As for why regexps are what they are: that's an epic thesis that delves into some ancient computing history and a lot of interviews with Larry Wall (author of Perl, which popularized much of today's regexp syntax), and it seems a bit beyond the scope of SO. They work that way because they've worked that way for decades, and changing them would mess with people's expectations; let's not go any deeper than that.
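
To see why a repeated group cannot hand back a list, here is a minimal sketch. The class name RepeatedGroupDemo and the slightly simplified pattern are illustrative assumptions, not taken from the question. It shows that the group count is fixed when the pattern is compiled, and that the repeated group holds only its last iteration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Group 1 is the word; group 2 sits inside a repeated non-capturing group.
    static final Pattern LINE = Pattern.compile("(\\p{Alpha}+)(?:\\s+(\\d+))+");

    // Returns what group 2 holds after matching the whole line, or null if no match.
    static String repeatedGroupValue(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        Matcher m = LINE.matcher("this 10 12 3 44 5 66 7 8");
        System.out.println(m.groupCount());   // 2, regardless of how many numbers follow
        if (m.matches()) {
            System.out.println(m.group(1));   // this
            System.out.println(m.group(2));   // 8 (earlier iterations were overwritten)
        }
    }
}
```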

You can do this with scanners instead:

Scanner s = new Scanner("this 10 12 3 44 5 66 7 8");
assertEquals("this", s.next());
assertEquals(10, s.nextInt());
// etc

or even:

Scanner s = new Scanner("this 10 12 3 44 5 66 7 8");
assertEquals("this", s.next(Pattern.compile("[\\p{Alpha}]+")));
assertEquals(10, s.nextInt());

s = new Scanner("--00invalid-- 10 12 3 44 5 66 7 8");
// the line below will throw an InputMismatchException
s.next(Pattern.compile("[\\p{Alpha}]+"));

NB: Scanners tokenize: they split the input into a sequence of token, separator, token, separator, etc., then toss the separators and give you the tokens. .next(Pattern) does not mean "keep scanning until you hit something that matches". It just means: grab the next token. If it matches this regexp, great, return it. Otherwise, crash (with an InputMismatchException).

So, the real magic is in making the scanner tokenize as you want. This is done with .useDelimiter(), which is also regexp-based. Some fancy footwork with positive lookahead and co. can get you far, but it's not infinitely powerful. You didn't expand on the actual structure of your input, so I can't say if it'll suffice for your needs.
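
As a rough sketch of the delimiter idea: the input format below (tokens separated by any mix of commas, semicolons and whitespace) is an assumption made up for the example, as are the names DelimiterDemo and tokens. Only Scanner.useDelimiter, hasNext and next are real API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {
    // Collects every token, treating runs of commas, semicolons
    // and whitespace as a single separator.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        try (Scanner s = new Scanner(line)) {
            s.useDelimiter("[,;\\s]+");
            while (s.hasNext()) {
                out.add(s.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("this 10, 12;3"));   // [this, 10, 12, 3]
    }
}
```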




Answer 3:


Assuming you are talking about this Advent of Code puzzle, where the inputs are rules like these:

light red bags contain 1 bright white bag, 2 muted yellow bags.
dark orange bags contain 3 bright white bags, 4 muted yellow bags.
bright white bags contain 1 shiny gold bag.
muted yellow bags contain 2 shiny gold bags, 9 faded blue bags.
shiny gold bags contain 1 dark olive bag, 2 vibrant plum bags.
dark olive bags contain 3 faded blue bags, 4 dotted black bags.
vibrant plum bags contain 5 faded blue bags, 6 dotted black bags.
faded blue bags contain no other bags.
dotted black bags contain no other bags.

Why search for a complicated regular expression when you can easily split on the word "contain" or on a comma?

String str1 = "light red bags contain 1 bright white bag, 2 muted yellow bags.";
String str2 = "dotted black bags contain no other bags.";
String[] split1 = str1.split("\\scontain\\s|,");
String[] split2 = str2.split("\\scontain\\s|,");

System.out.println(Arrays.toString(split1));
System.out.println(Arrays.toString(split2));

//[light red bags, 1 bright white bag,  2 muted yellow bags.]
//[dotted black bags, no other bags.]
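
The split pieces still need to be broken into a count and a bag name. One possible follow-up, sketched under assumptions: it reuses the asker's Kotlin pattern (\d+) ([\p{Alpha} ]+) bag in Java, and the class name BagRuleDemo, the method contents and the choice of a Map are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BagRuleDemo {
    // Same shape as the asker's rulesExtractor: a count, then a bag name.
    static final Pattern CONTENT = Pattern.compile("(\\d+) ([\\p{Alpha} ]+) bag");

    // Maps each contained bag name to its count; empty for "no other bags".
    static Map<String, Integer> contents(String rule) {
        Map<String, Integer> out = new LinkedHashMap<>();
        String[] parts = rule.split("\\scontain\\s|,");
        // parts[0] is the outer bag; the remaining parts are its contents.
        for (int i = 1; i < parts.length; i++) {
            Matcher m = CONTENT.matcher(parts[i]);
            if (m.find()) {
                out.put(m.group(2), Integer.parseInt(m.group(1)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(contents("light red bags contain 1 bright white bag, 2 muted yellow bags."));
        System.out.println(contents("dotted black bags contain no other bags."));
    }
}
```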



Answer 4:


You said you had to use a regex, but how about a hybrid solution? Use the regex to verify the format, and then split the values on spaces or the delimiter of your choosing. I also return the value in an Optional so you can check whether it is present before use.

String[] data = { "this 10 12 3 44 5 66 7 8",
        "Bad Data 5 5 5",
        "another 1 2 3" };

for (String text : data) {
    Optional<List<String>> op = parseText(text);
    if (op.isPresent()) {
        System.out.println(op.get());
    }
}

Prints

[this, 10, 12, 3, 44, 5, 66, 7, 8]
[another, 1, 2, 3]

The parseText method used above:

static String pattern = "([a-zA-Z]+)(\\s+\\d+)+";

public static Optional<List<String>> parseText(String text) {
    if (text.matches(pattern)) {
        return Optional.of(Arrays.stream(text.split("\\s+"))
                .collect(Collectors.toList()));
    }
    return Optional.empty();
}


Source: https://stackoverflow.com/questions/65186269/multiple-matches-in-single-java-regexp
