How to split a comma separated String while ignoring escaped commas?

我寻月下人不归 2020-11-29 03:43

I need to write an extended version of the StringUtils.commaDelimitedListToStringArray function that takes an additional parameter: the escape char.

so calling my:

4 Answers
  •  时光说笑
    2020-11-29 04:06

    As matt b said, [^\\], will interpret the character preceding the comma as a part of the delimiter.

    "test\\\\\\,test\\\\,test\\,test,test"
      -(split)->
    ["test\\\\\\,test\\\\,test\\,tes" , "test"]
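
    The lost character is easy to reproduce. The following is a minimal sketch, assuming the pattern is passed to String.split; the class and method names are made up for illustration.

```java
import java.util.Arrays;

class ConsumingSplitDemo {
    // Delimiter: "any character except a backslash, followed by a comma".
    // The character before the comma is part of the match, so split() drops it.
    static String[] split(String list) {
        return list.split("[^\\\\],");
    }

    public static void main(String[] args) {
        String list = "test\\\\\\,test\\\\,test\\,test,test";
        // The final "t" of the fourth "test" is consumed together with the comma.
        System.out.println(Arrays.toString(split(list)));
    }
}
```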
    

    As drvdijk said, (?<!\\), will misinterpret escaped backslashes.

    "test\\\\\\,test\\\\,test\\,test,test"
      -(split)->
    ["test\\\\\\,test\\\\,test\\,test" , "test"]
      -(unescape commas)->
    ["test\\\\,test\\,test,test" , "test"]
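
    The smallest input that shows the problem is an escaped backslash directly before a real delimiter. This sketch assumes the pattern in question is a negative lookbehind for a single backslash, (?<!\\),; the class and method names are illustrative only.

```java
class LookbehindSplitDemo {
    // Splits on commas that are not immediately preceded by a backslash.
    // Assumption: this is the negative-lookbehind pattern discussed above.
    static String[] split(String list) {
        return list.split("(?<!\\\\),");
    }

    public static void main(String[] args) {
        // Raw text: a\\,b  (an escaped backslash, then a real delimiter).
        // The lookbehind only sees "a backslash before the comma", so it
        // refuses to split and returns one element instead of two.
        String[] parts = split("a\\\\,b");
        System.out.println(parts.length); // prints 1
    }
}
```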
    

    I would expect to be able to escape backslashes as well:

    "test\\\\\\,test\\\\,test\\,test,test"
      -(split)->
    ["test\\\\\\,test\\\\" , "test\\,test" , "test"]
      -(unescape commas and backslashes)->
    ["test\\,test\\" , "test,test" , "test"]
    

    drvdijk suggested (?<=(? which works well for lists with elements ending with up to 100 backslashes. That is more than enough, but why have a limit at all? Is there a more efficient way (isn't lookbehind greedy)? What about invalid strings?

    I searched for a while for a generic solution, then wrote one myself. The idea is to use a pattern that matches the list elements (instead of one that matches the delimiter).

    My answer does not take the escape character as a parameter.

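    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public static List<String> commaDelimitedListStringToStringList(String list) {
        // Check the validity of the list
        // ex: "te\\st" is not valid, backslash should be escaped
        if (!list.matches("^(([^\\\\,]|\\\\,|\\\\\\\\)*(,|$))+")) {
            // Could also raise an exception
            return null;
        }
        // Matcher for the list elements
        Matcher matcher = Pattern
                .compile("(?<=(^|,))([^\\\\,]|\\\\,|\\\\\\\\)*(?=(,|$))")
                .matcher(list);
        List<String> result = new ArrayList<>();
        int nextStart = 0; // index at which the next element must begin
        while (matcher.find()) {
            // After an element ending with an escaped comma, the lookbehind
            // sees that comma and yields a spurious empty match; skip it
            if (matcher.start() < nextStart) {
                continue;
            }
            nextStart = matcher.end() + 1;
            // Unescape the list element
            result.add(matcher.group().replaceAll("\\\\([\\\\,])", "$1"));
        }
        return result;
    }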
    

    Description of the pattern (without Java string escaping):

    (?<=(^|,)) lookbehind: the element is preceded by the start of the string or a ,

    ([^\\,]|\\,|\\\\)* the element itself, composed of \,, \\ or characters which are neither \ nor ,

    (?=(,|$)) lookahead: the element is followed by a , or the end of the string

    The pattern may be simplified.

    Even with the 3 parsings (matches + find + replaceAll), this method seems faster than the one suggested by drvdijk. It can still be optimized by writing a specific parser.
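
    For comparison, a hand-written parser could look roughly like the sketch below. It is not the method above but a single-pass alternative under stated assumptions: the escape character is passed in as a parameter (as the question asks), it is assumed to differ from the comma, only the escape character and the comma may be escaped, and invalid input raises an exception instead of returning null. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedListParser {
    // Splits the list on unescaped commas in one pass and unescapes as it goes.
    // Assumption: escape != ','.
    static List<String> split(String list, char escape) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < list.length(); i++) {
            char c = list.charAt(i);
            if (c == escape) {
                // An escape must be followed by the escape char or a comma
                if (i + 1 == list.length()) {
                    throw new IllegalArgumentException("dangling escape at index " + i);
                }
                char next = list.charAt(++i);
                if (next != escape && next != ',') {
                    throw new IllegalArgumentException("invalid escape at index " + (i - 1));
                }
                current.append(next);
            } else if (c == ',') {
                // Unescaped comma: the current element is complete
                result.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        // The last element has no trailing delimiter
        result.add(current.toString());
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("test\\\\\\,test\\\\,test\\,test,test", '\\'));
    }
}
```

    Because the input is read once and no regular expression is involved, there is no limit on the number of backslashes at the end of an element.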

    Also, why have an escape character at all if only one character is special? The comma could simply be doubled:

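    public static List<String> commaDelimitedListStringToStringList2(String list) {
        // Check the validity of the list
        if (!list.matches("^(([^,]|,,)*(,|$))+")) {
            return null;
        }
        // Matcher for the list elements (",," is an escaped comma)
        Matcher matcher = Pattern.compile("(?<=(^|,))([^,]|,,)*(?=(,|$))")
                        .matcher(list);
        List<String> result = new ArrayList<>();
        int nextStart = 0; // index at which the next element must begin
        while (matcher.find()) {
            // Skip the spurious empty match after an element ending with ",,"
            if (matcher.start() < nextStart) {
                continue;
            }
            nextStart = matcher.end() + 1;
            // Unescape the doubled commas
            result.add(matcher.group().replaceAll(",,", ","));
        }
        return result;
    }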
    
