Split string with “.” (dot) while handling abbreviations

爱一瞬间的悲伤 2021-01-12 02:47

I'm finding this fairly hard to explain, so I'll kick off with a few examples of before/after of what I'd like to achieve.

Example of input:

<
2 Answers
  • 2021-01-12 03:08

    Since every word starts with a capital (uppercase) letter, I would suggest that you first remove all dots (replace each one with the empty string ""). Then iterate over the characters and put a space between a lowercase letter and a following uppercase letter. Also, if you encounter an uppercase letter followed by a lowercase letter, put the space before the uppercase letter.

    It will work for all examples you provided, but I am not sure if there are any exceptions to my observation.
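
A minimal Java sketch of the heuristic described above, assuming the inputs follow the "every word starts with a capital letter" pattern. The class name CapitalSplit and the method name split are hypothetical, chosen only for illustration.

```java
class CapitalSplit {
    // Drops every dot, then re-inserts a space before an uppercase letter when
    // it follows a lowercase letter or is itself followed by a lowercase letter.
    static String split(String input) {
        String joined = input.replace(".", "");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < joined.length(); i++) {
            char c = joined.charAt(i);
            if (i > 0 && Character.isUpperCase(c)) {
                boolean afterLower = Character.isLowerCase(joined.charAt(i - 1));
                boolean beforeLower = i + 1 < joined.length()
                        && Character.isLowerCase(joined.charAt(i + 1));
                if (afterLower || beforeLower) {
                    out.append(' ');
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(split("Hello.World"));      // Hello World
        System.out.println(split("This.Is.A.Test"));   // This Is A Test
        System.out.println(split("The.S.W.A.T.Team")); // The SWAT Team
    }
}
```

The rule as stated says nothing about digits or lowercase letters inside an abbreviation, so this sketch does not separate them cleanly: "2001.A.Space.Odyssey" comes out as "2001A Space Odyssey".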

  • 2021-01-12 03:17

    How about removing the dots that need to disappear with a regex, and then replacing the rest of the dots with spaces? The regex (written as a Java string literal, hence the doubled backslashes) can look like (?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$)).

    String[] data = { 
            "Hello.World", 
            "This.Is.A.Test", 
            "The.S.W.A.T.Team",
            "S.w.a.T.", 
            "S.w.a.T.1", 
            "2001.A.Space.Odyssey" };
    
    for (String s : data) {
        System.out.println(s.replaceAll(
                "(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))", "")
                .replace('.', ' '));
    }
    

    Result:

    Hello World
    This Is A Test
    The SWAT Team
    SwaT 
    SwaT 1
    2001 A Space Odyssey
    

    In the regex I needed to escape the special meaning of the dot character. I could do that with \\. but I prefer [.].

    So at the center of the regex we have a literal dot. This dot is surrounded by (?<=...) and (?=...). These are the parts of the look-around mechanism called look-behind and look-ahead.

    • A dot that needs to be removed is preceded by a single character that is non-whitespace (\\S) and also non-digit (\\D), and that character is itself preceded by a dot or by the start of the data (^). I can test this with (?<=(^|[.])[\\S&&\\D])[.].

    • A dot that needs to be removed is also followed by a single non-whitespace, non-digit character and then by another dot or by the end of the data ($), which can be written as [.](?=[\\S&&\\D]([.]|$)).
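
To make the two look-around conditions concrete, here is a small sketch that lists the positions of the dots the regex would remove. Because look-behind and look-ahead are zero-width, every test runs against the original input, so neighbouring dots can each serve as context for the other. The class name LookaroundDemo and the method name removableDotIndexes are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Same regex as in the answer: a dot with exactly one non-whitespace,
    // non-digit character between it and the neighbouring dot (or the
    // start/end of the data) on both sides.
    static final Pattern REMOVABLE_DOT = Pattern.compile(
            "(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))");

    // Returns the index of every dot that satisfies both conditions.
    static List<Integer> removableDotIndexes(String input) {
        List<Integer> indexes = new ArrayList<>();
        Matcher m = REMOVABLE_DOT.matcher(input);
        while (m.find()) {
            indexes.add(m.start());
        }
        return indexes;
    }

    public static void main(String[] args) {
        // Only the dots inside "S.W.A.T" qualify; the dots after "The" and
        // before "Team" fail one of the two conditions.
        System.out.println(removableDotIndexes("The.S.W.A.T.Team")); // [5, 7, 9]
        System.out.println(removableDotIndexes("This.Is.A.Test"));   // []
    }
}
```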


    Depending on your needs, [\\S&&\\D], which besides letters also matches characters like !@#$%^&*()-_=+..., can be replaced with [a-zA-Z] for English letters only, or with \\p{IsAlphabetic} for all letters in Unicode.
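
A rough sketch of how the three character classes behave differently, assuming the regex is rebuilt around whichever class is passed in. The helper collapse and the class name LetterClassVariants are hypothetical, and the non-ASCII test input is written with Unicode escapes only to keep the source ASCII-safe.

```java
class LetterClassVariants {
    // Rebuilds the answer's regex around the given single-character class,
    // removes the matching dots, then turns the remaining dots into spaces.
    static String collapse(String input, String letterClass) {
        String regex = "(?<=(^|[.])" + letterClass + ")[.](?="
                + letterClass + "([.]|$))";
        return input.replaceAll(regex, "").replace('.', ' ');
    }

    public static void main(String[] args) {
        String nonSpaceNonDigit = "[\\S&&\\D]";
        String englishLetters = "[a-zA-Z]";
        String unicodeLetters = "\\p{IsAlphabetic}";

        // "!" counts as an abbreviation character only for the broad class.
        System.out.println(collapse("A.!.B", nonSpaceNonDigit)); // A!B
        System.out.println(collapse("A.!.B", englishLetters));   // A ! B

        // Accented capitals (E-acute, N-tilde) are merged only when the
        // Unicode-aware class is used; [a-zA-Z] leaves them as separate words.
        String accented = "\u00C9.\u00D1.Team";
        System.out.println(collapse(accented, englishLetters));
        System.out.println(collapse(accented, unicodeLetters));
    }
}
```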
