Split string with “.” (dot) while handling abbreviations

前端未结

关注

 2  1109

I\'m finding this fairly hard to explain, so I\'ll kick off with a few examples of before/after of what I\'d like to achieve.

Example of input:

相关标签:

2条回答

面向向阳花

2021-01-12 03:08

Since every word starts with a capital (uppercase) letter, I would suggest that you first remove all dots, and replace it with no space (""). Then, iterate over all characters and put space between lowercase letter and following uppercase letter. Also, if you encounter an uppercase with following lowercase, put the space before the uppercase.

It will work for all examples you provided, but I am not sure if there are any exceptions to my observation.

0 讨论(0)
发布评论:

提交评论
- 加载中...
野的像风

2021-01-12 03:17
How about removing dots that need to disappear with regex, and then replace rest of dots with space? Regex can look like (?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$)).
```
String[] data = { 
        "Hello.World", 
        "This.Is.A.Test", 
        "The.S.W.A.T.Team",
        "S.w.a.T.", 
        "S.w.a.T.1", 
        "2001.A.Space.Odyssey" };

for (String s : data) {
    System.out.println(s.replaceAll(
            "(?<=(^|[.])[\\S&&\\D])[.](?=[\\S&&\\D]([.]|$))", "")
            .replace('.', ' '));
}
```
result
```
Hello World
This Is A Test
The SWAT Team
SwaT 
SwaT 1
2001 A Space Odyssey
```
In regex I needed to escape special meaning of dot characters. I could do it with \\. but I prefer [.].

So at canter of regex we have dot literal. Now this dot is surrounded with (?<=...) and (?=...). These are parts of look-around mechanism called look-behind and look-ahead.
- Since dots that need to be removed have dot (or start of data ^) and some non-white-space \\S that is also non-digit \D character before it I can test it using (?<=(^|[.])[\\S&&\\D])[.].
- Also dot that needs to be removed have also non-white-space and non-digit character and another dot (optionally end of data $) after it, which can be written as [.](?=[\\S&&\\D]([.]|$))
Depending on needs [\\S&&\\D] which beside letters also matches characters like !@#$%^&*()-_=+... can be replaced with [a-zA-Z] for only English letters, or \\p{IsAlphabetic} for all letters in Unicode.
0 讨论(0)
发布评论:

提交评论
- 加载中...