I want to split a string in Python.
Sample string:
Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more
Here is a working script, albeit a bit hackish:
import re

inp = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"
parts = re.findall(r'[A-Z]{2,}(?: [A-Z0-9.]+)*|(?![A-Z]{2})\w+(?: (?![A-Z]{2})\w+)*', inp)
print(parts)
This prints:
['Hi this is', 'ACT I. SCENE 1', 'and', 'SCENE 2', 'and this is', 'ACT II. SCENE 1',
'and', 'SCENE 2', 'and more']
An explanation of the regex logic, which uses an alternation to match one of two cases:
[A-Z]{2,}                  match TWO or more capital letters
(?: [A-Z0-9.]+)*           followed by zero or more words, consisting only of
                           capital letters, digits, or periods
|                          OR
(?![A-Z]{2})\w+            match a word which does NOT start with two capital letters
(?: (?![A-Z]{2})\w+)*      then match zero or more similar words
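If the one-liner is hard to read, the same pattern can be written with re.VERBOSE so each piece carries its own comment. This is only a sketch of the regex above: the helper name split_headings is made up for illustration, and the literal spaces are written as [ ] because verbose mode ignores unescaped whitespace.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE.
# Unescaped whitespace is ignored in verbose mode, so a literal space is written as "[ ]".
PATTERN = re.compile(r"""
    [A-Z]{2,}                    # two or more capital letters, e.g. ACT or SCENE
    (?:[ ][A-Z0-9.]+)*           # then zero or more words made of capitals, digits, or periods
    |                            # OR
    (?![A-Z]{2})\w+              # a word that does NOT start with two capital letters
    (?:[ ](?![A-Z]{2})\w+)*      # then zero or more similar words
""", re.VERBOSE)


def split_headings(text):
    """Hypothetical helper: return the matched runs of text, in order."""
    # The pattern has no capturing groups, so findall returns the full matches.
    return PATTERN.findall(text)


inp = "Hi this is ACT I. SCENE 1 and SCENE 2 and this is ACT II. SCENE 1 and SCENE 2 and more"
parts = split_headings(inp)
print(parts)
```

Keep in mind that words are joined only across single spaces, and anything \w does not match outside a heading (a comma, for instance) is silently skipped rather than kept in the output.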