Splitting on regex without removing delimiters

花落未央 2020-12-11 20:15

So, I would like to split this text into sentences.

s = "You! Are you Tom? I am Danny."

so I get:

["You!", "Are you To


        
5 Answers
  • 2020-12-11 20:34

    If you prefer to use the split method rather than match, one solution is to split with a capturing group:

    splitted = list(filter(None, re.split(r'(.*?[.!?])', s)))
    

    filter removes any empty strings.

    This works even if there are no spaces between sentences, or if you need to catch a trailing sentence that ends with a different punctuation mark, such as a Unicode ellipsis (or has none at all).

    It is even possible to keep your regex as is (after correcting the escaping and adding parentheses):

    splitted = list(filter(None, re.split(r'([.!?])', s)))
    

    Then merge the even-indexed and odd-indexed elements (sentence bodies and delimiters) and remove the extra spaces.
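The merge step just described might look like the minimal sketch below. The helper name `split_keep_delimiters` is hypothetical (it is not part of the original answer), and the sketch assumes the only delimiters are `.`, `!` and `?`.

```python
import re

def split_keep_delimiters(text):
    # Splitting on a capturing group keeps the delimiters in the result:
    # even indices hold sentence bodies, odd indices hold the delimiters.
    parts = re.split(r'([.!?])', text)
    bodies = parts[0::2]
    # Pad with '' so trailing text without a delimiter is still paired up.
    delimiters = parts[1::2] + ['']
    sentences = (body.strip() + delim for body, delim in zip(bodies, delimiters))
    # Drop the empty string left over after the final delimiter.
    return [sentence for sentence in sentences if sentence]

s = "You! Are you Tom? I am Danny."
print(split_keep_delimiters(s))
# ['You!', 'Are you Tom?', 'I am Danny.']
```

Stripping each body before re-attaching its delimiter is what removes the leading space that the plain split leaves on every sentence after the first.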

    Python split() without removing the delimiter

  • 2020-12-11 20:40

    Strictly speaking, you don't want to split on '!?.', but rather on the whitespace that follows those characters. The following will work:

    >>> import re
    >>> re.split(r'(?<=[.!?])\s+', s)
    ['You!', 'Are you Tom?', 'I am Danny.']
    

    This splits on whitespace, but only if it is preceded by either a ., !, or ? character.

  • 2020-12-11 20:48

    If your regex engine supports splitting on zero-length matches, you can achieve this by matching an empty string preceded by one of the delimiters:

    (?<=[.!?])
    

    Demo: https://regex101.com/r/ZLDXr1/1

    Python's re.split supports zero-length matches only since version 3.7; earlier versions do not. The pattern is also useful in other languages that support lookbehinds.

    However, based on your input/output samples, what you actually need is to split on the whitespace preceded by one of the delimiters. So the regex would be:

    (?<=[.!?])\s+
    

    Demo: https://regex101.com/r/ZLDXr1/2

    Python demo: https://ideone.com/z6nZi5

    If the spaces are optional, the re.findall solution suggested by @Psidom is the best one, I believe.

  • 2020-12-11 20:49

    You can use re.findall with the regex .*?[.!?]; the lazy quantifier *? makes each match stop at the first delimiter it reaches:

    import re
    
    s = """You! Are you Tom? I am Danny."""
    re.findall(r'.*?[.!?]', s)
    # ['You!', ' Are you Tom?', ' I am Danny.']
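As the output above shows, every match after the first keeps its leading space. A possible variation, sketched here under the assumption that surrounding whitespace is never wanted and that trailing text without a delimiter should be kept, strips each match and also accepts the end of the string as a terminator. The function name `find_sentences` is made up for this illustration.

```python
import re

def find_sentences(text):
    # .+? lazily consumes characters up to the first ., ! or ?;
    # the (?:[.!?]|$) alternative also accepts the end of the string,
    # so trailing text without a delimiter is not dropped.
    matches = re.findall(r'.+?(?:[.!?]|$)', text, flags=re.DOTALL)
    # Strip surrounding whitespace and discard whitespace-only matches.
    return [m.strip() for m in matches if m.strip()]

print(find_sentences("You!Are you Tom? I am Danny"))
# ['You!', 'Are you Tom?', 'I am Danny']
```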
    
  • 2020-12-11 20:56

    The easiest way is to use nltk:

    import nltk
    # Assumes the NLTK sentence tokenizer data (Punkt) has already been downloaded.
    nltk.sent_tokenize(s)
    

    It returns a list of all your sentences without losing the delimiters.
