Split string into sentences in JavaScript

悲哀的现实 2020-11-29 06:08

Currently I am working on an application that splits a long column into short ones. To do that I split the entire text into words, but at the moment my regex splits numbers too.

8 Answers
  •  刺人心 (original poster)
    2020-11-29 06:27

    str.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|")
    

    Output:

    [ 'This is a long string with some numbers [125.000,55 and 140.000] and an end.',
      'This is another sentence.' ]
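
To reproduce the output above, here is a minimal runnable sketch. The input string is an assumption reconstructed from the printed result (the post does not show it), and `splitSentences` is a hypothetical wrapper name, not something from the original answer.

```javascript
// Hypothetical wrapper around the one-liner from the answer.
function splitSentences(str) {
  return str
    .replace(/([.?!])\s*(?=[A-Z])/g, "$1|") // mark each sentence boundary with a pipe
    .split("|");                             // cut the string at every marker
}

// Assumed input, reconstructed from the output shown above.
const input =
  "This is a long string with some numbers [125.000,55 and 140.000] and an end. " +
  "This is another sentence.";

const sentences = splitSentences(input);
console.log(sentences);
```

The dots inside 125.000,55 and 140.000 are followed by digits, not capital letters, so they are left alone.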
    

    Breakdown:

    ([.?!]) = Capture either . or ? or !

    \s* = Match 0 or more whitespace characters following the previous token ([.?!]). This accounts for the spaces that normally follow a punctuation mark in English. Because the whitespace sits outside the capturing group, it is matched but not kept in the replacement.

    (?=[A-Z]) = A positive lookahead: the previous tokens only match if the next character is in the range A-Z (capital A to capital Z), and that character is not consumed by the match. Most English sentences start with a capital letter. None of the previously posted regexes take this into account.
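
As an illustration of what the lookahead in the breakdown contributes, the sketch below counts matches with and without it. The sample text and the variable names are made up for this example.

```javascript
const text = "It costs 125.000,55 today. Next one.";

// Without the lookahead, every . ? ! counts as a boundary,
// including the dot inside the number and the final dot.
const naive = [...text.matchAll(/([.?!])\s*/g)];

// With the lookahead, only punctuation followed by a capital letter counts.
const guarded = [...text.matchAll(/([.?!])\s*(?=[A-Z])/g)];

console.log(naive.length);   // 3
console.log(guarded.length); // 1

// The lookahead is zero-width: the match covers ". " but not the "N" after it.
const m = guarded[0];
console.log(JSON.stringify(m[0]));        // ". "
console.log(text[m.index + m[0].length]); // N
```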


    The replace operation uses:

    "$1|"
    

    We used one "capturing group", ([.?!]), which captures a single punctuation character. The whole match (the punctuation plus any trailing whitespace) is replaced with $1 (the captured character) plus |. So if we captured ? then the replacement would be ?|.

    Finally, we split on the pipes | and get our result.
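
The pipe approach assumes the text itself never contains a | character. As a hedged alternative that is not part of the original answer, the sketch below splits directly with a lookbehind, which requires a JavaScript engine with ES2018 lookbehind support. The sample text is invented for illustration.

```javascript
const text = "Use a|b as the pattern. Then run it.";

// Original approach: the literal pipe in the text becomes an extra split point.
const viaPipe = text.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|");

// Alternative: a lookbehind keeps the punctuation in place,
// and the split happens on the whitespace gap itself.
const viaLookbehind = text.split(/(?<=[.?!])\s*(?=[A-Z])/);

console.log(viaPipe);       // [ 'Use a', 'b as the pattern.', 'Then run it.' ]
console.log(viaLookbehind); // [ 'Use a|b as the pattern.', 'Then run it.' ]
```

If lookbehind is not available, choosing a marker that cannot appear in the input (instead of |) is another option.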


    So, essentially, what we are saying is this:

    1) Find punctuation marks (one of . or ? or !) and capture them

    2) Punctuation marks can optionally be followed by whitespace.

    3) After a punctuation mark, I expect a capital letter.

    Unlike the previous regular expressions provided, this follows the basic English convention that a sentence ends with punctuation and the next one begins with a capital letter.

    From there:

    4) We replace each match with the captured punctuation mark followed by a pipe |

    5) We split on the pipes to create an array of sentences.
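
To see where the capital-letter rule from the steps above holds and where it does not, here are a few probe inputs. The helper name `splitOnCapital` and the sample strings are hypothetical.

```javascript
// Hypothetical helper: the same replace-then-split steps as in the answer.
const splitOnCapital = (s) =>
  s.replace(/([.?!])\s*(?=[A-Z])/g, "$1|").split("|");

// An ellipsis stays intact; only the last dot is followed by a capital.
const ellipsis = splitOnCapital("Wait... What?");
console.log(ellipsis); // [ 'Wait...', 'What?' ]

// Sentences that start in lowercase are not separated at all.
const lowercase = splitOnCapital("it works. really? yes!");
console.log(lowercase); // [ 'it works. really? yes!' ]

// Abbreviations followed by a capitalized name are split too early.
const abbreviation = splitOnCapital("Mr. Smith arrived. He sat down.");
console.log(abbreviation); // [ 'Mr.', 'Smith arrived.', 'He sat down.' ]
```

So the rule works well for conventionally punctuated prose, while abbreviations and lowercase sentence starts need extra handling.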
