How to use REGEX to split text to chunks, broken on specific chars?

↘锁芯ラ submitted on 2021-02-17 02:42:29

Question


  1. I want to split a long text into chunks of at most 1000 chars.
  2. Each chunk should take as many chars as it can but, importantly, should end at a line break, in order to avoid splitting a word in the middle.
  3. If there is no line break anywhere in those 1000 chars, the regex should still capture them, even though that splits a word across 2 chunks.

This regex, /.{1,1000}/gs, splits the text into chunks of up to 1000 chars, but it may break a word in the middle.

Which regex will give me the result I want?
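
To make the problem concrete, here is a minimal Python sketch of the /.{1,1000}/gs pattern from the question. The function name naive_chunks and the small limit of 8 are hypothetical, chosen only to keep the example readable; re.DOTALL stands in for the s flag.

```python
import re

def naive_chunks(text, limit=1000):
    # Hypothetical helper: re.DOTALL plays the role of the /s flag,
    # so "." also matches "\n".
    pattern = re.compile(r".{1,%d}" % limit, re.DOTALL)
    return pattern.findall(text)

sample = "alpha beta\ngamma delta\n"
chunks = naive_chunks(sample, limit=8)
print(chunks)  # ['alpha be', 'ta\ngamma', ' delta\n']
```

Nothing is lost, but "beta" is cut in the middle, which is exactly the behaviour the question wants to avoid.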


Answer 1:


You can use .{1,1000}\b, which ends each chunk at the last word boundary within the first 1000 chars.
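
A rough Python sketch of this suggestion, assuming re.DOTALL in place of the s flag. The helper name boundary_chunks and the tiny limits are hypothetical. The second call shows a caveat: when a single word is longer than the limit, the pattern cannot match at that position and the engine silently skips characters.

```python
import re

def boundary_chunks(text, limit=1000):
    # Hypothetical helper: greedy run of up to `limit` chars that must
    # end on a word boundary.
    pattern = re.compile(r".{1,%d}\b" % limit, re.DOTALL)
    return pattern.findall(text)

ok = boundary_chunks("the quick brown fox", limit=10)
print(ok)  # ['the quick ', 'brown fox']

# A 12-char word with limit 5: the start of the word is skipped entirely.
lossy = boundary_chunks("aaaaaaaaaaaa bb", limit=5)
print("".join(lossy) == "aaaaaaaaaaaa bb")  # False
```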




Answer 2:


From what I have understood, this should do the trick:

/(.{1,1000}$)|(.{1,1000})/gm

It will match either (.{1,1000}$), any char sequence of at most 1000 chars that ends at a line break (1001 if you count the line break itself, which $ does not consume),

Or

(.{1,1000}), which simply cuts a word when no line break was found.

/!\ Pay attention to the numbers: you might want to change them depending on whether or not you want to count the line break within the 1000 char limit.


Note: If you want to prevent words from breaking, you can use a word boundary instead of a line break as the separator, which gives you

(.{1,1000}\b)|(.{1,1000})
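
A minimal Python sketch of the word-boundary variant with its fallback. The name boundary_or_cut is hypothetical, and re.DOTALL is an assumption added here so that line breaks stay inside the chunks instead of being skipped.

```python
import re

def boundary_or_cut(text, limit=1000):
    # Hypothetical helper. First alternative: end on a word boundary.
    # Second alternative: hard cut when no boundary is reachable.
    pattern = re.compile(r"(.{1,%d}\b)|(.{1,%d})" % (limit, limit), re.DOTALL)
    # group(0) is the whole match, whichever alternative produced it.
    return [m.group(0) for m in pattern.finditer(text)]

text = "aaaaaaaaaaaa bb"
chunks = boundary_or_cut(text, limit=5)
print(chunks)  # ['aaaaa', 'aaaaa', 'aa bb']
```

Because the second alternative always matches at least one char, joining the chunks gives back the original text, even when a word is longer than the limit.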



Answer 3:


Try this; it uses PCRE syntax:

/(?=^.{1,1000}?$).*\n|.{1,1000}/gm

It first does a positive lookahead to ensure that the current line has at most 1000 characters. Then it matches up to and including the first line break. If the line has more than 1000 characters, it just matches the first 1000. The /g flag lets you do this multiple times, and the /m flag makes ^ and $ match at the start and end of each line rather than of the whole text.
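
The same pattern can be tried in Python, where re.MULTILINE corresponds to the m flag. This is only a sketch: the name line_chunks and the limit of 5 are hypothetical. It illustrates two details of the pattern: a short line is returned together with its line break (so that chunk is limit + 1 chars long), and the line break after an over-long line is not part of any chunk.

```python
import re

def line_chunks(text, limit=1000):
    # Hypothetical helper. Alternative 1: a whole line of at most `limit`
    # chars, plus its "\n". Alternative 2: up to `limit` chars of a
    # longer line ("." does not match "\n" here).
    pattern = re.compile(
        r"(?=^.{1,%d}?$).*\n|.{1,%d}" % (limit, limit), re.MULTILINE
    )
    return pattern.findall(text)

text = "short\n" + "x" * 12 + "\nend"
chunks = line_chunks(text, limit=5)
print(chunks)  # ['short\n', 'xxxxx', 'xxxxx', 'xx', 'end']
```

Note that each chunk covers at most one line, so this pattern does not pack several short lines into one chunk.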




Answer 4:


Try the following regex:

/(?<=^).{1,50}(?:\n|$)|.{1,50}(?:\n|$)|.{1,50}/gms

For testing purposes I used the quantifier "up to 50", but in the final version you should change it to 1000 or whatever other limit you choose.

It contains 3 alternatives:

  1. Up to n chars, including \n (s option). This chunk must start after a line break or from the start of the whole string (m option). It must end on a newline or at the end of the whole string.
  2. Up to n chars, ending as above.
  3. Up to n chars, with no other requirements.

The order of alternatives is important, because the regex engine tries them in the order of appearance.

For a working example see https://regex101.com/r/2aN49j/1
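
As a rough illustration, the three alternatives can be run in Python with re.MULTILINE | re.DOTALL standing in for the m and s flags. The helper name packed_chunks is hypothetical, and (?<=^) is written as a plain ^ here, which is an equivalent zero-width assertion.

```python
import re

def packed_chunks(text, limit=1000):
    # Hypothetical helper mirroring the three alternatives, in order:
    # 1. starts at a line start, ends at a line break or end of text
    # 2. ends at a line break or end of text
    # 3. plain cut of up to `limit` chars
    pattern = re.compile(
        r"^.{1,%d}(?:\n|$)|.{1,%d}(?:\n|$)|.{1,%d}" % (limit, limit, limit),
        re.MULTILINE | re.DOTALL,
    )
    return pattern.findall(text)

text = "one\ntwo\nthree four five six\nend"
chunks = packed_chunks(text, limit=10)
print(chunks)  # ['one\ntwo\n', 'three four', ' five six\n', 'end']
```

The first chunk packs two short lines together, the long line is cut only because it contains no line break within 10 chars, and joining the chunks restores the input. A chunk can be limit + 1 chars long, because the closing line break is matched outside the {1,n} quantifier.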




Answer 5:


You may want to try this: [\s\S]{1,999}\W|[\s\S]{1,1000}

Please see the Demo. I think it should meet all three requirements (at the end of the demo you'll find 'big words' too).

Explained:

  # Option 1: the chunk ends with a non-word character
  [\s\S]      # Any character (also \n)
  {1,999}     # repeated 1 to 999 times
  \W          # any non-word character
  # Option 2: (backup) Just the 1000 characters 
  #           (if no word boundary exists; for long words)
| [\s\S]
  {1,1000}
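
A short Python sketch of this pattern with a configurable limit. The name nonword_chunks is hypothetical; [\s\S] already matches line breaks, so no extra flag is needed.

```python
import re

def nonword_chunks(text, limit=1000):
    # Hypothetical helper. Option 1: up to limit - 1 chars followed by
    # one non-word char. Option 2: hard cut of up to `limit` chars.
    pattern = re.compile(
        r"[\s\S]{1,%d}\W|[\s\S]{1,%d}" % (limit - 1, limit)
    )
    return pattern.findall(text)

short_words = nonword_chunks("hello world foo", limit=8)
print(short_words)  # ['hello ', 'world ', 'foo']

big_word = nonword_chunks("aaaaaaaaaaaa bb cc", limit=5)
print(big_word)  # ['aaaaa', 'aaaaa', 'aa ', 'bb ', 'cc']
```

In both cases joining the chunks gives back the input, and only the word that is longer than the limit is cut.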


Source: https://stackoverflow.com/questions/51984385/how-to-use-regex-to-split-text-to-chunks-broken-on-specific-chars
