Extracting whole words

借酒劲吻你 2020-12-03 15:38

I have a large set of real-world text that I need to pull words out of to input into a spell checker. I'd like to extract as many meaningful words as possible with

4 answers
  •  遥遥无期
    2020-12-03 16:11

    If you restrict yourself to ASCII letters, then use (with the re.I option set)

    \b[a-z]+\b
    

    \b is a word boundary anchor, matching only at the start and end of alphanumeric "words". So \b[a-z]+\b matches pie, but not pie21 or 21pie.
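As a minimal sketch of the ASCII-only pattern in Python (the sample string and the helper name `ascii_words` are made up for illustration, not part of the question):

```python
import re

# Runs of the letters a-z between word boundaries.
# re.I makes [a-z] match upper-case letters as well.
ASCII_WORD = re.compile(r"\b[a-z]+\b", re.I)

def ascii_words(text):
    """Return every whole word made up only of the letters a-z."""
    return ASCII_WORD.findall(text)

sample = "Apple pie, pie21 and 21pie"
print(ascii_words(sample))  # ['Apple', 'pie', 'and']
```

Note that pie21 and 21pie are skipped entirely rather than trimmed down to pie, because there is no word boundary between a letter and a digit.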

    To also allow non-ASCII letters, you can use something like this:

    \b[^\W\d_]+\b
    

    which also matches accented characters etc. In Python 2 you need to set the re.UNICODE option (and match against unicode strings) so that the \w and \W shorthands recognize non-ASCII letters; in Python 3, str patterns are Unicode-aware by default.

    The negated character class [^\W\d_] matches any word character (anything \w matches) except digits and the underscore, which in practice leaves letters.
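A rough sketch of the Unicode-aware variant, assuming Python 3, where str patterns handle Unicode without extra flags (the helper name `letter_words` and the sample text are hypothetical):

```python
import re

# Word characters minus decimal digits and underscore, between word boundaries.
# On Python 2 this would need re.UNICODE and unicode strings; Python 3 does not.
LETTER_WORD = re.compile(r"\b[^\W\d_]+\b")

def letter_words(text):
    """Return whole words that contain no digits or underscores."""
    return LETTER_WORD.findall(text)

sample = "Café crème, pie_1 pie2 über"
print(letter_words(sample))  # ['Café', 'crème', 'über']
```

As with the ASCII pattern, tokens such as pie_1 and pie2 are dropped as a whole, since the underscore and the digit are themselves word characters and so no boundary separates them from the letters.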
