Extracting words from a string, removing punctuation and returning a list with separated words

小蘑菇 2020-12-03 07:50

I was wondering how to implement a function get_words() that returns the words in a string in a list, stripping away the punctuation.

How I would like t

3 answers
  • 2020-12-03 08:22

    Try to use re:

    >>> import re
    >>> [w for w in re.split(r'\W', 'Hello world, my name is...James!') if w]
    ['Hello', 'world', 'my', 'name', 'is', 'James']
    

    Although I'm not sure that it will catch all your use cases.

    If you would rather solve it the other way round, you can specify which characters are allowed in the result:

    >>> import re, string
    >>> re.findall('[%s]+' % string.ascii_letters, 'Hello world, my name is...James!')
    ['Hello', 'world', 'my', 'name', 'is', 'James']
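
The two patterns above agree on this sample but can differ on other input. A minimal sketch, assuming Python 3 and a made-up test sentence, of where they part ways: `\W` treats digits, underscores and accented letters as word characters, while `string.ascii_letters` keeps only A-Z and a-z.

```python
import re
import string

# Hypothetical sample chosen to contain a digit, an underscore and an accent.
sample = "Élise owns 2 cats, named snake_case and Bob."

# Split on non-word characters and drop the empty strings
# left behind by adjacent delimiters.
split_words = [w for w in re.split(r'\W', sample) if w]

# Keep only runs of ASCII letters.
ascii_words = re.findall('[%s]+' % string.ascii_letters, sample)

print(split_words)
print(ascii_words)
```

Which behaviour is right depends on whether numbers, identifiers and non-English names should count as words.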
    
  • 2020-12-03 08:22

    All you need is a tokenizer. Have a look at nltk and especially at WordPunctTokenizer.
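
For context, NLTK describes WordPunctTokenizer as splitting text into alphabetic and non-alphabetic runs with the pattern `\w+|[^\w\s]+`, so punctuation comes back as separate tokens rather than disappearing. The sketch below imitates that behaviour with the standard library only and then filters the punctuation tokens out; `word_punct_tokens` and `words_only` are hypothetical names, not part of NLTK.

```python
import re

# Assumption: this pattern mirrors the one NLTK documents for WordPunctTokenizer.
_WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokens(text):
    """Return word runs and punctuation runs as separate tokens."""
    return _WORD_PUNCT.findall(text)

def words_only(text):
    """Keep only the tokens that contain at least one letter or digit."""
    return [t for t in word_punct_tokens(text) if any(c.isalnum() for c in t)]

print(word_punct_tokens('Hello world, my name is...James!'))
print(words_only('Hello world, my name is...James!'))
```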

  • 2020-12-03 08:27

    This has nothing to do with splitting and punctuation; you just care about the letters (and numbers), and just want a regular expression:

    import re

    def get_words(text):
        # \w+ matches each run of letters, digits and underscores
        return re.findall(r'\w+', text)
    

    Demo:

    >>> re.findall(r'\w+', 'Hello world, my name is...James the 2nd!')
    ['Hello', 'world', 'my', 'name', 'is', 'James', 'the', '2nd']
    

    If you don't care about numbers, replace \w with [A-Za-z] for letters only, or with [A-Za-z'] to keep contractions. Note that \w also matches the underscore. To match letters outside ASCII (e.g. letters with accents) while still excluding digits, a character class such as [^\W\d_] works in Python 3.
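
The accented-letter case can be handled in Python 3, where `str` patterns are Unicode-aware by default, with the class `[^\W\d_]`: a word character that is neither a digit nor an underscore, which leaves letters only. A minimal sketch under that assumption; the function names are made up for illustration.

```python
import re

# A word character that is not a digit and not an underscore, i.e. a letter.
LETTERS = re.compile(r"[^\W\d_]+")

# Same idea, but an apostrophe between letters is kept so contractions survive.
LETTERS_WITH_APOSTROPHE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def get_letter_words(text):
    return LETTERS.findall(text)

def get_words_with_contractions(text):
    return LETTERS_WITH_APOSTROPHE.findall(text)

print(get_letter_words("Élise isn't the 2nd!"))
print(get_words_with_contractions("Élise isn't the 2nd!"))
```

Note that dropping digits turns '2nd' into 'nd', which may or may not be what is wanted.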


    I almost answered this question here: Split Strings with Multiple Delimiters?

    But your question is actually under-specified: Do you want 'this is: an example' to be split into:

    • ['this', 'is', 'an', 'example']
    • or ['this', 'is', '', 'an', 'example']?

    I assumed it was the first case.
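
The two cases diverge as soon as two delimiters sit next to each other, as the colon and the space do here. A short illustration, assuming the splitting would be done with `re.split`:

```python
import re

text = 'this is: an example'

# Splitting on single non-word characters yields an empty string
# for the gap between ':' and the space that follows it.
with_empties = re.split(r'\W', text)

# Collecting runs of word characters never produces empty strings.
without_empties = re.findall(r'\w+', text)

print(with_empties)
print(without_empties)
```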


    ['this', 'is', 'an', 'example'] is what I want. Is there a method without importing regex? If we can just replace the non-ascii_letters with '', then split the string into words in a list, would that work? – James Smith 2 mins ago

    The regexp is the most elegant, but yes, you could do this as follows:

    def get_words(text):
        """
        Returns a list of words, where a word is defined as a
        maximal run of alphanumeric characters, as defined by
        str.isalnum().

        >>> get_words('Hello world, my name is... Élise!')  # works in Python 3
        ['Hello', 'world', 'my', 'name', 'is', 'Élise']
        """
        # Turn every non-alphanumeric character into a space, then split on whitespace.
        return ''.join((c if c.isalnum() else ' ') for c in text).split()
    

    Use .isalpha() instead of .isalnum() if digits should not count as part of a word.
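
On the commenter's idea of replacing the unwanted characters with '': that would also delete the spaces and glue every word together, which is why the function above substitutes a space instead. A small sketch of the difference; `replace_and_split` and its `filler` parameter are hypothetical names used only for this comparison.

```python
def replace_and_split(text, filler):
    """Replace every non-letter with `filler`, then split on whitespace."""
    return ''.join(c if c.isalpha() else filler for c in text).split()

sentence = 'Hello world, my name is...James!'

glued = replace_and_split(sentence, '')       # spaces vanish along with punctuation
separated = replace_and_split(sentence, ' ')  # word boundaries survive

print(glued)
print(separated)
```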


    Sidenote: You could also do the following, though it requires importing another module from the standard library:

    from itertools import groupby

    # groupby is generally overkill and makes for unreadable code
    # ... but is fun

    def get_words(text):
        # Group consecutive characters by whether they are alphanumeric,
        # then keep only the alphanumeric groups.
        return [
            ''.join(chars)
            for is_word, chars in groupby(text, key=str.isalnum)
            if is_word
        ]
    

    If this is homework, they're probably looking for an imperative thing like a two-state Finite State Machine where the state is "was the last character a letter" and if the state changes from letter -> non-letter then you output a word. Don't do that; it's not a good way to program (though sometimes the abstraction is useful).
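
For reference, the two-state machine described above could look roughly like this. It is one possible reading of the description, not code from the answer; `get_words_fsm` is a hypothetical name, and "letter" is taken to mean `str.isalpha()`.

```python
def get_words_fsm(text):
    """Collect words with an explicit 'was the last character a letter' state."""
    words = []
    current = []     # letters of the word being built
    in_word = False  # state: was the previous character a letter?

    for ch in text:
        if ch.isalpha():
            current.append(ch)
            in_word = True
        elif in_word:
            # Transition letter -> non-letter: emit the finished word.
            words.append(''.join(current))
            current = []
            in_word = False

    if in_word:
        # The text ended in the middle of a word.
        words.append(''.join(current))
    return words

print(get_words_fsm('Hello world, my name is...James'))
```

The final check after the loop is the part that is easy to forget, which is one reason the one-line versions are less error-prone.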
