Count occurrences of certain words in pandas dataframe

北荒 2020-12-01 07:53

I want to count the number of occurrences of certain words in a data frame. I know about using "str.contains":

a = df2[df2['col1'].str.contains("sample")].groupby


        
2 Answers
  • 2020-12-01 08:45

    To count the total number of matches, use s.str.match(...).str.get(0).count().

    If your regex will be matching several unique words, to be tallied individually, use s.str.match(...).str.get(0).groupby(lambda x: x).count()

    It works like this:

    In [12]: s
    Out[12]: 
    0    ax
    1    ay
    2    bx
    3    by
    4    bz
    dtype: object
    

    The match string method handles regular expressions...

    In [13]: s.str.match('(b[x-y]+)')
    Out[13]: 
    0       []
    1       []
    2    (bx,)
    3    (by,)
    4       []
    dtype: object
    

    ...but the results, as given, are not very convenient. The string method get takes the matches as strings and converts empty results to NaNs...

    In [14]: s.str.match('(b[x-y]+)').str.get(0)
    Out[14]: 
    0    NaN
    1    NaN
    2     bx
    3     by
    4    NaN
    dtype: object
    

    ...which are not counted.

    In [15]: s.str.match('(b[x-y]+)').str.get(0).count()
    Out[15]: 2
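
The session above appears to come from an older pandas release; in recent versions, str.match returns a boolean Series rather than the captured groups. A minimal sketch of the same tally, assuming a current pandas and swapping in str.extract (which the answer above does not use), might look like this. The leading ^ is added to mimic the start-of-string anchoring of str.match.

```python
import pandas as pd

s = pd.Series(["ax", "ay", "bx", "by", "bz"])

# str.extract returns the first capture group, or NaN where nothing matched.
# expand=False yields a Series instead of a one-column DataFrame.
matched = s.str.extract(r"^(b[x-y]+)", expand=False)

# count() skips NaN, so this is the total number of matching rows.
total = matched.count()

# value_counts() tallies each distinct matched word separately.
per_word = matched.value_counts()

print(total)
print(per_word)
```

Here value_counts() stands in for the groupby(lambda x: x).count() step; both tally each unique match individually.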
    
  • 2020-12-01 08:49

    Update: Original answer counts those rows which contain a substring.

    To count all the occurrences of a substring you can use .str.count:

    In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])
    
    In [22]: df.words.str.count("he|wo")
    Out[22]:
    0    1
    1    1
    2    2
    Name: words, dtype: int64
    
    In [23]: df.words.str.count("he|wo").sum()
    Out[23]: 4
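
If the words need to be tallied separately rather than as one combined total, one possible sketch is to call .str.count once per word. The list name targets is hypothetical, and re.escape is used on the assumption that the words should be matched literally rather than as regular expressions.

```python
import re

import pandas as pd

df = pd.DataFrame(["hello", "world", "hehe"], columns=["words"])

# Hypothetical list of words to look for (not part of the original answer).
targets = ["he", "wo"]

# One pass per word: count occurrences in every row, then sum over the column.
counts = {
    word: int(df["words"].str.count(re.escape(word)).sum())
    for word in targets
}

print(counts)
```

The per-word totals add up to the combined figure produced by the "he|wo" pattern above.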
    

    The str.contains method accepts a regular expression:

    Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
    Docstring:
    Check whether given pattern is contained in each string in the array
    
    Parameters
    ----------
    pat : string
        Character sequence or regular expression
    case : boolean, default True
        If True, case sensitive
    flags : int, default 0 (no flags)
        re module flags, e.g. re.IGNORECASE
    na : default NaN, fill value for missing values.
    

    For example:

    In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])
    
    In [12]: df
    Out[12]:
       words
    0  hello
    1  world
    
    In [13]: df.words.str.contains(r'[hw]')
    Out[13]:
    0    True
    1    True
    Name: words, dtype: bool
    
    In [14]: df.words.str.contains(r'he|wo')
    Out[14]:
    0    True
    1    True
    Name: words, dtype: bool
    

    To count the rows that contain a match (not the total number of occurrences), you can just sum this boolean Series:

    In [15]: df.words.str.contains(r'he|wo').sum()
    Out[15]: 2
    
    In [16]: df.words.str.contains(r'he').sum()
    Out[16]: 1
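
To make the difference between the two approaches concrete, here is a small sketch that puts them side by side on a column with a missing value. The sample data is made up for illustration; case, na and flags are the parameters listed in the docstring above.

```python
import re

import numpy as np
import pandas as pd

# Made-up sample data, including a missing value.
words = pd.Series(["Hello", "world", "hehe", np.nan])

# Rows containing the pattern at least once; na=False treats missing
# values as "no match" so the boolean Series can be summed safely.
rows_with_match = words.str.contains("he", case=False, na=False).sum()

# Total occurrences across all rows; missing values stay NaN and are
# skipped by sum().
total_matches = words.str.count("he", flags=re.IGNORECASE).sum()

print(rows_with_match, total_matches)
```

"hehe" contributes one row to the first figure but two occurrences to the second, which is why the two numbers differ.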
    