Counting the frequency of words in a pandas DataFrame

天命终不由人 2020-12-04 18:01

I have a table like below:

      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd

3 Answers
  • 2020-12-04 18:28

    Use str.lower and then str.cat to concatenate all values into one string, tokenize it with word_tokenize, and finally apply your own solution:

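    import nltk
    import pandas as pd

    top_N = 4
    # data is the DataFrame from the question
    # lowercase everything (optional) and join all names into one string
    a = data['Firm_Name'].str.lower().str.cat(sep=' ')
    words = nltk.tokenize.word_tokenize(a)
    word_dist = nltk.FreqDist(words)
    print(word_dist)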
    <FreqDist with 17 samples and 20 outcomes>
    
    rslt = pd.DataFrame(word_dist.most_common(top_N),
                        columns=['Word', 'Frequency'])
    print(rslt)
          Word  Frequency
    0  society          3
    1      ltd          2
    2      the          1
    3       co          1
    

    It is also possible to drop lower if case should be preserved:

    top_N = 4
    a = data['Firm_Name'].str.cat(sep=' ')
    words = nltk.tokenize.word_tokenize(a)
    word_dist = nltk.FreqDist(words)
    rslt = pd.DataFrame(word_dist.most_common(top_N),
                        columns=['Word', 'Frequency'])
    print(rslt)
             Word  Frequency
    0     Society          3
    1         Ltd          2
    2         MMV          1
    3  Kensington          1
    
  • 2020-12-04 18:29

    IIUC, use value_counts():

    In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
    Out[3361]:
    Society       3
    Ltd           2
    James's       1
    R.X.          1
    Yah           1
    Associates    1
    St            1
    Kensington    1
    MMV           1
    Big           1
    &             1
    The           1
    Co            1
    Oil           1
    Building      1
    dtype: int64
    

    Or,

    pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()
    

    Or,

    pd.Series(' '.join(df.Firm_Name).split()).value_counts()
    

    For the top N, for example 3:

    In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
    Out[3379]:
    Society    3
    Ltd        2
    James's    1
    dtype: int64
    

    Details

    In [3380]: df
    Out[3380]:
          URN                   Firm_Name
    0  104472               R.X. Yah & Co
    1  104873        Big Building Society
    2  109986          St James's Society
    3  114058  The Kensington Society Ltd
    4  113438      MMV Oil Associates Ltd
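
    The split-then-count idea above can also be sketched with Series.explode, which is a swapped-in alternative to expand=True followed by stack() and is not part of the original answer. It assumes pandas 0.25 or later; the name top_n and its value are illustrative only.

```python
import pandas as pd

# Sample data copied from the "Details" frame in this answer.
df = pd.DataFrame({
    "URN": [104472, 104873, 109986, 114058, 113438],
    "Firm_Name": [
        "R.X. Yah & Co",
        "Big Building Society",
        "St James's Society",
        "The Kensington Society Ltd",
        "MMV Oil Associates Ltd",
    ],
})

top_n = 2  # illustrative value, not taken from the original answer

# Split each name on whitespace, put every word on its own row, then count.
counts = df["Firm_Name"].str.split().explode().value_counts()

# value_counts() sorts by count in descending order, so head() gives the top N.
top_words = counts.head(top_n)
print(top_words)
```

    Words that tie on count may come back in a different order between pandas versions, so only the relative order of words with different counts should be relied on.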
    
  • 2020-12-04 18:47

    The approach from the answer "Count distinct words from a Pandas Data Frame" can also be used. It creates a Counter and updates it with the words of each row.

    from collections import Counter
    import pandas as pd

    c = Counter()
    df = pd.DataFrame(
        [[104472,"R.X. Yah & Co"],
        [104873,"Big Building Society"],
        [109986,"St James's Society"],
        [114058,"The Kensington Society Ltd"],
        [113438,"MMV Oil Associates Ltd"]
    ], columns=["URN","Firm_Name"])
    # split each name on whitespace and feed the words into the counter;
    # apply() is used only for its side effect on c
    df.Firm_Name.str.split().apply(c.update)
    
    Counter({'R.X.': 1,
             'Yah': 1,
             '&': 1,
             'Co': 1,
             'Big': 1,
             'Building': 1,
             'Society': 3,
             'St': 1,
             "James's": 1,
             'The': 1,
             'Kensington': 1,
             'Ltd': 2,
             'MMV': 1,
             'Oil': 1,
             'Associates': 1})
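
    If only the top N words are wanted, as in the first answer, Counter.most_common can be applied to the counter. The following is a minimal sketch, not part of the original answer: a plain list stands in for df.Firm_Name so that it runs without pandas, and the names firm_names and top_n are illustrative.

```python
from collections import Counter

# Plain list standing in for df.Firm_Name (same five names as above).
firm_names = [
    "R.X. Yah & Co",
    "Big Building Society",
    "St James's Society",
    "The Kensington Society Ltd",
    "MMV Oil Associates Ltd",
]

top_n = 3  # illustrative value

c = Counter()
for name in firm_names:
    # Split on whitespace and add this row's words to the running counts.
    c.update(name.split())

# most_common() returns (word, count) pairs, highest count first.
top_words = c.most_common(top_n)
print(top_words)
```

    The resulting list of pairs can be passed to pd.DataFrame with columns=['Word', 'Frequency'], as the first answer does with FreqDist.most_common.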
    