Create a frequency matrix for bigrams from a list of tuples, using numpy or pandas

Submitted by 穿精又带淫゛_ on 2021-02-05 07:47:27

Question


I am very new to Python. I have a list of tuples that holds the bigrams I created.

This question is pretty close to my needs

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

Now I am trying to convert this into a frequency matrix.

The desired output is:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   0     0      0
of               0   0    0   0    0   0     0      0
the              0   0    0   0    0   0     0      0
to               0   0    0   0    0   0     0      0
use              0   0    1   0    0   0     0      0
we               1   0    0   0    0   0     0      0
what             0   0    0   1    0   0     0      0
words            0   1    0   0    0   0     0      0

How can I do this using numpy or pandas? Unfortunately, I can only find examples that use nltk.


Answer 1:


You can create a frequency data frame and address its cells by word:

import pandas as pd

words = sorted({item for t in my_list for item in t})
df = pd.DataFrame(0, columns=words, index=words)
# Row label = first word of the bigram, column label = second word
for first, second in my_list:
    df.at[first, second] += 1

output:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   0     0      0
of               0   0    0   0    0   0     0      0
the              0   0    0   0    0   0     0      0
to               0   0    0   0    0   0     0      0
use              0   0    1   0    0   0     0      0
we               1   0    0   0    0   0     0      0
what             0   0    0   1    0   0     0      0
words            0   1    0   0    0   0     0      0

Note that here the order within each bigram matters: the first word selects the row and the second word selects the column. If you don't care about order, sort the contents of each tuple first:

my_list = [tuple(sorted(i)) for i in my_list]
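The bigrams in the question are all distinct, so the effect of sorting is hard to see there. A minimal sketch with made-up data (the list `pairs` below is hypothetical, not from the question) shows how ordered and unordered counting differ when the same two words appear in both orders:

```python
from collections import Counter

# Hypothetical sample data: the same two words occur in both orders.
pairs = [('a', 'b'), ('b', 'a'), ('a', 'b')]

# Ordered: ('a', 'b') and ('b', 'a') are counted separately.
ordered = Counter(pairs)

# Unordered: sorting each tuple maps both orders onto ('a', 'b').
unordered = Counter(tuple(sorted(p)) for p in pairs)

print(ordered[('a', 'b')], ordered[('b', 'a')])  # 2 1
print(unordered[('a', 'b')])                     # 3
```

With sorted tuples, only the cells where the row word sorts before (or equals) the column word are ever filled, so the resulting matrix is upper triangular rather than symmetric.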

Another way is to do the counting with Counter, although I expect the performance to be similar (again, if the order within the bigrams matters, remove sorted from the frequency_list line):

from collections import Counter

frequency_list = Counter(tuple(sorted(i)) for i in my_list)
words=sorted(list(set([item for t in my_list for item in t])))
df = pd.DataFrame(0, columns=words, index=words)
for k,v in frequency_list.items():
  df.at[k[0],k[1]] = v

output:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   1     0      0
of               0   0    0   0    0   0     0      1
the              0   0    0   0    1   0     0      0
to               0   0    0   0    0   0     1      0
use              0   0    0   0    0   0     0      0
we               0   0    0   0    0   0     0      0
what             0   0    0   0    0   0     0      0
words            0   0    0   0    0   0     0      0
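Since the question also mentions numpy, the same ordered count can probably be done without pandas. The following is only a sketch under that assumption: the names `pos`, `rows`, `cols` and `matrix` are illustrative, and it maps each word to its position in the sorted vocabulary, then lets `np.add.at` accumulate the counts.

```python
import numpy as np

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

# Sorted vocabulary and a word -> position lookup (illustrative names).
words = sorted({w for pair in my_list for w in pair})
pos = {w: i for i, w in enumerate(words)}

# Row index = first word of each bigram, column index = second word.
rows = np.array([pos[first] for first, _ in my_list])
cols = np.array([pos[second] for _, second in my_list])

matrix = np.zeros((len(words), len(words)), dtype=int)
# np.add.at is unbuffered, so repeated (row, col) pairs are each counted;
# a plain matrix[rows, cols] += 1 would count a repeated pair only once.
np.add.at(matrix, (rows, cols), 1)

print(words)
print(matrix)
```

The labels are kept separately in `words`; if labelled output is needed, the array can be wrapped with `pd.DataFrame(matrix, index=words, columns=words)`.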



Answer 2:


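If speed is not a major concern, you could use a for loop. Note that this version only uses the first words as rows and the second words as columns, so the result is smaller than the square matrix in the question.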

import pandas as pd
import numpy as np
from itertools import product

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

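# Rows are the distinct first words, columns the distinct second words
index = pd.DataFrame(my_list)[0].unique()
columns = pd.DataFrame(my_list)[1].unique()
df = pd.DataFrame(np.zeros(shape=(len(index), len(columns)), dtype=int),
                  columns=columns, index=index)

# Assign with .loc[row, col]; the chained form df[col].loc[idx] may not write to df
for idx, col in product(index, columns):
    df.loc[idx, col] = my_list.count((idx, col))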

print(df)

Output:

       consider  to  the  of
we            1   0    0   0
what          0   1    0   0
use           0   0    1   0
words         0   0    0   1
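If the loop becomes too slow for a larger list, one possible alternative is `pd.crosstab`, which counts the pairs in a single call. This is a sketch rather than part of the answer above: the column names `first` and `second` and the variable `square` are chosen only for illustration, and the `reindex` step is an optional extra that pads the result out to the full square matrix shown in the question.

```python
import pandas as pd

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

# One row per bigram; the column names are illustrative.
pairs = pd.DataFrame(my_list, columns=['first', 'second'])

# Count how often each (first, second) combination occurs.
counts = pd.crosstab(pairs['first'], pairs['second'])

# Optional: expand to a square matrix over the whole vocabulary,
# filling the combinations that never occur with 0.
words = sorted(set(pairs['first']) | set(pairs['second']))
square = counts.reindex(index=words, columns=words, fill_value=0)

print(square)
```

Unlike the loop, `crosstab` returns its row and column labels in sorted order instead of order of first appearance.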


Source: https://stackoverflow.com/questions/62946067/create-a-frequency-matrix-for-bigrams-from-a-list-of-tuples-using-numpy-or-pand
