Efficient way to create term density matrix from pandas DataFrame


自闭症患者 2021-02-06 07:15

I am trying to create a term density matrix from a pandas DataFrame so that I can rate the terms appearing in it. I also want to be able to keep the 'spatial' aspect of my

2 Answers

感动是毒 (original poster)

2021-02-06 08:09

You can use scikit-learn's CountVectorizer:

In [14]: from sklearn.feature_extraction.text import CountVectorizer

In [15]: countvec = CountVectorizer()

In [16]: countvec.fit_transform(df.title)
Out[16]: 
<4x8 sparse matrix of type '<type 'numpy.int64'>'
    with 9 stored elements in Compressed Sparse Column format>

It returns the term-document matrix in a sparse representation because such a matrix is usually huge and, well, sparse.
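To make the sparsity point concrete, here is a minimal sketch that works on the sparse result directly. The `titles` list is a hypothetical stand-in for `df.title`, guessed from the terms in this thread's example; it is not the asker's actual data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for df.title (an assumption, not the asker's real data).
titles = ["Delicious boiled egg", "Fried egg", "Split orange", "Something else"]

countvec = CountVectorizer()
counts = countvec.fit_transform(titles)  # SciPy sparse matrix, one row per title

n_docs, n_terms = counts.shape
density = counts.nnz / (n_docs * n_terms)  # fraction of cells that are non-zero

# Column sums give the total count of each term without densifying the matrix.
term_totals = np.asarray(counts.sum(axis=0)).ravel()

# vocabulary_ maps each term to its column index.
egg_total = term_totals[countvec.vocabulary_["egg"]]
print(n_docs, n_terms, counts.nnz, density, egg_total)
```

Only the non-zero cells are stored, so this scales to far more documents and terms than a dense array of the same shape would.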

For your particular example I guess converting it back to a DataFrame would still work:

In [17]: pd.DataFrame(countvec.fit_transform(df.title).toarray(), columns=countvec.get_feature_names())
Out[17]: 
   boiled  delicious  egg  else  fried  orange  something  split
0       1          1    1     0      0       0          0      0
1       0          0    1     0      1       0          0      0
2       0          0    0     0      0       1          0      1
3       0          0    0     1      0       0          1      0

[4 rows x 8 columns]
