I am trying to create a term density matrix from a pandas dataframe, so I can rate terms appearing in the dataframe. I also want to be able to keep the 'spatial' aspect of my
You can use scikit-learn's CountVectorizer:
In [14]: from sklearn.feature_extraction.text import CountVectorizer
In [15]: countvec = CountVectorizer()
In [16]: countvec.fit_transform(df.title)
Out[16]:
<4x8 sparse matrix of type '<class 'numpy.int64'>'
with 9 stored elements in Compressed Sparse Column format>
It returns the term-document matrix in a sparse representation because such a matrix is usually huge and, well, sparse.
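To get a feel for how sparse the result is, you can inspect the matrix directly. This is only a rough sketch: the `titles` list is a hypothetical stand-in for `df.title`, chosen to be consistent with the 4x8 shape shown above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical titles (an assumption, not the asker's real data)
titles = ["delicious boiled egg", "fried egg", "split orange", "something else"]

X = CountVectorizer().fit_transform(titles)

n_docs, n_terms = X.shape
stored = X.nnz  # number of explicitly stored (non-zero) counts
density = stored / (n_docs * n_terms)
print(n_docs, n_terms, stored, round(density, 3))
```

On a real corpus the density is typically far lower, which is why the sparse format is the default.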
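For a small example like yours, I guess converting it back to a DataFrame would still work (on scikit-learn 1.0 or later, call get_feature_names_out() instead, since get_feature_names() was removed in 1.2):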
In [17]: pd.DataFrame(countvec.fit_transform(df.title).toarray(), columns=countvec.get_feature_names())
Out[17]:
   boiled  delicious  egg  else  fried  orange  something  split
0       1          1    1     0      0       0          0      0
1       0          0    1     0      1       0          0      0
2       0          0    0     0      0       1          0      1
3       0          0    0     1      0       0          1      0
[4 rows x 8 columns]