I'm trying to compute the tf-idf vector cosine similarity between two columns in a Pandas dataframe. One column contains a search query, the other contains a product title. The
With the kind help and solution posted by MaxU above, here is the full code that completed the task I was trying to achieve. In addition to avoiding the MemoryError, it also avoids the odd NaNs that appeared in the cosine-similarity calculations when I tried some "hacky" workarounds.
Note that the code below is a partial snippet: the large dataframe df_all_export (dimensions 186,134 x 5) has already been constructed in the full code.
I hope this helps others who are trying to calculate cosine similarity using tf-idf vectors, between search queries and matched documents. For such a common "problem" I struggled to find a clear solution implemented with SKLearn and Pandas.
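import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances as pcd

# Fit one vocabulary on both columns so queries and titles share the same feature space
clf = TfidfVectorizer()
clf.fit(df_all_export['search_term'] + " " + df_all_export['product_title'])

# Transform each column into a sparse tf-idf matrix (one row per dataframe row)
A = clf.transform(df_all_export['search_term'])
B = clf.transform(df_all_export['product_title'])

# Row-wise cosine similarity: pcd compares A[i] with B[i] only
cosine = 1 - pcd(A, B)
df_all_export['tfidf_cosine'] = cosine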
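For anyone who wants to try this without the full dataset, here is a minimal self-contained sketch of the same approach. The dataframe df_toy and its three rows are made up for illustration and are not from my data. It also compares the paired result with the diagonal of the all-pairs matrix from cosine_similarity, which I assume is the usual alternative. I can't say for certain what triggered the original MemoryError, but an all-pairs matrix for 186,134 rows would hold about 3.5e10 entries (roughly 277 GB as dense float64), while the paired version only computes one value per row.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, paired_cosine_distances

# Hypothetical toy data standing in for df_all_export
df_toy = pd.DataFrame({
    "search_term": ["angle bracket", "deck over", "rain shower head"],
    "product_title": [
        "simpson strong-tie 12-gauge angle",
        "behr premium textured deck over coating",
        "delta vero 1-handle shower only faucet",
    ],
})

# One shared vocabulary for both columns
vectorizer = TfidfVectorizer()
vectorizer.fit(df_toy["search_term"] + " " + df_toy["product_title"])
A = vectorizer.transform(df_toy["search_term"])
B = vectorizer.transform(df_toy["product_title"])

# Row-wise: one similarity per row, shape (n_rows,)
paired = 1 - paired_cosine_distances(A, B)

# All-pairs: an (n_rows, n_rows) matrix, of which only the diagonal is needed here
full = cosine_similarity(A, B)

df_toy["tfidf_cosine"] = paired
print(df_toy[["search_term", "tfidf_cosine"]])
print("matches diagonal of all-pairs matrix:", np.allclose(paired, np.diag(full)))
```

On a toy frame the two approaches agree, so the paired function is just a cheaper way to get the values you actually want when each query is only compared with its own product title.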