Question
I have a DataFrame of string values:
Keyword
plant
cell
cat
Pandas
I want to find the relationship, or correlation, between pairs of these string values.
I have tried pandas: corr = df1.corrwith(df2, axis=0)
But that only measures correlation between numerical values. I want to see whether two strings are related by computing a correlation distance between them. How can I do that?
Answer 1:
There are a few steps here. First, you need to extract some sort of vector for each word.
A good way is to use gensim with pretrained word2vec vectors (you need to download the files from here):
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)
After loading the pretrained vectors, extract the vector for each word:
vector = model['plant']
Or, for the pandas column in the question:
df['Vectors'] = df['Keyword'].apply(lambda x: model[x])
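One caveat with the lookup above: indexing the model with a word that is not in its vocabulary raises a KeyError. A minimal sketch of filtering the keywords first, assuming a plain dict of made-up 3-dimensional vectors (toy_model, hypothetical) as a stand-in for the loaded KeyedVectors so that it runs without the download. The `in` check works the same way on a KeyedVectors object.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded model: made-up 3-dimensional vectors.
# The real GoogleNews vectors have 300 dimensions.
toy_model = {
    "plant": np.array([0.9, 0.1, 0.0]),
    "cell": np.array([0.7, 0.3, 0.1]),
    "cat": np.array([0.0, 0.2, 0.9]),
}

# "notaword" is an invented keyword that the stand-in model does not contain.
df = pd.DataFrame({"Keyword": ["plant", "cell", "cat", "notaword"]})

# Keep only the keywords the model knows, then look up their vectors.
in_vocab = df["Keyword"].apply(lambda w: w in toy_model)
known = df[in_vocab].reset_index(drop=True)
known["Vectors"] = known["Keyword"].apply(lambda w: toy_model[w])

print(known["Keyword"].tolist())
```

Words that are dropped this way simply get no distance. Whether that is acceptable depends on how many of your keywords fall outside the model's vocabulary.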
Once this is done, you can calculate the distance between two vectors in a number of ways, for example with Euclidean distance:
from sklearn.metrics.pairwise import euclidean_distances
distances = euclidean_distances(list(df['Vectors']))
distances will be a square matrix with one row and one column per word: entry (i, j) is the distance between word i and word j, so the diagonal is 0. The closer a distance is to 0, the more similar the two words are.
Other models and other distance metrics work too; treat this as a starting point.
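To make the matrix easier to read, it can be wrapped in a DataFrame labelled with the words. A minimal sketch, assuming made-up 2-dimensional vectors chosen so the distances can be checked by hand; the distances are computed directly with NumPy here, which for this input should give the same values as euclidean_distances.

```python
import numpy as np
import pandas as pd

words = ["plant", "cell", "cat"]
# Made-up vectors (not real word2vec output), one row per word.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Pairwise Euclidean distances via broadcasting: diff[i, j] = X[i] - X[j].
diff = X[:, None, :] - X[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# Label rows and columns so each entry can be read as "row word vs column word".
table = pd.DataFrame(distances, index=words, columns=words)
print(table)

# Closest *other* word for each row: hide the zero diagonal, then take the
# column with the smallest distance (ties go to the first column).
off_diagonal = ~np.eye(len(words), dtype=bool)
nearest = table.where(off_diagonal).idxmin(axis=1)
print(nearest)
```

With these toy vectors, "plant" is 5 away from "cell" and 10 away from "cat", so "cell" is its nearest neighbour.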
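As one example of a different metric, cosine distance is commonly used with word embeddings because it compares the direction of the vectors and ignores their length. A minimal sketch with made-up vectors, written in plain NumPy so the formula is visible; cosine_distance_matrix is a hypothetical helper name, and it assumes no vector is all zeros. sklearn.metrics.pairwise.cosine_distances computes the same quantity, and gensim's model.similarity(w1, w2) returns the cosine similarity for a single pair of words.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distance (1 - cosine similarity) between the rows of X.

    Assumes every row has a non-zero norm.
    """
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms              # scale each row to unit length
    return 1.0 - unit @ unit.T    # dot product of unit vectors = cosine similarity

# Made-up vectors: the first two point the same way, the third is orthogonal.
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
D = cosine_distance_matrix(X)
print(D)
```

Here the first two rows have distance 0 even though their lengths differ, while the orthogonal third row has distance 1 from both.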
Source: https://stackoverflow.com/questions/55394673/how-to-find-the-correlation-between-two-strings-in-pandas