I have two data frames that I am trying to merge.
Dataframe A:
col1 col2 sub grade
0 1 34.32 x a
1 1 34.32 x
I had a similar problem where I needed to identify matching rows with thousands of float columns and no identifier. This case is difficult because values can vary slightly due to rounding.
In this case, I used scipy.spatial.distance.cosine to compare rows. Note that it returns the cosine distance, so the cosine similarity is 1 minus that value.
from scipy import spatial

threshold = 0.99999
# row1 and row2 are 1-D numeric arrays (one row from each data frame)
similarity = 1 - spatial.distance.cosine(row1, row2)
if similarity >= threshold:
    pass  # it's a match
else:
    pass  # loop and check another row pair
This won't work if you have duplicate or very similar rows, but it works well when you have a large number of float columns and not too many rows.
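If looping over row pairs gets slow, the same check can be run on all pairs at once. Here is a minimal sketch using scipy.spatial.distance.cdist with metric="cosine"; the helper name match_rows and the sample arrays are made up for illustration, and it assumes both inputs are 2-D float arrays with the same columns in the same order.

```python
import numpy as np
from scipy.spatial.distance import cdist


def match_rows(a, b, threshold=0.99999):
    """Return (i, j) pairs where row i of a and row j of b
    have cosine similarity >= threshold.

    Hypothetical helper, not part of scipy or pandas.
    """
    # cdist returns cosine *distance* for every pair of rows,
    # so 1 - distance is the similarity matrix of shape (len(a), len(b)).
    similarity = 1.0 - cdist(a, b, metric="cosine")
    rows, cols = np.nonzero(similarity >= threshold)
    return list(zip(rows.tolist(), cols.tolist()))


# Made-up sample data: b holds the rows of a in reverse order,
# with small rounding differences.
a = np.array([[1.0, 34.32, 7.5],
              [2.0, 10.0, 3.0]])
b = np.array([[2.0001, 10.0, 3.0],
              [1.0, 34.3201, 7.5]])

pairs = match_rows(a, b)
print(pairs)
```

With a pandas data frame you would pass something like df[float_cols].to_numpy() for each side. The similarity matrix has len(a) x len(b) entries, so this only suits a moderate number of rows. Also keep in mind that cosine similarity ignores magnitude, so a row and a scaled copy of it will count as a match.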