similarity

Text similarity using Word2Vec

你说的曾经没有我的故事 submitted on 2021-02-19 05:36:05
Question: I would like to use Word2Vec to check the similarity of texts. I am currently using another approach:

    from fuzzywuzzy import fuzz

    def sim(name, dataset):
        matches = dataset.apply(lambda row: fuzz.ratio(row['Text'], name) >= 0.5, axis=1)
        return matches

(name is my column). To apply this function I do the following:

    df['Sim'] = df.apply(lambda row: sim(row['Text'], df), axis=1)

Could you please tell me how to replace fuzz.ratio with Word2Vec in order to compare texts in a dataset? Example of dataset:
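
One possible substitution, sketched below under assumptions the question does not state: train a gensim Word2Vec model on the tokenized 'Text' column, represent each text as the average of its word vectors, and compare texts by cosine similarity. The sample DataFrame, the whitespace tokenization, and the way df['Sim'] is filled are all illustrative choices, not the asker's actual setup.

    import numpy as np
    import pandas as pd
    from gensim.models import Word2Vec

    # Hypothetical stand-in for the asker's dataset.
    df = pd.DataFrame({'Text': ['old new gold door',
                                'old view bold door',
                                'new view world window']})

    texts = df['Text'].str.split().tolist()                       # naive whitespace tokenization
    model = Word2Vec(texts, vector_size=50, min_count=1, seed=1)  # gensim 4.x argument names

    def text_vector(tokens):
        # Average the vectors of in-vocabulary tokens; zero vector if none.
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / denom if denom else 0.0

    vectors = [text_vector(t) for t in texts]
    # For each row, the best similarity to any other row (analogous to the fuzz version):
    df['Sim'] = [max(cosine(v, w) for j, w in enumerate(vectors) if j != i)
                 for i, v in enumerate(vectors)]
    print(df)

Averaging word vectors is a crude but common baseline; Doc2Vec or sentence embeddings are the usual next steps when it proves too coarse.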

Optimizing a postgres similarity query (pg_trgm + gin index)

喜夏-厌秋 submitted on 2021-02-17 22:52:16
Question: I have defined the following index:

    CREATE INDEX users_search_idx ON auth_user
    USING gin (username gin_trgm_ops, first_name gin_trgm_ops, last_name gin_trgm_ops);

I am performing the following query:

    PREPARE user_search (TEXT, INT) AS
    SELECT username, email, first_name, last_name,
           -- would probably do per-field weightings here
           (s_username + s_first_name + s_last_name) rank
    FROM auth_user,
         similarity(username, $1)   s_username,
         similarity(first_name, $1) s_first_name,
         similarity(last_name, $1)  s_last_name
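
A sketch of the standard pg_trgm optimization, assuming the goal is to let the GIN index prune candidates: filter with the index-supported % operator (which respects pg_trgm.similarity_threshold) and compute similarity() only to rank the surviving rows. The literal 'john' and the LIMIT are placeholders, not values from the question.

    -- Threshold used by the % operator (default 0.3); tune as needed.
    SET pg_trgm.similarity_threshold = 0.3;

    SELECT username, email, first_name, last_name,
           similarity(username, 'john')
           + similarity(first_name, 'john')
           + similarity(last_name, 'john') AS rank
    FROM auth_user
    WHERE username % 'john'
       OR first_name % 'john'
       OR last_name % 'john'
    ORDER BY rank DESC
    LIMIT 10;

The key point is that similarity() in a lateral position or in ORDER BY alone cannot use the index; only the boolean trigram operators (%, <% and friends) are indexable, so the WHERE clause is what turns a sequential scan into an index scan.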

Computing degree of similarity among a group of sets

对着背影说爱祢 submitted on 2021-02-17 16:58:25
Question: Suppose there are 4 sets:

    s1 = {1, 2, 3, 4}
    s2 = {2, 3, 4}
    s3 = {2, 3, 4, 5}
    s4 = {1, 3, 4, 5}

Is there any standard metric to express the degree of similarity of this group of 4 sets? Thank you for the suggestion of the Jaccard method. However, it seems pairwise. How can I compute the similarity degree of the whole group of sets?

Answer 1: Pairwise, you can compute the Jaccard distance of two sets. It's simply the distance between two sets, if they were vectors of booleans in a space where {1, 2, 3…} are all unit
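
Since the asker wants a single number for the whole group, here is a small sketch of two common aggregations, assuming plain Python sets: the average pairwise Jaccard similarity, and a stricter group-level ratio of elements shared by every set to elements present in any set.

    from itertools import combinations
    from functools import reduce

    def jaccard(a, b):
        # Jaccard similarity: |intersection| / |union|
        return len(a & b) / len(a | b)

    sets = [{1, 2, 3, 4}, {2, 3, 4}, {2, 3, 4, 5}, {1, 3, 4, 5}]

    # 1) Average Jaccard similarity over all pairs of sets.
    pairs = list(combinations(sets, 2))
    avg_pairwise = sum(jaccard(a, b) for a, b in pairs) / len(pairs)

    # 2) Group-level ratio: shared by every set vs. present in any set.
    shared = reduce(set.intersection, sets)
    total = reduce(set.union, sets)
    group_ratio = len(shared) / len(total)

    print(avg_pairwise, group_ratio)   # ~0.617 and 0.4 for the sets above

The two measures answer different questions: the average treats the group as a collection of pairs, while the shared-over-total ratio punishes any element that is missing from even one set.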

Check the similarity between two words with NLTK with Python

﹥>﹥吖頭↗ submitted on 2021-02-17 16:35:38
Question: I have two lists and I want to check the similarity between each word in the two lists and find the maximum similarity. Here is my code:

    from nltk.corpus import wordnet

    list1 = ['Compare', 'require']
    list2 = ['choose', 'copy', 'define', 'duplicate', 'find', 'how', 'identify',
             'label', 'list', 'listen', 'locate', 'match', 'memorise', 'name',
             'observe', 'omit', 'quote', 'read', 'recall', 'recite', 'recognise',
             'record', 'relate', 'remember', 'repeat', 'reproduce', 'retell',
             'select', 'show
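
A sketch of one way to do the pairwise WordNet comparison, assuming Wu-Palmer similarity (wup_similarity) as the measure and a shortened list2 for brevity; the asker may prefer path_similarity, which plugs in the same way.

    from nltk.corpus import wordnet  # requires nltk.download('wordnet') once

    def max_word_similarity(w1, w2):
        # Best Wu-Palmer similarity over all synset pairs; 0.0 if incomparable.
        best = 0.0
        for s1 in wordnet.synsets(w1):
            for s2 in wordnet.synsets(w2):
                score = s1.wup_similarity(s2)
                if score is not None and score > best:
                    best = score
        return best

    list1 = ['Compare', 'require']
    list2 = ['choose', 'copy', 'define', 'duplicate', 'find']  # shortened here

    best_pair = max(((w1, w2) for w1 in list1 for w2 in list2),
                    key=lambda pair: max_word_similarity(*pair))
    print(best_pair, max_word_similarity(*best_pair))

Taking the maximum over all synset pairs matters because most words have several senses, and comparing only the first synset of each word can badly understate their similarity.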

Calculating the similarity of 2 sets of convex polygons?

点点圈 submitted on 2021-02-11 12:20:01
Question: I have generated 2 sets of convex polygons with different algorithms. Every polygon in each set is described by an array of coordinates [n_points, xy_coords], so a square is described by a [4, 2] array but a pentagon with rounded corners by [80, 2], with the extra 75 points describing the curvature. My goal is to quantify how similar the two sets of geometries are. Can anyone recommend any methods of doing so? So far I've come across: Hamming distance, Hausdorff distance. I
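
Of the candidates listed, the Hausdorff distance maps most directly onto the [n_points, 2] arrays described. A minimal sketch using SciPy, assuming the polygons are already aligned in the same coordinate frame (in practice you would likely center and scale them first):

    import numpy as np
    from scipy.spatial.distance import directed_hausdorff

    def hausdorff(poly_a, poly_b):
        # Symmetric Hausdorff distance between two (n_points, 2) arrays.
        return max(directed_hausdorff(poly_a, poly_b)[0],
                   directed_hausdorff(poly_b, poly_a)[0])

    square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
    diamond = np.array([[0.5, 0.0], [1.0, 0.5], [0.5, 1.0], [0.0, 0.5]])
    print(hausdorff(square, diamond))  # 0.5 for these two shapes

Note that Hausdorff compares discrete point sets, so a 4-point square and an 80-point rounded pentagon are only compared fairly if sampling density doesn't matter for your purpose; resampling both outlines to the same number of points is a common normalization.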

How to compute similarity in quanteda between documents for adjacent years only, within groups?

大兔子大兔子 submitted on 2021-02-11 06:17:46
Question: I have a diachronic corpus with texts for different organizations, each for years 1969 to 2019. For each organization, I want to compare the text for 1969 with the text for 1970, 1970 with 1971, etc. Texts for some years are missing. In other words, I have a corpus, cc, which I converted to a dfm. Now I want to use textstat_simil:

    ncsimil <- textstat_simil(dfm.cc,
                              y = NULL,
                              selection = NULL,
                              margin = "documents",
                              method = "jaccard",
                              min_simil = NULL)

This compares every text with every other text,
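
One workable pattern, sketched in R under the assumption that docnames encode organization and year as "org_year" (e.g. "acme_1969"): compute the full similarity object, flatten it to a pairwise data frame, and keep only same-organization pairs whose years differ by one. The as.data.frame() method on a textstat_simil object is available in recent quanteda versions; the regexes below depend on the assumed docname pattern.

    library(quanteda)
    library(quanteda.textstats)   # textstat_simil lives here in quanteda >= 3

    sim <- textstat_simil(dfm.cc, margin = "documents", method = "jaccard")
    pairs <- as.data.frame(sim)   # columns: document1, document2, jaccard

    # Assumed docname pattern "org_year", e.g. "acme_1969".
    pairs$org1  <- sub("_\\d+$", "", pairs$document1)
    pairs$org2  <- sub("_\\d+$", "", pairs$document2)
    pairs$year1 <- as.integer(sub("^.*_", "", pairs$document1))
    pairs$year2 <- as.integer(sub("^.*_", "", pairs$document2))

    # Same organization, adjacent years only; pairs spanning a missing year drop out.
    adjacent <- subset(pairs, org1 == org2 & abs(year1 - year2) == 1)

Computing everything and filtering afterwards is wasteful for a very large corpus, but for 50 years × a modest number of organizations the full matrix is cheap and the filtering step is trivial to audit.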

pandas: calculate Jaccard similarity for every row based on the value in another column

≡放荡痞女 submitted on 2021-02-10 18:17:43
Question: I have a dataframe as follows, only with more rows:

    import pandas as pd

    data = {'First': ['First value', 'Second value', 'Third value'],
            'Second': [['old', 'new', 'gold', 'door'],
                       ['old', 'view', 'bold', 'door'],
                       ['new', 'view', 'world', 'window']]}
    df = pd.DataFrame(data, columns=['First', 'Second'])

To calculate the Jaccard similarity I found this piece online (not my solution):

    def lexical_overlap(doc1, doc2):
        words_doc1 = set(doc1)
        words_doc2 = set(doc2)
        intersection = words_doc1.intersection(words_doc2)
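
A completed version of that helper and one hypothetical way to apply it row-wise; the reference word list below is an assumption, since the question is cut off before showing what each row should be compared against.

    import pandas as pd

    def lexical_overlap(doc1, doc2):
        # Jaccard similarity of the two word lists.
        words_doc1, words_doc2 = set(doc1), set(doc2)
        union = words_doc1 | words_doc2
        return len(words_doc1 & words_doc2) / len(union) if union else 0.0

    data = {'First': ['First value', 'Second value', 'Third value'],
            'Second': [['old', 'new', 'gold', 'door'],
                       ['old', 'view', 'bold', 'door'],
                       ['new', 'view', 'world', 'window']]}
    df = pd.DataFrame(data)

    reference = ['old', 'new', 'door']   # hypothetical comparison set
    df['Jaccard'] = df['Second'].apply(lambda words: lexical_overlap(words, reference))
    print(df)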