similarity

Text similarity using Word2Vec

你说的曾经没有我的故事 submitted on 2021-02-19 05:36:05
Question: I would like to use Word2Vec to check the similarity of texts. I am currently using another approach:

    from fuzzywuzzy import fuzz

    def sim(name, dataset):
        matches = dataset.apply(lambda row: fuzz.ratio(row['Text'], name) >= 0.5, axis=1)
        return matches

(name is my column). To apply this function I do the following:

    df['Sim'] = df.apply(lambda row: sim(row['Text'], df), axis=1)

Could you please tell me how to replace fuzz.ratio with Word2Vec in order to compare texts in a dataset? Example of dataset:
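
One possible substitution, sketched below under assumptions the question does not state: train a gensim Word2Vec model on the tokenized 'Text' column, represent each text as the average of its word vectors, and compare texts by cosine similarity. The sample DataFrame, the whitespace tokenization, and the way df['Sim'] is filled are all illustrative choices, not the asker's actual setup.

    import numpy as np
    import pandas as pd
    from gensim.models import Word2Vec

    # Hypothetical stand-in for the asker's dataset.
    df = pd.DataFrame({'Text': ['old new gold door',
                                'old view bold door',
                                'new view world window']})

    texts = df['Text'].str.split().tolist()                       # naive whitespace tokenization
    model = Word2Vec(texts, vector_size=50, min_count=1, seed=1)  # gensim 4.x argument names

    def text_vector(tokens):
        # Average the vectors of in-vocabulary tokens; zero vector if none.
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b) / denom if denom else 0.0

    vectors = [text_vector(t) for t in texts]
    # For each row, the best similarity to any other row (analogous to the fuzz version):
    df['Sim'] = [max(cosine(v, w) for j, w in enumerate(vectors) if j != i)
                 for i, v in enumerate(vectors)]
    print(df)

Averaging word vectors is a crude but common baseline; Doc2Vec or sentence embeddings are the usual next steps when it proves too coarse.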

Optimizing a postgres similarity query (pg_trgm + gin index)

喜夏-厌秋 submitted on 2021-02-17 22:52:16
Question: I have defined the following index:

    CREATE INDEX users_search_idx ON auth_user
    USING gin (username gin_trgm_ops, first_name gin_trgm_ops, last_name gin_trgm_ops);

I am performing the following query:

    PREPARE user_search (TEXT, INT) AS
    SELECT username, email, first_name, last_name,
           -- would probably do per-field weightings here
           (s_username + s_first_name + s_last_name) rank
    FROM auth_user,
         similarity(username, $1)   s_username,
         similarity(first_name, $1) s_first_name,
         similarity(last_name, $1)  s_last_name
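
A sketch of the standard pg_trgm optimization, assuming the goal is to let the GIN index prune candidates: filter with the index-supported % operator (which respects pg_trgm.similarity_threshold) and compute similarity() only to rank the surviving rows. The literal 'john' and the LIMIT are placeholders, not values from the question.

    -- Threshold used by the % operator (default 0.3); tune as needed.
    SET pg_trgm.similarity_threshold = 0.3;

    SELECT username, email, first_name, last_name,
           similarity(username, 'john')
           + similarity(first_name, 'john')
           + similarity(last_name, 'john') AS rank
    FROM auth_user
    WHERE username % 'john'
       OR first_name % 'john'
       OR last_name % 'john'
    ORDER BY rank DESC
    LIMIT 10;

The key point is that similarity() in a lateral position or in ORDER BY alone cannot use the index; only the boolean trigram operators (%, <% and friends) are indexable, so the WHERE clause is what turns a sequential scan into an index scan.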

Computing degree of similarity among a group of sets

对着背影说爱祢 submitted on 2021-02-17 16:58:25
Question: Suppose there are 4 sets:

    s1 = {1, 2, 3, 4}
    s2 = {2, 3, 4}
    s3 = {2, 3, 4, 5}
    s4 = {1, 3, 4, 5}

Is there any standard metric to express the degree of similarity of this group of 4 sets? Thank you for the suggestion of the Jaccard method. However, it seems pairwise. How can I compute the similarity degree of the whole group of sets?

Answer 1: Pairwise, you can compute the Jaccard distance of two sets. It's simply the distance between two sets, if they were vectors of booleans in a space where {1, 2, 3…} are all unit
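
Since the asker wants a single number for the whole group, here is a small sketch of two common aggregations, assuming plain Python sets: the average pairwise Jaccard similarity, and a stricter group-level ratio of elements shared by every set to elements present in any set.

    from itertools import combinations
    from functools import reduce

    def jaccard(a, b):
        # Jaccard similarity: |intersection| / |union|
        return len(a & b) / len(a | b)

    sets = [{1, 2, 3, 4}, {2, 3, 4}, {2, 3, 4, 5}, {1, 3, 4, 5}]

    # 1) Average Jaccard similarity over all pairs of sets.
    pairs = list(combinations(sets, 2))
    avg_pairwise = sum(jaccard(a, b) for a, b in pairs) / len(pairs)

    # 2) Group-level ratio: shared by every set vs. present in any set.
    shared = reduce(set.intersection, sets)
    total = reduce(set.union, sets)
    group_ratio = len(shared) / len(total)

    print(avg_pairwise, group_ratio)   # ~0.617 and 0.4 for the sets above

The two measures answer different questions: the average treats the group as a collection of pairs, while the shared-over-total ratio punishes any element that is missing from even one set.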

Check the similarity between two words with NLTK with Python

﹥>﹥吖頭↗ submitted on 2021-02-17 16:35:38
Question: I have two lists and I want to check the similarity between each word in the two lists and find the maximum similarity. Here is my code:

    from nltk.corpus import wordnet

    list1 = ['Compare', 'require']
    list2 = ['choose', 'copy', 'define', 'duplicate', 'find', 'how', 'identify',
             'label', 'list', 'listen', 'locate', 'match', 'memorise', 'name',
             'observe', 'omit', 'quote', 'read', 'recall', 'recite', 'recognise',
             'record', 'relate', 'remember', 'repeat', 'reproduce', 'retell',
             'select', 'show
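
A sketch of one way to do the pairwise WordNet comparison, assuming Wu-Palmer similarity (wup_similarity) as the measure and a shortened list2 for brevity; the asker may prefer path_similarity, which plugs in the same way.

    from nltk.corpus import wordnet  # requires nltk.download('wordnet') once

    def max_word_similarity(w1, w2):
        # Best Wu-Palmer similarity over all synset pairs; 0.0 if incomparable.
        best = 0.0
        for s1 in wordnet.synsets(w1):
            for s2 in wordnet.synsets(w2):
                score = s1.wup_similarity(s2)
                if score is not None and score > best:
                    best = score
        return best

    list1 = ['Compare', 'require']
    list2 = ['choose', 'copy', 'define', 'duplicate', 'find']  # shortened here

    best_pair = max(((w1, w2) for w1 in list1 for w2 in list2),
                    key=lambda pair: max_word_similarity(*pair))
    print(best_pair, max_word_similarity(*best_pair))

Taking the maximum over all synset pairs matters because most words have several senses, and comparing only the first synset of each word can badly understate their similarity.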

Calculating the similarity of 2 sets of convex polygons?

点点圈 submitted on 2021-02-11 12:20:01
Question: I have generated 2 sets of convex polygons with different algorithms. Every polygon in each set is described by an array of coordinates [n_points, xy_coords], so a square is described by a [4, 2] array but a pentagon with rounded corners by [80, 2], with the extra 75 points describing the curvature. My goal is to quantify how similar the two sets of geometries are. Can anyone recommend any methods of doing so? So far I've come across: Hamming distance, Hausdorff distance. I
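
Of the candidates listed, the Hausdorff distance maps most directly onto the [n_points, 2] arrays described. A minimal sketch using SciPy, assuming the polygons are already aligned in the same coordinate frame (in practice you would likely center and scale them first):

    import numpy as np
    from scipy.spatial.distance import directed_hausdorff

    def hausdorff(poly_a, poly_b):
        # Symmetric Hausdorff distance between two (n_points, 2) arrays.
        return max(directed_hausdorff(poly_a, poly_b)[0],
                   directed_hausdorff(poly_b, poly_a)[0])

    square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
    diamond = np.array([[0.5, 0.0], [1.0, 0.5], [0.5, 1.0], [0.0, 0.5]])
    print(hausdorff(square, diamond))  # 0.5 for these two shapes

Note that Hausdorff compares discrete point sets, so a 4-point square and an 80-point rounded pentagon are only compared fairly if sampling density doesn't matter for your purpose; resampling both outlines to the same number of points is a common normalization.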

How to compute similarity in quanteda between documents for adjacent years only, within groups?

大兔子大兔子 submitted on 2021-02-11 06:17:46
Question: I have a diachronic corpus with texts for different organizations, each for years 1969 to 2019. For each organization, I want to compare the text for 1969 with the text for 1970, 1970 with 1971, etc. Texts for some years are missing. In other words, I have a corpus, cc, which I converted to a dfm. Now I want to use textstat_simil:

    ncsimil <- textstat_simil(dfm.cc,
                              y = NULL,
                              selection = NULL,
                              margin = "documents",
                              method = "jaccard",
                              min_simil = NULL)

This compares every text with every other text,
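
One workable pattern, sketched in R under the assumption that docnames encode organization and year as "org_year" (e.g. "acme_1969"): compute the full similarity object, flatten it to a pairwise data frame, and keep only same-organization pairs whose years differ by one. The as.data.frame() method on a textstat_simil object is available in recent quanteda versions; the regexes below depend on the assumed docname pattern.

    library(quanteda)
    library(quanteda.textstats)   # textstat_simil lives here in quanteda >= 3

    sim <- textstat_simil(dfm.cc, margin = "documents", method = "jaccard")
    pairs <- as.data.frame(sim)   # columns: document1, document2, jaccard

    # Assumed docname pattern "org_year", e.g. "acme_1969".
    pairs$org1  <- sub("_\\d+$", "", pairs$document1)
    pairs$org2  <- sub("_\\d+$", "", pairs$document2)
    pairs$year1 <- as.integer(sub("^.*_", "", pairs$document1))
    pairs$year2 <- as.integer(sub("^.*_", "", pairs$document2))

    # Same organization, adjacent years only; pairs spanning a missing year drop out.
    adjacent <- subset(pairs, org1 == org2 & abs(year1 - year2) == 1)

Computing everything and filtering afterwards is wasteful for a very large corpus, but for 50 years × a modest number of organizations the full matrix is cheap and the filtering step is trivial to audit.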

pandas: calculate Jaccard similarity for every row based on the value in another column

≡放荡痞女 submitted on 2021-02-10 18:17:43
Question: I have a dataframe as follows, only with more rows:

    import pandas as pd

    data = {'First': ['First value', 'Second value', 'Third value'],
            'Second': [['old', 'new', 'gold', 'door'],
                       ['old', 'view', 'bold', 'door'],
                       ['new', 'view', 'world', 'window']]}
    df = pd.DataFrame(data, columns=['First', 'Second'])

To calculate the Jaccard similarity I found this piece online (not my solution):

    def lexical_overlap(doc1, doc2):
        words_doc1 = set(doc1)
        words_doc2 = set(doc2)
        intersection = words_doc1.intersection(words_doc2)
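
A completed version of that helper and one hypothetical way to apply it row-wise; the reference word list below is an assumption, since the question is cut off before showing what each row should be compared against.

    import pandas as pd

    def lexical_overlap(doc1, doc2):
        # Jaccard similarity of the two word lists.
        words_doc1, words_doc2 = set(doc1), set(doc2)
        union = words_doc1 | words_doc2
        return len(words_doc1 & words_doc2) / len(union) if union else 0.0

    data = {'First': ['First value', 'Second value', 'Third value'],
            'Second': [['old', 'new', 'gold', 'door'],
                       ['old', 'view', 'bold', 'door'],
                       ['new', 'view', 'world', 'window']]}
    df = pd.DataFrame(data)

    reference = ['old', 'new', 'door']   # hypothetical comparison set
    df['Jaccard'] = df['Second'].apply(lambda words: lexical_overlap(words, reference))
    print(df)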