similarity

How to group text data based on document similarity?

落爺英雄遲暮 submitted on 2019-12-08 06:30:09
Question: Consider a dataframe like the one below:

df = pd.DataFrame({'Questions': ['What are you doing?', 'What are you doing tonight?',
                                 'What are you doing now?', 'What is your name?',
                                 'What is your nick name?', 'What is your full name?',
                                 'Shall we meet?', 'How are you doing?']})

                     Questions
0          What are you doing?
1  What are you doing tonight?
2      What are you doing now?
3           What is your name?
4      What is your nick name?
5      What is your full name?
6               Shall we meet?
7           How are you doing?

How to group the dataframe by the similarity of the questions?
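A minimal sketch of one common route (my illustration, not the thread's accepted answer), assuming scikit-learn is available: turn each question into a TF-IDF vector and cluster on cosine distance. The 0.7 threshold is an illustrative guess that needs tuning on real data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

df = pd.DataFrame({'Questions': ['What are you doing?', 'What are you doing tonight?',
                                 'What are you doing now?', 'What is your name?',
                                 'What is your nick name?', 'What is your full name?',
                                 'Shall we meet?', 'How are you doing?']})

# Turn each question into a TF-IDF vector.
X = TfidfVectorizer().fit_transform(df['Questions'])

# Cluster without fixing the number of groups: merge questions whose cosine
# distance stays below the threshold. ('metric' is named 'affinity' in
# scikit-learn versions before 1.2.)
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=0.7,
                                    metric='cosine', linkage='average')
df['group'] = clusterer.fit_predict(X.toarray())
print(df.sort_values('group'))
```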

How to measure Syntactic Similarity between a query and a document?

99封情书 submitted on 2019-12-08 05:27:10
Question: Is there a way to measure the syntactic similarity between a query (a sentence) and a document (a set of sentences)? Answer 1: Have you considered using deep linguistic processing tools that involve deep grammars such as HPSG and LFG? If you're looking into feature-based syntactic similarity, you can take a look at Kenji Sagae and Andrew S. Gordon's work on calculating the syntactic similarity of verbs using PropBank and then clustering the similar verbs to improve an HPSG grammar. For a simpler approach, I
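One simple feature-based direction (my own sketch, not what the answer goes on to describe): strip the words away and compare part-of-speech trigram sets with Jaccard overlap, scoring the query against each sentence of the document.

```python
import nltk

# Resource names vary slightly across NLTK versions; on NLTK >= 3.9 they are
# 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

def pos_ngrams(text, n=3):
    # Keep only the POS tag sequence, then collect its n-grams.
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}

def syntactic_similarity(query, document):
    # Score the query against each sentence and keep the best match.
    q = pos_ngrams(query)
    best = 0.0
    for sentence in nltk.sent_tokenize(document):
        s = pos_ngrams(sentence)
        if q or s:
            best = max(best, len(q & s) / len(q | s))
    return best

print(syntactic_similarity("What are you doing?", "Where are you going? I am late."))
```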

How does Locality Sensitive Hashing (LSH) work?

陌路散爱 submitted on 2019-12-08 02:49:54
Question: I've already read this question, but unfortunately it didn't help. What I don't understand is what we do once we know which bucket our high-dimensional query vector q is assigned to: suppose that using our family of locality-sensitive hash functions h_1, h_2, ..., h_n we have translated q into a low-dimensional (n-dimensional) hash code c. Then c is the index of the bucket that q is assigned to, where (hopefully) its nearest neighbors are also assigned; say that there are 100 vectors
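What happens after the bucketing step is exactly the point of LSH: the bucket shrinks the candidate set, and the exact comparison is then run only on those few candidates. A minimal sketch (my own illustration, not from the linked question) using random-hyperplane hashing for cosine similarity:

```python
# Random-hyperplane LSH: each hash bit is the sign of a dot product with a
# random hyperplane; the n-bit code c names the bucket.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, n_bits = 64, 4                            # data dimension, code length
planes = rng.standard_normal((n_bits, d))    # plays the role of h_1 ... h_n

def hash_code(v):
    return tuple((planes @ v > 0).astype(int))

# Index: every vector lands in the bucket named by its hash code.
data = rng.standard_normal((1000, d))
buckets = defaultdict(list)
for i, v in enumerate(data):
    buckets[hash_code(v)].append(i)

# Query: hash q, pull only its bucket, then rank those candidates exactly.
q = rng.standard_normal(d)
candidates = buckets[hash_code(q)]
candidates.sort(key=lambda i: -(data[i] @ q) /
                (np.linalg.norm(data[i]) * np.linalg.norm(q)))
print(f"ranked {len(candidates)} candidates instead of {len(data)}")
```

In practice one uses more bits and several independent hash tables, so a true neighbor missed by one table is still caught by another.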

Appropriate similarity metrics for multiple sets of 2D coordinates

感情迁移 submitted on 2019-12-08 01:14:12
Question: I have a collection of 2D coordinate sets (on the scale of 100K-500K points in each set) and I am looking for the most efficient way to measure the similarity of one set to another. I know the usual ones: cosine, Jaccard/Tanimoto, etc. However, I am hoping for suggestions on any fast/efficient measures of similarity, especially ones that can be used to cluster by similarity. Edit 1: The image shows what I need to do. I need to cluster all the reds, blues and greens by their shape/orientation,
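One concrete candidate (my suggestion, not from the thread) is the Hausdorff distance, which compares two point sets by their worst-case nearest-point gap and is sensitive to shape and orientation. SciPy ships a directed variant with an early-break algorithm; at 100K-500K points per set you may still want to subsample.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two (N, 2) coordinate arrays."""
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

rng = np.random.default_rng(1)
set_a = rng.random((1000, 2))
set_b = set_a + rng.normal(scale=0.01, size=set_a.shape)  # a jittered copy
set_c = rng.random((1000, 2))                              # unrelated points

print(hausdorff(set_a, set_b))  # small: nearly the same shape
print(hausdorff(set_a, set_c))  # larger: different shapes
```

The resulting pairwise distance matrix can be fed straight into any clustering method that accepts precomputed distances.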

How does Stack Overflow display similar questions when you type in a new question?

不想你离开。 submitted on 2019-12-08 00:38:55
Question: This is one of the things that Stack Overflow, and the rest of the sites that run on this platform, do very well. As soon as you try to create a new question, a small window appears showing other similar questions. How is this done? What technology can be used to achieve this? Lucene, Sphinx, ...? Answer 1: Stack Overflow (and Stack Exchange in general) uses Lucene.net for full-text search. You might want to read this as well. Source: https://stackoverflow.com/questions/5208130/how-does-stack-overflow
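The answer names the real technology (Lucene.net); as a hedged illustration of the underlying idea only, here is a toy "similar questions" lookup using TF-IDF and cosine scores in Python:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

titles = [
    "How to group text data based on document similarity?",
    "How does Locality Sensitive Hashing (LSH) work?",
    "What is an efficient way to measure similarity between two strings?",
]
vec = TfidfVectorizer().fit(titles)
index = vec.transform(titles)        # the "index" of existing questions

draft = "efficient string similarity measure"
# TF-IDF vectors are L2-normalized, so the linear kernel is cosine similarity.
scores = linear_kernel(vec.transform([draft]), index).ravel()
for i in scores.argsort()[::-1][:2]:  # top two "similar questions"
    print(f"{scores[i]:.2f}  {titles[i]}")
```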

Similarity function for Mahout boolean user-based recommender

柔情痞子 submitted on 2019-12-07 22:38:01
Question: I am using Mahout to build a user-based recommendation system that operates on boolean data. I use GenericBooleanPrefUserBasedRecommender and NearestNUserNeighborhood, and am now trying to decide on the most suitable user similarity function. It was suggested to use either LogLikelihoodSimilarity or TanimotoCoefficientSimilarity. I tried both and am getting [subjectively evaluated] meaningful results in both cases. However, the RMSE rating for the same data set is better with LogLikelihood. The
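For intuition about what the two candidates measure on boolean data, here is a rough Python illustration (mine, not Mahout's code): Tanimoto is plain set overlap, while the log-likelihood ratio asks how surprising the overlap is given how active each user is. The LLR form below follows Dunning's G² statistic, which is what Mahout's LogLikelihoodSimilarity is built on.

```python
from math import log

def tanimoto(a, b):
    # |intersection| / |union| of the two users' "liked" item sets.
    return len(a & b) / len(a | b)

def entropy(*counts):
    # Unnormalized Shannon entropy of a list of counts.
    total = sum(counts)
    return -sum(c * log(c / total) for c in counts if c > 0)

def log_likelihood_ratio(a, b, n_items):
    # 2x2 contingency counts: both liked, only a, only b, neither.
    k11 = len(a & b)
    k12 = len(a - b)
    k21 = len(b - a)
    k22 = n_items - k11 - k12 - k21
    return 2 * (entropy(k11 + k12, k21 + k22)     # row entropy
                + entropy(k11 + k21, k12 + k22)   # column entropy
                - entropy(k11, k12, k21, k22))    # matrix entropy

alice = {1, 2, 3, 4}
bob = {2, 3, 4, 5}
print(tanimoto(alice, bob))                        # 0.6
print(log_likelihood_ratio(alice, bob, n_items=100))
```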

How can I write a SPARQL query that uses similarity measures in Java code?

无人久伴 submitted on 2019-12-07 18:28:27
Question: I would like to know a simple way to write this SPARQL query in Java code:

select ?input ?string (strlen(?match)/strlen(?string) as ?percent)
where {
  values ?string { "London" "Londn" "London Fog" "Lando" "Land Ho!"
                   "concatenate" "catnap" "hat" "cat" "chat" "chart"
                   "port" "part" }
  values (?input ?pattern ?replacement) {
    ("cat"   "^x[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$" "$1$2$3")
    ("Londn" "^x[^Londn]*([L]?)[^ondn]*([o]?)[^ndn]*([n]?)[^dn]*([d]?)[^n]*([n]?).*$" "$1$2$3$4$5")
  }
  bind( replace
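The thread asks about Java, where Apache Jena's QueryFactory/QueryExecutionFactory would be the usual route; to stay in one language here, below is a hedged Python sketch using rdflib instead. The BIND line is my reconstruction of the truncated part of the query (prefix the string with "x", apply the pattern, keep the captured letters) and may differ from the original; the VALUES lists are trimmed for brevity.

```python
import rdflib

query = """
SELECT ?string ((STRLEN(?match) / STRLEN(?string)) AS ?percent)
WHERE {
  VALUES ?string { "London" "Londn" "cat" "chat" "port" }
  VALUES (?pattern ?replacement) {
    ("^x[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$" "$1$2$3")
  }
  # Reconstructed bind: extract the matched letters, then score by length.
  BIND(REPLACE(CONCAT("x", ?string), ?pattern, ?replacement) AS ?match)
}
ORDER BY DESC(?percent)
"""

g = rdflib.Graph()            # empty graph: all data lives in VALUES blocks
for row in g.query(query):
    print(row.string, row.percent)
```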

What is an efficient way to measure similarity between two strings? (Levenshtein Distance makes stack too deep)

ぐ巨炮叔叔 submitted on 2019-12-07 15:50:23
Question: So, I started with this: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Ruby which works great for really small strings. But my strings can be upwards of 10,000 characters long, and since that Levenshtein distance implementation is recursive, it causes a "stack too deep" error in my Ruby on Rails app. So, is there another, maybe less stack-intensive, method of finding the similarity between two large strings? Alternatively, I'd need a way to make the stack have much
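The standard fix is to drop recursion entirely: the iterative dynamic-programming version below keeps only two rows of the edit-distance matrix, so it uses no call stack and O(min(m, n)) memory. Shown in Python as a sketch; the same structure ports line-for-line to Ruby, though for 10,000-character strings a C-backed library will be much faster than any pure-interpreter loop.

```python
def levenshtein(a: str, b: str) -> int:
    if len(a) < len(b):
        a, b = b, a                     # keep the shorter string in the row
    previous = list(range(len(b) + 1))  # distances from the empty prefix
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```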

Normalizing by max value or by total value?

﹥>﹥吖頭↗ submitted on 2019-12-07 07:10:34
Question: I'm doing some work that involves document comparison. To do this, I'm analyzing each document and basically counting the number of times some keywords appear in each of them. For instance:

Document 1:      Document 2:
Book   -> 3      Book   -> 9
Work   -> 0      Work   -> 2
Dollar -> 5      Dollar -> 1
City   -> 18     City   -> 6

So after the counting process, I store this sequence of numbers in a vector. This sequence of numbers represents the feature vector for each document. Document 1: [ 3, 0, 5, 18
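One point worth checking directly: if the downstream comparison is cosine similarity, the two normalizations give identical results, because cosine ignores the overall scale of a vector. A small sketch (mine, using the counts from the example):

```python
import numpy as np

doc1 = np.array([3, 0, 5, 18], dtype=float)   # Book, Work, Dollar, City
doc2 = np.array([9, 2, 1, 6], dtype=float)

by_max = doc1 / doc1.max()      # max-normalized: largest count becomes 1
by_total = doc1 / doc1.sum()    # total-normalized: entries sum to 1

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Same cosine either way: both are rescalings of the same direction.
print(cosine(by_max, doc2), cosine(by_total, doc2), cosine(doc1, doc2))
```

For scale-sensitive measures such as Euclidean distance, the choice does matter, so it depends on which comparison follows the normalization.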

Explicit Semantic Analysis

若如初见. submitted on 2019-12-07 05:52:12
Question: I came across a term called 'Explicit Semantic Analysis', which uses Wikipedia as a reference, finds the similarity between documents and categorizes them into classes (correct me if I am wrong). The link I came across is here. I wanted to learn more about it. Please help me out with it! Answer 1: Explicit semantic analysis works along similar lines to semantic similarity. I got hold of this link, which provides a clear example of ESA. Source: https://stackoverflow.com/questions/8707624/explicit
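Roughly, ESA represents a text as a vector of its similarities to a fixed set of "concept" documents (Wikipedia articles) and compares texts in that concept space rather than in raw word space. A toy sketch (my illustration; real ESA uses millions of Wikipedia concepts and an inverted index):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

concepts = {                      # stand-ins for Wikipedia articles
    "Finance": "bank money market stock dollar investment interest",
    "Geography": "river bank mountain city country border terrain",
}
vec = TfidfVectorizer().fit(list(concepts.values()))
concept_matrix = vec.transform(concepts.values())

def esa_vector(text):
    # Similarity of the text to every concept article = its concept vector.
    return cosine_similarity(vec.transform([text]), concept_matrix)

a = esa_vector("the dollar gained against the market")
b = esa_vector("investment interest in the stock market")
c = esa_vector("the river cut through the mountain terrain")
print(cosine_similarity(a, b))   # high: both map onto the Finance concept
print(cosine_similarity(a, c))   # low: different concepts
```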