similarity

Compare 5000 strings with PHP Levenshtein

戏子无情 submitted on 2019-12-03 03:13:34
Question: I have 5000, sometimes more, street address strings in an array. I'd like to compare them all with levenshtein to find similar matches. How can I do this without looping through all 5000 and comparing them directly with every other 4999?

Edit: I am also interested in alternate methods if anyone has suggestions. The overall goal is to find similar entries (and eliminate duplicates) based on user-submitted street addresses.

Answer 1: I think a better way to group similar addresses would be to: …
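The answer is cut off above; here is a minimal Python sketch of the general "blocking" idea it points toward (bucket first, compare only within buckets). The key() function and the 0.8 threshold are illustrative assumptions, not the answer's actual steps:

    # Bucket addresses by a cheap key, then run the expensive pairwise
    # comparison only inside each bucket instead of across all 5000.
    from collections import defaultdict
    from difflib import SequenceMatcher

    addresses = [
        "123 Main St", "123 Main Street", "45 Oak Ave", "45 Oak Avenue",
    ]

    def key(addr: str) -> str:
        # Crude blocking key (an assumption): house number + first word.
        tokens = addr.lower().split()
        digits = "".join(t for t in tokens if t.isdigit())
        words = [t for t in tokens if not t.isdigit()]
        return digits + (words[0] if words else "")

    buckets = defaultdict(list)
    for addr in addresses:
        buckets[key(addr)].append(addr)

    dupes = []
    for group in buckets.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if SequenceMatcher(None, a, b).ratio() > 0.8:
                    dupes.append((a, b))

    print(dupes)

Within-bucket comparisons keep the quadratic cost bounded by the bucket size rather than the whole array.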

Similarity function in Postgres with pg_trgm

Anonymous (unverified) submitted on 2019-12-03 02:47:02
Question: I'm trying to use the similarity function in Postgres to do some fuzzy text matching, however whenever I try to use it I get the error: function similarity(character varying, unknown) does not exist. If I add explicit casts to text I get the error: function similarity(text, text) does not exist. My query is:

    SELECT (similarity("table"."field"::text, %s::text)) AS "similarity", "table".*
    FROM "table"
    WHERE similarity > .5
    ORDER BY "similarity" DESC
    LIMIT 10

Do I need to do something to initialize pg_trgm?

Answer 1: You have to install pg_trgm. In …
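For reference, a minimal sketch of the fix in Python with psycopg2, assuming a database where you have privileges to create extensions; the table and column names are the placeholders from the question. Note that Postgres does not allow a SELECT alias in a WHERE clause, so the similarity() call is repeated there:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        # pg_trgm must be installed per-database before similarity() exists.
        cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")
        cur.execute(
            """
            SELECT similarity("table"."field", %s) AS sim, "table".*
            FROM "table"
            WHERE similarity("table"."field", %s) > 0.5
            ORDER BY sim DESC
            LIMIT 10
            """,
            ("search text", "search text"),
        )
        print(cur.fetchall())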

Calculate Cosine Similarity Spark Dataframe

Anonymous (unverified) submitted on 2019-12-03 02:28:01
Question: I am using Spark Scala to calculate cosine similarity between the Dataframe rows. The dataframe schema is:

    root
     |-- SKU: double (nullable = true)
     |-- Features: vector (nullable = true)

A sample of the dataframe:

    +-------+--------------------+
    |    SKU|            Features|
    +-------+--------------------+
    | 9970.0|[4.7143, 0.0, 5.785...|
    |19676.0|[5.5, 0.0, 6.4286, 4...|
    | 3296.0|[4.7143, 1.4286, 6....|
    |13658.0|[6.2857, 0.7143, 4....|
    |    1.0|[4.2308, 0.7692, 5....|
    |  513.0|[3.0, 0.0, 4.9091, 5...|
    …
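The excerpt ends before any answer. As a sketch of one way to do this, written in PySpark rather than the question's Scala: L2-normalize the vectors, since the dot product of two unit vectors is exactly their cosine similarity. Here df stands for the dataframe shown above:

    from pyspark.sql import functions as F
    from pyspark.ml.feature import Normalizer

    # df: DataFrame with columns SKU (double) and Features (ml Vector).
    norm = Normalizer(inputCol="Features", outputCol="unit", p=2.0)
    data = norm.transform(df)

    # Dot product of L2-normalized vectors == cosine similarity.
    dot = F.udf(lambda a, b: float(a.dot(b)), "double")

    pairs = (data.alias("a").crossJoin(data.alias("b"))
             .where(F.col("a.SKU") < F.col("b.SKU"))   # each pair once
             .select(F.col("a.SKU").alias("sku_a"),
                     F.col("b.SKU").alias("sku_b"),
                     dot("a.unit", "b.unit").alias("cosine")))
    pairs.show()

The cross join is quadratic, which is workable for thousands of SKUs; for much larger data an approximate (LSH-style) approach would be the usual escape hatch.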

How do I calculate the shortest path (geodesic) distance between two adjectives in WordNet using Python NLTK?

Anonymous (unverified) submitted on 2019-12-03 02:05:01
Question: Computing the semantic similarity between two synsets in WordNet can be easily done with several built-in similarity measures, such as:

    synset1.path_similarity(synset2)
    synset1.lch_similarity(synset2)   # Leacock-Chodorow Similarity
    synset1.wup_similarity(synset2)   # Wu-Palmer Similarity

(as seen here) However, all of these exploit WordNet's taxonomic relations, which are relations for nouns and verbs. Adjectives and adverbs are related via synonymy, antonymy and pertainyms. How can one measure the distance (number of hops) between two …
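The question is truncated; below is a hedged Python sketch (not taken from the excerpt) of counting hops with a breadth-first search over the relation types adjectives actually have in NLTK's WordNet interface:

    # Requires: nltk.download("wordnet")
    from collections import deque
    from nltk.corpus import wordnet as wn

    def neighbors(s):
        # Relations available for adjectives: similar-to, also-see,
        # and antonymy (a lemma-level relation, hence the lemma hop).
        out = set(s.similar_tos()) | set(s.also_sees())
        for lemma in s.lemmas():
            out |= {a.synset() for a in lemma.antonyms()}
        return out

    def hops(s1, s2):
        seen, frontier = {s1}, deque([(s1, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if node == s2:
                return depth
            for n in neighbors(node):
                if n not in seen:
                    seen.add(n)
                    frontier.append((n, depth + 1))
        return None  # no path through these relations

    good = wn.synsets("good", pos=wn.ADJ)[0]
    bad = wn.synsets("bad", pos=wn.ADJ)[0]
    print(hops(good, bad))  # 1, via the antonym edge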

Levenshtein distance: how to better handle words swapping positions?

非 Y 不嫁゛ submitted on 2019-12-03 01:04:50
Question: I've had some success comparing strings using the PHP levenshtein function. However, for two strings which contain substrings that have swapped positions, the algorithm counts those as whole new substrings. For example:

    levenshtein("The quick brown fox", "brown quick The fox"); // 10 differences

are treated as having less in common than:

    levenshtein("The quick brown fox", "The quiet swine flu"); // 9 differences

I'd prefer an algorithm which saw that the first two were more similar. How could …
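One common workaround (an assumption here, not the thread's accepted answer) is to sort the tokens before comparing, so position swaps stop registering as edits. A minimal Python sketch using the standard library:

    from difflib import SequenceMatcher

    def token_sort_ratio(a: str, b: str) -> float:
        # Sort words first so "brown quick The fox" normalizes to the
        # same token sequence as "The quick brown fox".
        norm = lambda s: " ".join(sorted(s.lower().split()))
        return SequenceMatcher(None, norm(a), norm(b)).ratio()

    print(token_sort_ratio("The quick brown fox", "brown quick The fox"))  # 1.0
    print(token_sort_ratio("The quick brown fox", "The quiet swine flu"))  # ~0.5

This ranks the first pair as more similar, as the question wants.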

Compare similarity algorithms

▼魔方 西西 submitted on 2019-12-03 00:41:19
Question: I want to use string similarity functions to find corrupted data in my database. I came upon several of them: Jaro, Jaro-Winkler, Levenshtein, Euclidean and Q-gram. I wanted to know what is the difference between them and in what situations they work best?

Answer 1: Expanding on my wiki-walk comment in the errata and noting some of the ground-floor literature on the comparability of algorithms that apply to similar problem spaces, let's explore the applicability of these algorithms before we …
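The answer is cut off; for a quick hands-on feel of how the edit-based metrics differ, here is a small demo assuming the third-party jellyfish package (function names have varied slightly across its releases):

    import jellyfish

    a, b = "MARTHA", "MARHTA"  # the classic transposition example

    # Plain Levenshtein sees the TH/HT swap as two substitutions.
    print(jellyfish.levenshtein_distance(a, b))          # 2
    # Damerau-Levenshtein counts a transposition as one edit.
    print(jellyfish.damerau_levenshtein_distance(a, b))  # 1
    # Jaro-Winkler boosts strings sharing a common prefix.
    print(jellyfish.jaro_winkler_similarity(a, b))       # ~0.96

Roughly: Levenshtein-family distances suit typo-level corruption, Jaro and Jaro-Winkler were designed for short name fields, and Q-gram methods tolerate larger rearrangements.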

Algorithm to find related words in a text

无人久伴 submitted on 2019-12-03 00:33:58
I would like to have a word (e.g. "Apple") and process a text (or maybe more). I'd like to come up with related terms. For example: process a document for Apple and find that iPod, iPhone and Mac are terms related to "Apple". Any idea on how to solve this?

As a starting point: your question relates to text mining. There are two ways: a statistical approach, and one from natural language processing (NLP). I do not know much about NLP, but can say something about the statistical approach: You need some vector space representation of your documents, see http://en.wikipedia.org/wiki/Vector_space …
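A minimal sketch of that statistical, vector-space approach, assuming scikit-learn: build a document-term matrix, then rank terms by the cosine similarity of their term columns against the query term's column. The toy documents are illustrative only:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "Apple released a new iPhone and updated the iPod line",
        "The Mac and the iPhone share an Apple chip",
        "Bananas and oranges are fruit",
    ]
    # Drop English stopwords so "the" and "and" don't dominate the ranking.
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(docs)          # document-term matrix
    sims = cosine_similarity(X.T)        # term-term cosine similarities
    terms = vec.get_feature_names_out()

    i = list(terms).index("apple")
    ranked = sorted(zip(sims[i], terms), reverse=True)
    print([(t, round(s, 2)) for s, t in ranked if t != "apple"][:4])

Terms that tend to appear in the same documents as "apple" float to the top; in this toy corpus "iphone" ranks first, with the remaining co-occurring terms tied behind it.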

Find similar images in (pure) PHP / MySQL

点点圈 submitted on 2019-12-03 00:29:58
Question: My users are uploading images to my website and I would like to first offer them already-uploaded images. My idea is to:

1. create some kind of image "hash" of every existing image
2. create a hash of each newly uploaded image and compare it with the others in the database

I have found some interesting solutions like http://www.pureftpd.org/project/libpuzzle or http://phash.org/ etc., but they have one or more problems: they need some nonstandard extension to PHP (or are not in PHP at all) …
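The underlying trick those libraries use is small enough to reimplement. A hedged sketch of the average-hash ("aHash") idea, written in Python; the same steps port to PHP's bundled GD functions, and the 64-bit hash fits a MySQL BIGINT column for indexing:

    from PIL import Image  # assumes the Pillow package

    def ahash(path: str) -> int:
        # Shrink to 8x8 grayscale; each pixel above the mean becomes a 1 bit.
        img = Image.open(path).convert("L").resize((8, 8))
        pixels = list(img.getdata())
        avg = sum(pixels) / 64
        bits = 0
        for p in pixels:
            bits = (bits << 1) | (p > avg)
        return bits  # 64-bit perceptual hash

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    # Near-duplicates differ in only a few bits, e.g.:
    # hamming(ahash("upload.jpg"), ahash("existing.jpg")) <= 5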

Calculating Binary Data Similarity

耗尽温柔 submitted on 2019-12-03 00:23:52
Question: I've seen a few questions here related to determining the similarity of files, but they are all linked to a particular domain (images, sounds, text, etc). The techniques offered as solutions require knowledge of the underlying file format of the files being compared. What I am looking for is a method without this requirement, where arbitrary binary files could be compared without needing to understand what type of data they contain. That is, I am looking to determine the similarity percentage …
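The excerpt stops before any answers. One well-known format-agnostic approach (named here as an illustration, not taken from the thread) is the normalized compression distance: similar byte streams compress much better together than unrelated ones. A minimal Python sketch:

    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        # Normalized compression distance: near 0 for near-identical
        # inputs, approaching 1 for unrelated ones.
        cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
        cxy = len(zlib.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)

    print(ncd(b"a" * 1000, b"a" * 999))             # close to 0
    print(ncd(b"a" * 1000, bytes(range(256)) * 4))  # much closer to 1

This needs no knowledge of the file format, matching the question's requirement, at the cost of being slow on large files.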

Machine Learning in Action, Chapter 11: SVD

Anonymous (unverified) submitted on 2019-12-02 23:39:01
1 SVD theory

1.1 Definition of the singular value decomposition

SVD, the singular value decomposition, is familiar from matrix theory. The formula is:

    A = U Σ V^T

where A is m×n, U is an m×m orthogonal matrix of left singular vectors, Σ is an m×n diagonal matrix of singular values, and V is an n×n orthogonal matrix of right singular vectors. [figure: schematic of the factorization] In the singular value matrix the singular values are sorted in decreasing order, and they fall off very quickly, so we can use the largest k singular values to represent the original matrix while retaining most of the information. [figure: truncated SVD]

1.2 How SVD relates to PCA

We explained PCA earlier: PCA essentially projects the original data onto a lower-dimensional coordinate system, keeping the principal information and thereby reducing dimensionality. What exactly does SVD have to do with it? The derivation below makes it clear. [derivation shown as an image in the original] Comparing the two, the projection matrix P obtained in PCA is exactly the V of the singular value decomposition. To reduce the columns (features) to k:

    X' = X V_k          (m×n · n×k → m×k)

To reduce the rows (samples) to k:

    X' = U_k^T X        (k×m · m×n → k×n)

That is, the left singular matrix reduces the rows and is typically used to remove uninformative samples.

1.3 Advantages and uses

Advantages: PCA relies on the covariance matrix, but when both the number of samples m and the number of features n are large, the covariance matrix becomes expensive to compute and its eigenvalues and eigenvectors are hard to obtain. In that case SVD is much faster. One might object that computing the SVD also uses the covariance matrix; in fact that is only one way to obtain the decomposition, and in practice iterative methods are used, which we won't detail here. Just know that it is far faster to compute. If interested, see these two articles:

Parallel computation of singular values: https://www.cnblogs.com/zhangchaoyang/articles/2575948.html
A paper on computing singular values: http://www.cs.utexas
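A minimal numpy sketch of the truncation described above: keep the k largest singular values and the corresponding singular vectors, and the product stays close to the original matrix:

    import numpy as np

    A = np.random.rand(6, 4)                      # toy data matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation
    X_cols = A @ Vt[:k, :].T                      # features reduced to k columns
    X_rows = U[:, :k].T @ A                       # samples reduced to k rows

    # Small when the top-k singular values carry most of the energy:
    print(np.linalg.norm(A - A_k))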