cluster-analysis

Python: clustering similar words based on word2vec

一曲冷凌霜 posted on 2019-12-12 04:54:20
Question: This might be a naive question. I have a tokenized corpus on which I have trained Gensim's Word2vec model. The code is below:

    site = Article("http://www.datasciencecentral.com/profiles/blogs/blockchain-and-artificial-intelligence-1")
    site.download()
    site.parse()

    def clean(doc):
        stop_free = " ".join([i for i in word_tokenize(doc.lower()) if i not in stop])
        punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
        normalized = " ".join(lemma.lemmatize(word)

writing a similarity function for images for clustering data

99封情书 posted on 2019-12-12 04:51:27
Question: I know how to write a similarity function for data points in Euclidean space (by taking the negative min squared error). Now, if I want to test my clustering algorithms on images, how can I write a similarity function for data points in images? Do I base it on their RGB values or something else, and how?
Answer 1: I think we need to clarify some points first. Are you clustering only on color? If so, take the RGB values of the pixels and apply your metric function (minimize the sum of squared errors, or just calculate SAD -
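A minimal sketch of the two pixel-wise measures mentioned in this thread (negative sum of squared errors and negative SAD), assuming both images have the same size and are given as flat lists of (R, G, B) tuples. The function names and the sample pixels are made up for illustration; they do not come from the original post.

```python
def neg_sse(img_a, img_b):
    """Similarity as the negative sum of squared per-channel differences."""
    if len(img_a) != len(img_b):
        raise ValueError("images must contain the same number of pixels")
    return -sum((ca - cb) ** 2
                for pa, pb in zip(img_a, img_b)
                for ca, cb in zip(pa, pb))


def neg_sad(img_a, img_b):
    """Similarity as the negative sum of absolute per-channel differences (SAD)."""
    if len(img_a) != len(img_b):
        raise ValueError("images must contain the same number of pixels")
    return -sum(abs(ca - cb)
                for pa, pb in zip(img_a, img_b)
                for ca, cb in zip(pa, pb))


# Hypothetical two-pixel images, used only to exercise the functions.
img_1 = [(255, 0, 0), (0, 255, 0)]
img_2 = [(250, 0, 0), (0, 255, 10)]
```

Both functions return 0 for identical images and increasingly negative values as the images diverge, so a larger value means "more similar".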

How to find most similar terms/words of a document in doc2vec? [duplicate]

爱⌒轻易说出口 posted on 2019-12-12 04:08:49
Question: This question already has answers here: How to intrepret Clusters results after using Doc2vec? (3 answers) Closed 2 years ago. I have applied Doc2vec to convert documents into vectors. After that, I used the vectors for clustering and found the 5 nearest/most similar documents to the centroid of each cluster. Now I need to find the most dominant or important terms of these documents so that I can figure out the characteristics of each cluster. My question is: is there any way to figure

Drawbacks of K-Medoid (PAM) Algorithm

社会主义新天地 posted on 2019-12-12 03:55:53
Question: I have read that the K-medoid algorithm (PAM) is a partition-based clustering algorithm and a variant of the K-means algorithm. It solves some problems of K-means, such as producing empty clusters and sensitivity to outliers/noise. However, the time complexity of K-medoid is O(n^2), unlike K-means (Lloyd's algorithm), which has a time complexity of O(n). I would like to ask whether the K-medoid algorithm has other drawbacks besides its time complexity.
Answer 1: The main disadvantage of K-Medoid
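To illustrate the trade-off raised in the question, here is a small sketch (not the full PAM algorithm) that compares the mean with the medoid of a single 1-D cluster containing one outlier. The data values are invented for the example. Finding the medoid requires all pairwise distances, which is where the quadratic cost comes from.

```python
def mean(points):
    """Arithmetic mean, i.e. the centroid that k-means would use."""
    return sum(points) / len(points)


def medoid(points):
    """The member of the cluster with the smallest total distance to all others.

    Every point is compared with every other point, so this step alone
    costs O(n^2) distance evaluations.
    """
    return min(points, key=lambda p: sum(abs(p - q) for q in points))


# Hypothetical cluster with one outlier (100).
values = [1, 2, 3, 4, 100]
cluster_mean = mean(values)      # pulled toward the outlier
cluster_medoid = medoid(values)  # stays at an actual, central data point
```

Here the mean is dragged to 22.0 by the outlier, while the medoid remains 3.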

How to calculate clustering entropy - example and my solution given but is it correct? [closed]

只谈情不闲聊 posted on 2019-12-12 03:35:13
Question: Closed. This question is off-topic. It is not currently accepting answers. Closed 3 years ago. I would like to calculate the entropy of this example scheme: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html Equation of entropy. Then the entropy is (the first line). So the entropy for this scheme is:
For the first cluster: - ( (5/6)*Log(5/6) + (1/6)*Log(1/6) )
For the second cluster: - (
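The first-cluster expression above can be checked with a short sketch. It assumes the cluster holds 5 items of one class and 1 of another (as the 5/6 and 1/6 terms imply) and uses the natural logarithm, since the post does not say which base is meant; the function name is hypothetical.

```python
import math


def cluster_entropy(class_counts, base=math.e):
    """Entropy of one cluster given the number of items per class in it."""
    total = sum(class_counts)
    entropy = 0.0
    for count in class_counts:
        if count > 0:  # a class with zero items contributes nothing
            p = count / total
            entropy -= p * math.log(p, base)
    return entropy


# First cluster from the post: 5 items of one class, 1 of another.
first_cluster = cluster_entropy([5, 1])
```

With the natural log this evaluates to roughly 0.45; passing `base=2` gives the value in bits instead.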

Deciding input values to DBSCAN algorithm

南楼画角 posted on 2019-12-12 03:15:53
Question: I have written Python code that implements the DBSCAN clustering algorithm. My dataset consists of 14k users, each represented by 10 features. I am unable to decide what values to use for Min_samples and epsilon as input. How should I decide? The similarity measure is Euclidean distance (which makes it even harder to decide). Any pointers?
Answer 1: DBSCAN's parameters are often hard to estimate. Have you considered the OPTICS algorithm? In this case you only need
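One commonly used heuristic for epsilon, which is not part of the answer quoted here, is the sorted k-distance curve: compute each point's distance to its k-th nearest neighbor, sort the values, and look for the "knee". The sketch below is a brute-force version with made-up points; `k_distances` is a hypothetical helper name, and for 14k points a spatial index would be far faster than this O(n^2) loop.

```python
import math


def k_distances(points, k):
    """Return every point's Euclidean distance to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


# Hypothetical data: three close points and one far-away point.
sample = [(0, 0), (0, 1), (1, 0), (10, 10)]
curve = k_distances(sample, k=1)
```

The sharp jump at the end of `curve` marks the isolated point; an epsilon chosen just below such a jump treats it as noise. A frequent choice is to set k to the intended Min_samples value.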

Running OPTICS algorithm on ELKI

时光毁灭记忆、已成空白 posted on 2019-12-12 03:00:02
Question: I'm normally an R user (a beginner, but I'm starting to get the hang of it). However, I have heard positive things about ELKI, in particular its speed. I came across the old post "How to group nearby latitude and longitude locations stored in SQL", and the answer posted by Anony-Mousse is similar to what I'd like to do. I would like to replicate each step he took, up to the KML file he shared on Google Drive. I've downloaded ELKI and am able to run the mini-GUI,

Weighting k Means Clustering by number of observations

你说的曾经没有我的故事 posted on 2019-12-12 01:55:26
Question: I would like to cluster some data using k-means in R. The data looks as follows:

ADP  NS  CNTR  PP2V  EML  PP1V  ADDPS  FB  PP1D  ADR  ISV  PP2D  ADSEM  SUMALL  CONV
  2   0     0     1    0     0      0   0     0   12    0    12      0      53     0
  2   0     0     1    0     0      0   0     0   14    0    25      0      53     0
  2   0     0     1    0     0      0   0     0   15    0     0      0      53     0
  2   0     0     1    0     0      0   0     0   15    0     4      0      53     0
  2   0     0     1    0     0      0   0     0   17    0     0      0      53     0
  2   0     0     1    0     0      0   0     0   18    0     0      0     106     0
  2   0     0     1    0     0      0   0     0   23    0    10      0      53     0
  2   0     0     1    0     0      1   0     0    0    0     1      0     106     0
  2   0     0     1    0     0      3   0     0    0    0     0      0      53     0
  2   0     0     2    0     0      0   0     0    0    0     0      0    3922     0
  2   0     0     2    0     0      0   0     0    0    0     1      0

How to do column wise intersection with itertools

爱⌒轻易说出口 posted on 2019-12-12 01:47:27
Question: When I calculate the Jaccard similarity between each pair of my (m) training examples, each with 6 features (Age, Occupation, Gender, Product_range, Product_cat and Product), forming an (m*m) similarity matrix, I get a different outcome for the matrix. I have identified the source of the problem but do not possess an optimized solution for it. A sample of the dataset is below:

ID    AGE    Occupation  Gender  Product_range  Product_cat  Product
1100  25-34  IT          M       50-60          Gaming       XPS 6610
1101  35-44
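A minimal sketch of a column-wise Jaccard similarity for records like these, assuming each record is a dict keyed by feature name. Pairing every value with its column keeps a value in one column from matching the same string in another column. The first record uses the values from the sample row; every value in the second record except AGE is invented, because the post is cut off at that point.

```python
def jaccard_columnwise(rec_a, rec_b):
    """Jaccard similarity over (feature, value) pairs of two records."""
    set_a = set(rec_a.items())
    set_b = set(rec_b.items())
    union = set_a | set_b
    if not union:
        return 1.0  # two empty records are treated as identical
    return len(set_a & set_b) / len(union)


rec_1100 = {"AGE": "25-34", "Occupation": "IT", "Gender": "M",
            "Product_range": "50-60", "Product_cat": "Gaming",
            "Product": "XPS 6610"}
# Hypothetical second record: only AGE comes from the post.
rec_1101 = {"AGE": "35-44", "Occupation": "IT", "Gender": "M",
            "Product_range": "50-60", "Product_cat": "Gaming",
            "Product": "Example Laptop"}

similarity = jaccard_columnwise(rec_1100, rec_1101)
```

Four of the six columns match, so the intersection has 4 pairs and the union has 8, giving 0.5. The full (m*m) matrix could be filled by calling the function on each pair from `itertools.combinations`.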

How to determine the K value for k-means algorithm? [duplicate]

怎甘沉沦 posted on 2019-12-12 00:38:31
Question: This question already has answers here: Closed 7 years ago. Possible Duplicate: How do I determine k when using k-means clustering? How can we determine the value of K (the number of clusters) for the k-means algorithm?
Answer 1: Sometimes. There are various methods, which usually require trying different values of k and measuring which worked best. Here are some duplicate questions you missed: How to optimal K in K - Means Algorithm K-Means Algorithm Kmeans without knowing the number of
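As one illustration of "trying different values of k and measuring which worked best", the sketch below runs a tiny 1-D k-means for several k and records the within-cluster sum of squares (the basis of the so-called elbow heuristic). The data, the deterministic quantile-based initialization, and the function name are all assumptions made for this example, not something stated in the answer.

```python
def kmeans_inertia_1d(points, k, iters=50):
    """Run Lloyd's algorithm on 1-D data and return the within-cluster sum of squares."""
    pts = sorted(points)
    n = len(pts)
    # Deterministic start: pick k evenly spread points from the sorted data.
    centroids = [pts[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min((p - c) ** 2 for c in centroids) for p in pts)


# Hypothetical data with three obvious groups.
data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2, 8.8, 9.0, 9.2]
inertias = {k: kmeans_inertia_1d(data, k) for k in range(1, 5)}
```

On this data the inertia drops steeply up to k = 3 and only slightly afterwards, which is the "elbow" one would look for. On real data the bend is often much less clear.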