cluster-analysis

Is it possible to run a clustering algorithm with chunked distance matrices?

戏子无情 submitted on 2021-02-08 08:19:24
Question: I have a distance/dissimilarity matrix (30K rows × 30K columns) that is calculated in a loop and stored on disk. I would like to run clustering over this matrix. I import and cluster it as below:

Mydata <- read.csv("Mydata.csv")
Mydata <- as.dist(Mydata)
Results <- hclust(Mydata)

But when I convert the matrix to a dist object, I get a RAM limitation error. How can I handle this? Can I run the hclust algorithm in a loop / in chunks, i.e. divide the distance matrix into chunks and process them in a loop? Answer 1: You may
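The answer is truncated above. As a separate illustration (in Python rather than the question's R), one common way to stay within RAM is to avoid materializing the full 30K × 30K square matrix: read the CSV in chunks, keep only the upper triangle as a float32 condensed vector, and pass that to a hierarchical-clustering routine. The file name, chunk size, linkage method, and the assumption that the CSV is a plain numeric, symmetric matrix with no header row are all placeholders.

```python
# Rough memory arithmetic for a 30,000 x 30,000 distance matrix:
#   full float64 square matrix:   30000**2 * 8 bytes       ~ 7.2 GB
#   condensed float32 vector:     30000*29999/2 * 4 bytes  ~ 1.8 GB
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage

n = 30_000
condensed = np.empty(n * (n - 1) // 2, dtype=np.float32)

# Stream the CSV a few thousand rows at a time and keep only the entries
# above the diagonal, so the full square matrix never sits in memory at once.
pos = 0
row_offset = 0
for chunk in pd.read_csv("Mydata.csv", header=None, dtype=np.float32,
                         chunksize=2_000):
    block = chunk.to_numpy()
    for local_i, row in enumerate(block):
        i = row_offset + local_i
        upper = row[i + 1:]                      # d(i, i+1), ..., d(i, n-1)
        condensed[pos:pos + upper.size] = upper
        pos += upper.size
    row_offset += block.shape[0]

# linkage() accepts a condensed distance vector directly; note it may still
# upcast to float64 internally, so this is a sketch of the idea, not a
# guarantee that 30K points will fit on a small machine.
Z = linkage(condensed, method="average")
```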

Faster Kmeans Clustering on High-dimensional Data with GPU Support

落爺英雄遲暮 submitted on 2021-02-08 05:16:37
Question: We've been using k-means for clustering our logs. A typical dataset has 10 million samples with 100k+ features. To find the optimal k, we run multiple k-means instances in parallel and pick the one with the best silhouette score. In 90% of the cases we end up with k between 2 and 100. Currently we are using scikit-learn KMeans. For such a dataset, clustering takes around 24 hours on an EC2 instance with 32 cores and 244 GB of RAM. I have been researching a faster solution. What I have already tested:
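The list of alternatives the asker already tested is cut off above. For reference, a commonly suggested CPU-only baseline before reaching for GPUs is scikit-learn's MiniBatchKMeans, with the silhouette score computed on a random subsample rather than on all 10 million points. This is only a sketch; X, candidate_ks, the batch size and the subsample size are placeholders, not the asker's setup.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

def pick_k(X, candidate_ks, sample_size=10_000, seed=0):
    """Fit one MiniBatchKMeans per candidate k, score each on a subsample."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=sample_size, replace=False)
    best_k, best_score = None, -1.0
    for k in candidate_ks:
        km = MiniBatchKMeans(n_clusters=k, batch_size=10_000,
                             random_state=seed).fit(X)
        labels = km.predict(X[idx])
        if len(set(labels)) < 2:          # silhouette needs >= 2 clusters present
            continue
        score = silhouette_score(X[idx], labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score
```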

Plotting different cluster markers for every class in a scatter plot

左心房为你撑大大i submitted on 2021-02-07 18:31:54
Question: I have a scatter plot where I am plotting 14 clusters, but each 2 clusters belong to the same class, and they are all using the same markers. Every 50 rows is a cluster and every 100 rows is two clusters of the same class. What I want to do is change the marker for every 2 clusters, i.e. every 100 rows. Link for the Data Frame

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.pyplot import figure
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
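The code excerpt above is truncated. A minimal, self-contained sketch of one way to do this, using synthetic points instead of the linked data frame: walk over blocks of 100 rows (two 50-row clusters, i.e. one class) and give each block its own marker.

```python
import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=700)                       # 14 clusters x 50 rows = 700 points
y = rng.normal(size=700)

markers = ["o", "s", "^", "v", "D", "P", "X"]  # one marker per class (7 classes)
fig, ax = plt.subplots()
for cls in range(7):
    block = slice(cls * 100, (cls + 1) * 100)  # 100 rows = 2 clusters = 1 class
    ax.scatter(x[block], y[block], marker=markers[cls], label=f"class {cls}")
ax.legend()
plt.show()
```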

Efficient algorithm to group points in clusters by distance between every two points

这一生的挚爱 submitted on 2021-02-07 13:30:56
Question: I am looking for an efficient algorithm for the following problem: given a set of points in 2D space, where each point is defined by its X and Y coordinates, split the set into clusters so that if the distance between two arbitrary points is less than some threshold, these points must belong to the same cluster. In other words, such a cluster is a set of points which are 'close enough' to each other. The naive algorithm may look like this: Let R be a resulting list
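The naive algorithm above is cut off. A sketch of a faster approach, with made-up points and threshold: put the points in a k-d tree, collect every pair closer than the threshold, and take connected components of that graph, which is exactly the transitive "same cluster if within the threshold" rule described above.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

points = np.random.default_rng(0).random((1_000, 2))     # placeholder data
threshold = 0.05                                          # placeholder threshold

tree = cKDTree(points)
pairs = np.array(list(tree.query_pairs(r=threshold)))     # all (i, j) within r

n = len(points)
if len(pairs):
    graph = coo_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
                       shape=(n, n))
else:
    graph = coo_matrix((n, n))                             # no close pairs at all
# Each connected component of the "closer than threshold" graph is one cluster.
n_clusters, labels = connected_components(graph, directed=False)
```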

Clustering human faces from a video

烂漫一生 submitted on 2021-02-07 03:48:02
Question: I have run the face detection algorithm built into OpenCV to extract faces from each frame of a video (sampled at 1 fps). I have also resized each face image to the same size and cropped part of the image to remove background noise and hair. Now the problem is that I have to cluster these face images, with each cluster corresponding to a person. I implemented the algorithm described here: http://bitsearch.blogspot.in/2013/02/unsupervised-face-clustering-with-opencv.html Basically
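The description of the linked blog-post method is truncated above. As a separate, generic sketch (not the blog post's algorithm): since the face crops are already the same size, each one can be flattened, reduced with PCA, and clustered agglomeratively with a distance cutoff, so the number of people does not have to be known in advance. The faces array, the component count, and the threshold are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering

def cluster_faces(faces, n_components=50, distance_threshold=60.0):
    """faces: array of shape (n_faces, H, W), grayscale crops of equal size."""
    X = faces.reshape(len(faces), -1).astype(np.float64)
    # "Eigenfaces"-style reduction; n_components must not exceed the face count.
    X = PCA(n_components=n_components).fit_transform(X)
    model = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=distance_threshold,
                                    linkage="average")
    return model.fit_predict(X)        # one integer label (person id) per face
```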

sample_weight option in the ELKI implementation of DBSCAN

丶灬走出姿态 submitted on 2021-02-05 08:50:07
Question: My goal is to find outliers in a dataset that contains many near-duplicate points, and I want to use the ELKI implementation of DBSCAN for this task. As I don't care about the clusters themselves, just the outliers (which I assume are relatively far from the clusters), I want to speed up the runtime by aggregating/binning points on a grid and using the concept implemented in scikit-learn as sample_weight. Can you please show minimal code to do a similar analysis in ELKI? Let's assume my dataset
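The question is truncated above. ELKI's Java API is not reproduced here; for reference, this is a sketch of the scikit-learn behaviour the question points at, with made-up grid and DBSCAN parameters: snap points to a grid, keep one representative per occupied cell, and pass the cell counts to DBSCAN as sample_weight so each representative counts as many points.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def weighted_dbscan(points, grid=0.01, eps=0.05, min_samples=10):
    """Bin near-duplicate points on a grid, then run DBSCAN on the bins."""
    keys = np.floor(points / grid).astype(np.int64)
    uniq, inverse, counts = np.unique(keys, axis=0,
                                      return_inverse=True, return_counts=True)
    centers = (uniq + 0.5) * grid                # one representative per cell
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        centers, sample_weight=counts)
    return labels[inverse]                       # label for every original point

# Example (synthetic data):
# labels = weighted_dbscan(np.random.default_rng(0).random((5_000, 2)))
```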

Plot causes “Error: Incorrect Number of Dimensions”

北城以北 submitted on 2021-02-01 05:17:26
Question: I am learning about the "kohonen" package in R for the purpose of making Self-Organizing Maps (SOMs, also called Kohonen networks, a type of machine learning algorithm). I am following this R tutorial: https://www.rpubs.com/loveb/som I tried to create my own data (this time with both "factor" and "numeric" variables) and run the SOM algorithm (this time using the "supersom()" function instead):

#load libraries and adjust colors
library(kohonen) #fitting SOMs
library(ggplot2