dbscan

DBSCAN Clustering

Anonymous (unverified), submitted 2019-12-03 00:03:02
## DBSCAN

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# 1. Load the data
# data = pd.read_csv('')

# 2. Preprocess the data
# (omitted; ultimately produces x_train and x_test)
x_train = np.array([[1, 2, 3], [1, 4, 6], [1, 0, 9],
                    [4, 6, 1], [7, 8, 9], [4, 5, 6],
                    [5, 1, 3], [5, 6, 2], [6, 2, 1]])

# 3. Train the model
model = DBSCAN(eps=3, min_samples=2)
model.fit(x_train)

# 4. Inspect the predicted labels
print(model.labels_)

# Parameter list and tuning notes:
# DBSCAN(eps=0.5, min_samples=5, metric='euclidean', metric_params=None,
#        algorithm='auto', leaf_size=30, p=None, n_jobs=None)
# eps : float, optional. The maximum distance between two samples for one to be
#   considered in the neighborhood of the other. This is not a maximum bound on
#   the distances of points within a cluster. It is the most important DBSCAN
#   parameter to choose appropriately for your data set and distance function.
# min
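The excerpt stresses that eps is the most important parameter to tune. A common heuristic, not shown in the excerpt, is to sort each point's distance to its k-th nearest neighbor and pick eps near the "elbow" of that curve; below is a minimal sketch of that idea on the same toy data (setting k equal to min_samples is an assumption on my part):

import numpy as np
from sklearn.neighbors import NearestNeighbors

x_train = np.array([[1, 2, 3], [1, 4, 6], [1, 0, 9],
                    [4, 6, 1], [7, 8, 9], [4, 5, 6],
                    [5, 1, 3], [5, 6, 2], [6, 2, 1]])

k = 2  # mirrors min_samples=2 from the example above
# n_neighbors=k+1 because each point is its own nearest neighbor at distance 0
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(x_train).kneighbors(x_train)
print(np.sort(dist[:, -1]))  # a jump in these sorted values suggests a value for eps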

MATLAB Practice Program (DBSCAN)

Anonymous (unverified), submitted 2019-12-02 23:48:02
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise; it is a density-based clustering algorithm. Unlike k-means, it does not need to know the number of clusters in advance. From a programming point of view, the algorithm proceeds as follows (a Python sketch of this loop follows the excerpt):

1. Pick a data point to process.
2. Find all points whose distance to the current point is within the configured radius.
3. Put the points found within the radius into a queue.
4. Take the point at the head of the queue as the new current point and repeat step 2.
5. Once every point in the queue has been visited, record those points as one cluster.
6. Pick a still-unprocessed point as the next starting point and go back to step 2.
7. When all points have been processed, the algorithm ends.

The result looks roughly like the figure below. I do not output outliers separately here, but adding a threshold on the neighbor count should handle that; it is an easy change. The code is as follows:

clear all; close all; clc;

theta = 0:0.01:2*pi;
p1 = [3*cos(theta) + rand(1,length(theta))/2; 3*sin(theta) + rand(1,length(theta))/2];  % generate test data
p2 = [2*cos(theta) + rand(1,length(theta))/2; 2*sin(theta) + rand(1,length(theta))/2];
p3 = [cos(theta) + rand(1,length(theta
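The seven steps above describe a breadth-first flood fill. As promised, here is a minimal Python sketch of that loop, without the post's suggested outlier threshold (the toy data and eps are placeholders):

import numpy as np
from collections import deque

def cluster_by_flood_fill(points, eps):
    """Assign cluster labels by flood-filling eps-neighborhoods (steps 1-7 above)."""
    labels = np.full(len(points), -1)  # -1 means "not yet processed"
    cluster = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue                          # step 6: only unprocessed points start a cluster
        labels[start] = cluster
        queue = deque([start])
        while queue:                          # steps 2-5
            current = queue.popleft()
            dists = np.linalg.norm(points - points[current], axis=1)
            for neighbor in np.flatnonzero(dists <= eps):
                if labels[neighbor] == -1:
                    labels[neighbor] = cluster
                    queue.append(neighbor)
        cluster += 1                          # queue exhausted: one cluster recorded
    return labels

# toy usage; eps=0.5 is an arbitrary choice
points = np.array([[0.0, 0.0], [0.3, 0.1], [5.0, 5.0], [5.2, 4.9]])
print(cluster_by_flood_fill(points, eps=0.5))  # -> [0 0 1 1]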

Common Data Mining Algorithms: Clustering

Anonymous (unverified), submitted 2019-12-02 23:38:02
Overview. Data mining, often also called value discovery or data exploration, generally refers to extracting implicit, previously unknown, yet potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random real-world data. It is an advanced way of processing large amounts of data. Commonly used data mining algorithms fall into four broad categories: clustering, classification, association, and recommendation algorithms; there is also a preprocessing category, dimensionality-reduction algorithms.

Clustering algorithms. Clustering takes a set of samples with unknown class labels and uses some algorithm to divide them into groups; it is a form of unsupervised learning. It mainly studies the logical or physical relationships among data. A cluster is a collection of data objects that are similar to other objects in the same cluster and dissimilar to objects in other clusters. The results not only reveal intrinsic connections and differences within the data but also provide an important basis for further data analysis and knowledge discovery. The clustering effect is illustrated in the accompanying figure. Commonly used clustering algorithms include k-means, Canopy, FCM (Fuzzy C-Means), DBSCAN (Density-Based Spatial Clustering of Applications with Noise), LDA (Latent Dirichlet Allocation), hierarchical clustering, and EM (Expectation-Maximization) based clustering. The following introduces each of the above clustering algorithms, starting with the algorithm's

Implementing the DBSCAN Clustering Algorithm in Python (a Simple Test Example)

Anonymous (unverified), submitted 2019-12-02 22:51:30
Find core samples of high density and expand clusters from them. The Python code is as follows:

# -*- coding: utf-8 -*-
"""
Demo of DBSCAN clustering algorithm
Finds core samples of high density and expands clusters from them.
"""
print(__doc__)

# Import the relevant packages
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs  # newer scikit-learn: from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Initialize the sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)
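The excerpt stops at the preprocessing step. In the official scikit-learn demo this snippet follows, the next lines fit the model and report basic metrics; a short continuation reusing the imports and X from above (eps and min_samples match that demo):

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_

# points labeled -1 are treated as noise
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print('Silhouette Coefficient: %0.3f' % metrics.silhouette_score(X, labels))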

Using ELKI's Distance Function

折月煮酒, submitted 2019-12-02 06:35:56
Question: This is a follow-up to a previous question, where we noted that using Euclidean distances with lat/long coordinates does not yield correct results. I read in the documentation that ELKI supports geographic data, namely in its distance functions, which are available in the various clustering algorithms. In ELKI's user interface, I can see there are options to replace the default distance function (Euclidean) with a better-suited one. I also see that in that case, you need to provide a datum, which
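The question concerns ELKI's geographic distance functions. For comparison only (this is not the ELKI API), scikit-learn handles the same lat/long problem with a haversine metric on coordinates converted to radians, with eps expressed as an angle on the unit sphere; the coordinates and the 500 m radius below are illustrative choices:

import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical (lat, lon) points in degrees
coords_deg = np.array([[41.3819, 2.1720],
                       [41.3825, 2.1731],
                       [41.4036, 2.1744]])

earth_radius_m = 6371000.0
eps_m = 500.0  # illustrative neighborhood radius of 500 meters

# haversine expects (lat, lon) in radians; eps becomes meters / Earth radius
db = DBSCAN(eps=eps_m / earth_radius_m, min_samples=2,
            metric='haversine', algorithm='ball_tree')
print(db.fit_predict(np.radians(coords_deg)))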

Using a Geo Distance Function on ELKI

我的梦境, submitted 2019-12-02 01:20:33
I am using ELKI to mine some geospatial data (lat/long pairs) and I am quite concerned about using the right data types and algorithms. On the parameterizer of my algorithm, I tried to replace the default distance function with a geo function (LngLatDistanceFunction, as I am using x,y data) as below:

params.addParameter(DISTANCE_FUNCTION_ID, geo.LngLatDistanceFunction.class);

However, the results are quite surprising: it creates clusters of a repeated point, such as the example below:

(2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41
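One possible culprit worth ruling out before blaming the distance function (an assumption on my part, not something the question establishes) is the NaN in the third column of the output: any distance computed against a NaN coordinate is itself NaN. A quick pre-filter, sketched in Python with a made-up array shaped like the question's output:

import numpy as np

# hypothetical rows shaped like the output above: (x, y, value)
X = np.array([[2.17199922, 41.38190043, np.nan],
              [2.17199922, 41.38190043, np.nan],
              [2.17199922, 41.38190043, 0.5]])

mask = ~np.isnan(X).any(axis=1)  # keep only rows with no NaN entries
X_clean = X[mask]
print(int((~mask).sum()), "rows dropped because of NaN values")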

DBSCAN error with cosine metric in python

泪湿孤枕, submitted 2019-12-01 17:16:27
I was trying to use the DBSCAN algorithm from the scikit-learn library with the cosine metric but got stuck with an error. The line of code is db = DBSCAN(eps=1, min_samples=2, metric='cosine').fit(X), where X is a csr_matrix. The error is the following: Metric 'cosine' not valid for algorithm 'auto', though the documentation says that it is possible to use this metric. I tried the options algorithm='kd_tree' and 'ball_tree' but got the same error. However, there is no error if I use the euclidean or, say, l1 metric. The matrix X is large, so I can't use a precomputed matrix of pairwise distances. I use python 2
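The tree-based neighbor indexes (kd_tree, ball_tree) only support a limited set of metrics, and cosine is not among them; forcing a brute-force neighbor search is the standard workaround in scikit-learn, and it also accepts sparse input. A minimal sketch (the tiny matrix and eps are placeholders):

from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# stand-in for the question's large sparse matrix
X = csr_matrix([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

# 'brute' computes pairwise cosine distances directly instead of using a tree index
db = DBSCAN(eps=0.5, min_samples=2, metric='cosine', algorithm='brute').fit(X)
print(db.labels_)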

What are some packages that implement semi-supervised (constrained) clustering?

佐手、, submitted 2019-11-30 09:42:38
I want to run some experiments on semi-supervised (constrained) clustering, in particular with background knowledge provided as instance-level pairwise constraints (must-link or cannot-link constraints). I would like to know if there are any good open-source packages that implement semi-supervised clustering. I looked at PyBrain, mlpy, scikit and orange, and I couldn't find any constrained clustering algorithms. In particular, I'm interested in constrained k-means or constrained density-based clustering algorithms (like C-DBSCAN). Packages in Matlab, Python, Java or C++ would be
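To make the constraint vocabulary concrete, here is a tiny hypothetical sketch (not taken from any of the packages mentioned) of what satisfying these constraints means: a must-link pair must end up in the same cluster and a cannot-link pair in different ones:

# hypothetical check of a label assignment against pairwise constraints
def violates_constraints(labels, must_link, cannot_link):
    broken_ml = any(labels[a] != labels[b] for a, b in must_link)
    broken_cl = any(labels[a] == labels[b] for a, b in cannot_link)
    return broken_ml or broken_cl

labels = [0, 0, 1, 1, 2]  # made-up cluster assignment
print(violates_constraints(labels,
                           must_link=[(0, 1), (2, 3)],
                           cannot_link=[(0, 4)]))  # -> False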

scikit-learn: clustering text documents using DBSCAN

不想你离开。, submitted 2019-11-30 04:47:48
I'm trying to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have problems with specific issues. Most of the examples I found illustrate clustering using scikit-learn with k-means as the clustering algorithm. Adapting these k-means examples to my setting works in principle. However, k-means is not suitable since I don't know the number of clusters. From what I have read so far -- please correct me here if needed -- DBSCAN or MeanShift seem to be more appropriate in my case. The scikit-learn website provides examples for each clustering algorithm. The problem
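Since the number of clusters is unknown, DBSCAN is indeed a reasonable fit. A minimal text-clustering sketch with TF-IDF features (the corpus, eps, and min_samples are placeholders; cosine distance with brute-force search is one common choice for sparse text vectors):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

docs = ["the cat sat on the mat",
        "a cat lay on a mat",
        "stock markets fell sharply today"]  # hypothetical corpus

X = TfidfVectorizer(stop_words='english').fit_transform(docs)
# cosine distance suits sparse TF-IDF vectors; 'brute' search supports it
labels = DBSCAN(eps=0.7, min_samples=2, metric='cosine',
                algorithm='brute').fit_predict(X)
print(labels)  # -1 marks documents treated as noise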