data-mining

R Random Forests Variable Importance

谁都会走 submitted on 2019-11-30 06:09:28
Question: I am trying to use the randomForest package for classification in R. The variable importance measures listed are: mean raw importance score of variable x for class 0, mean raw importance score of variable x for class 1, MeanDecreaseAccuracy, and MeanDecreaseGini. Now, I know what these mean, as in I know their definitions. What I want to know is how to use them. What I really want to know is what these values mean in the context of how accurate they are: what is a good value, what is a bad …
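Although the question is about R's randomForest package, the analogous measures can be sketched with scikit-learn (an illustrative substitute, not the asker's code): impurity-based importances correspond roughly to MeanDecreaseGini, and permutation importances to MeanDecreaseAccuracy.

```python
# Sketch with scikit-learn instead of R's randomForest; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=2, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

gini_imp = rf.feature_importances_   # analogue of MeanDecreaseGini (sums to 1)
perm = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
perm_imp = perm.importances_mean     # analogue of MeanDecreaseAccuracy
print(gini_imp, perm_imp)
```

These values are relative, not absolute: they are most useful for ranking features against each other within one fitted model, not as a standalone accuracy score.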

What is the difference between linear regression and logistic regression?

时间秒杀一切 submitted on 2019-11-30 06:09:27
Question: When we have to predict the value of a categorical (or discrete) outcome, we use logistic regression. I believe we use linear regression to also predict the value of an outcome given the input values. What, then, is the difference between the two methodologies? Answer 1: Linear regression output as probabilities. It's tempting to use the linear regression output as probabilities, but it's a mistake, because the output can be negative or greater than 1, whereas a probability cannot. As regression might …
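The answer's point can be demonstrated directly (a minimal sketch on made-up data): a linear regression fit to 0/1 labels can predict values outside [0, 1], while logistic regression's predicted probabilities cannot.

```python
# Fit both models to the same binary outcome and extrapolate.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.arange(10).reshape(-1, 1).astype(float)
y = (X.ravel() > 4).astype(int)          # 0/1 outcome

lin = LinearRegression().fit(X, y)
log = LogisticRegression().fit(X, y)

X_new = np.array([[-5.0], [20.0]])       # beyond the training range
lin_pred = lin.predict(X_new)            # can fall below 0 or exceed 1
log_prob = log.predict_proba(X_new)[:, 1]  # always strictly inside (0, 1)
print(lin_pred, log_prob)
```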

Scikit-learn: How to run KMeans on a one-dimensional array?

假装没事ソ submitted on 2019-11-30 04:48:21
I have an array of 13,876 values between 0 and 1. I would like to apply sklearn.cluster.KMeans to this vector alone to find the different clusters into which the values are grouped. However, it seems KMeans works with multidimensional arrays, not one-dimensional ones. I guess there is a trick to make it work, but I don't know how. I saw that KMeans.fit() accepts "X : array-like or sparse matrix, shape=(n_samples, n_features)", but it wants n_samples to be bigger than one. I tried putting my array into a np.zeros() matrix and running KMeans, but then it puts all the non-null …
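The usual fix is to reshape the 1-D vector to shape (n_samples, 1), so that each value becomes a sample with a single feature (sketch with random stand-in data):

```python
# Reshape a 1-D vector into an (n_samples, 1) matrix for KMeans.
import numpy as np
from sklearn.cluster import KMeans

values = np.random.RandomState(0).rand(13876)   # stand-in for the asker's values
X = values.reshape(-1, 1)                       # (13876, 1), not (1, 13876)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_.shape)                         # one cluster label per value
```

The key point is that `reshape(-1, 1)` makes each value its own row (sample); a `(1, 13876)` shape would instead be interpreted as a single sample with 13,876 features.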

scikit-learn: clustering text documents using DBSCAN

不想你离开。 submitted on 2019-11-30 04:47:48
I'm trying to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have problems with specific issues. Most of the examples I found illustrate clustering using scikit-learn with k-means as the clustering algorithm. Adapting these k-means examples to my setting works in principle. However, k-means is not suitable, since I don't know the number of clusters. From what I have read so far -- please correct me here if needed -- DBSCAN or MeanShift seem to be more appropriate in my case. The scikit-learn website provides examples for each clustering algorithm. The problem …
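A minimal sketch of the approach under discussion, on made-up documents: vectorize with TF-IDF, then run DBSCAN with a cosine metric, which does not need the number of clusters up front (eps and min_samples are illustrative values that would need tuning on real data).

```python
# TF-IDF features plus DBSCAN; -1 in the output marks noise points.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

docs = [
    "the cat sat on the mat", "a cat sat on a mat", "cats like mats",
    "stock prices fell today", "the stock market fell", "prices fell on the market",
]
X = TfidfVectorizer().fit_transform(docs)
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(X)
print(labels)
```

Note that without stemming, "cats like mats" shares no terms with the other cat documents and ends up as noise; preprocessing choices matter as much as the clustering algorithm.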

Bytes vs Characters vs Words - which granularity for n-grams?

[亡魂溺海] submitted on 2019-11-30 03:55:01
Question: At least three types of n-grams can be considered for representing text documents: byte-level n-grams, character-level n-grams, and word-level n-grams. It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs". Are there other criteria to consider for choosing the "right" representation? Answer 1: …
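The typo-robustness claim can be checked concretely with scikit-learn's CountVectorizer, which supports both word- and character-level n-grams via its `analyzer` parameter (a small sketch using the question's own example):

```python
# Compare how many features the typo'd pair shares at each granularity.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Mary loves dogs", "Mary lpves dogs"]

word = CountVectorizer(analyzer="word", ngram_range=(1, 1)).fit(docs)
char = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit(docs)

w0, w1 = word.transform(docs).toarray()
c0, c1 = char.transform(docs).toarray()
shared_words = int((w0 * w1 > 0).sum())      # word unigrams in common
shared_trigrams = int((c0 * c1 > 0).sum())   # character 3-grams in common
print(shared_words, shared_trigrams)
```

The typo destroys the whole word "loves" at the word level, but only the three character trigrams that overlap the typo'd letter, so the character representation keeps the two strings much closer.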

dbscan - setting limit on maximum cluster span

China☆狼群 submitted on 2019-11-30 03:53:58
By my understanding of DBSCAN, it's possible for you to specify an epsilon of, say, 100 meters and — because DBSCAN takes into account density-reachability, not direct density-reachability, when finding clusters — end up with a cluster in which the maximum distance between any two points is greater than 100 meters. In a more extreme case, it seems possible that you could set an epsilon of 100 meters and end up with a cluster spanning 1 kilometer: see [2][6] in this array of images from scikit-learn for an example of when that might occur. (I'm more than willing to be told I'm a total idiot and am …

PCA For categorical features?

允我心安 submitted on 2019-11-30 03:16:49
In my understanding, PCA can be performed only on continuous features. But while trying to understand the difference between one-hot encoding and label encoding, I came across a post at the following link: When to use One Hot Encoding vs LabelEncoder vs DictVectorizor? It states that one-hot encoding followed by PCA is a very good method, which basically means PCA is applied to categorical features. I am confused; please advise. I disagree with the others. While you can use PCA on binary data (e.g. one-hot encoded data), that does not mean it is a good thing, or that it …
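For concreteness, here is a sketch of the pattern being debated (one-hot encode, then PCA); whether it is advisable is exactly what the answers dispute. The data and category names are made up.

```python
# One-hot encode a categorical column by hand, then apply PCA.
import numpy as np
from sklearn.decomposition import PCA

colors = np.array(["red", "blue", "green", "red", "blue", "green"])
categories = np.unique(colors)
onehot = (colors[:, None] == categories).astype(float)  # (6, 3), one column per category
reduced = PCA(n_components=2).fit_transform(onehot)     # rows sum to 1, so rank is only 2
print(onehot.shape, reduced.shape)
```

Note the mechanical quirk: because each one-hot row sums to 1, the k columns for a k-level factor are linearly dependent, so PCA on one-hot data always finds at least one zero-variance direction.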

what is the bootstrapped data in data mining?

穿精又带淫゛_ submitted on 2019-11-30 02:59:10
Question: I recently came across this term, but really have no idea what it refers to. I've searched online, but with little gain. Thanks. Answer 1: If you don't have enough data to train your algorithm, you can increase the size of your training set by uniformly randomly selecting items and duplicating them (with replacement). Answer 2: Take a sample of the time of day that you wake up on Saturdays. Some Friday nights you have a few too many drinks, so you wake up early (but go back to bed). Other days you wake …
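Answer 1's procedure, sampling with replacement until the bootstrap sample is the same size as the original dataset, can be sketched in a few lines (synthetic data):

```python
# Draw a bootstrap sample: n items chosen uniformly at random,
# with replacement, from a dataset of size n.
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(100)                                    # original dataset
sample = rng.choice(data, size=len(data), replace=True)  # bootstrap sample
unique_frac = len(np.unique(sample)) / len(data)
print(round(unique_frac, 2))
```

Because of the replacement, some items appear multiple times and others not at all; on average a bootstrap sample contains about 63% (1 - 1/e) of the distinct original items.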

Trajectory Clustering: Which Clustering Method?

自古美人都是妖i submitted on 2019-11-29 23:16:57
As a newbie in machine learning, I have a set of trajectories that may be of different lengths. I wish to cluster them, because some of them are actually the same path and only SEEM different due to noise. In addition, not all of them are the same length. So although Trajectory A is not the same as Trajectory B, it may be part of Trajectory B; I wish to preserve this property after clustering as well. I have only a bit of knowledge of k-means clustering and fuzzy c-means clustering. How may I choose between the two? Or should I adopt other methods? Any method that takes …
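One common ingredient for comparing trajectories of different lengths is a dynamic time warping (DTW) distance, which distance-based methods such as k-medoids or hierarchical clustering can then use in place of the Euclidean distance k-means requires. A minimal 1-D sketch of the distance itself (not a full clustering pipeline):

```python
# Dynamic time warping distance between two sequences of possibly
# different lengths, via the standard O(n*m) dynamic program.
def dtw(a, b):
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

print(dtw([0, 1, 2], [0, 0, 1, 2]))   # 0.0: identical up to warping
```

Because DTW aligns sequences elastically, a trajectory that is a noisy or stretched version of another gets a small distance even when the lengths differ, which matches the property the question asks for.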

How can I perform K-means clustering on time series data?

烂漫一生 submitted on 2019-11-29 23:09:35
How can I do K-means clustering of time series data? I understand how this works when the input data is a set of points, but I don't know how to cluster a time series of shape 1×M, where M is the data length. In particular, I'm not sure how to update the mean of a cluster for time series data. I have a set of labelled time series, and I want to use the K-means algorithm to check whether I will get back a similar label or not. My X matrix will be N×M, where N is the number of time series and M is the data length, as mentioned above. Does anyone know how to do this? For example, how could I modify this k …
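If each row of the N×M matrix is one series, plain KMeans already works: it treats the M time points as M features, and each cluster mean is the pointwise average of its member series. A sketch on synthetic data (two groups: noisy flat series and noisy rising series):

```python
# Stack series as rows of an (N, M) matrix and cluster with KMeans.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
M = 50
flat = rng.normal(0.0, 0.1, size=(10, M))                       # 10 noisy flat series
rising = np.linspace(0, 1, M) + rng.normal(0.0, 0.1, size=(10, M))
X = np.vstack([flat, rising])                                   # N=20 series, M=50 points

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)          # cluster_centers_ are pointwise-mean series
```

The caveat is that this pointwise distance assumes all series are the same length and aligned in time; for series that are shifted or stretched, an elastic distance such as DTW is usually preferred, but then the plain k-means mean update no longer applies directly.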