data-mining

Bytes vs Characters vs Words - which granularity for n-grams?

北慕城南 submitted on 2019-11-30 20:34:37
At least three types of n-grams can be considered for representing text documents: byte-level n-grams, character-level n-grams, and word-level n-grams. It's unclear to me which one should be used for a given task (clustering, classification, etc.). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs". Are there other criteria to consider for choosing the "right" representation? Evaluate. The criterion for choosing the representation is whatever works. Indeed, character level (!=
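The typo-robustness argument above can be made concrete with a small sketch: compare Jaccard similarity over character trigrams versus over whole words for the "lpves" example (the strings and the typo are from the question; the trigram size and Jaccard measure are illustrative choices, not something the original post prescribes).

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 for identical sets)."""
    return len(a & b) / len(a | b) if a | b else 1.0

clean = char_ngrams("mary loves dogs")
typo  = char_ngrams("mary lpves dogs")   # single-character typo
sim_chars = jaccard(clean, typo)

# Word level: the typo turns "loves" into an entirely unseen token.
sim_words = jaccard(set("mary loves dogs".split()),
                    set("mary lpves dogs".split()))

print(sim_chars, sim_words)
```

Character trigrams preserve most of the overlap (10 of 16 trigrams survive the typo), whereas at word level a single typo wipes out a whole token, so the character-level similarity comes out noticeably higher.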

clustering on very large sparse matrix?

社会主义新天地 submitted on 2019-11-30 18:49:58
Question: I am trying to do some (k-means) clustering on a very large matrix. The matrix is approximately 500000 rows x 4000 cols, yet very sparse (only a couple of "1" values per row). I want to get around 2000 clusters. I have two questions: - Can someone recommend an open-source platform or tool for doing that (maybe using k-means, maybe something better)? - How can I best estimate the time the algorithm will need to finish? I tried Weka once, but aborted the job after a couple of days because I
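One commonly suggested route for data of this shape is mini-batch k-means on a sparse matrix, which scikit-learn supports directly; the sketch below uses a small random binary matrix as a stand-in for the 500000 x 4000 one (the sizes, density, and cluster count here are scaled-down illustrative values, not the questioner's actual data).

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.cluster import MiniBatchKMeans

# Toy stand-in for the 500000 x 4000 binary matrix: small but very sparse.
X = sparse_random(2000, 400, density=0.005, format="csr", random_state=0)
X.data[:] = 1.0  # binary entries, as described in the question

# MiniBatchKMeans accepts sparse CSR input and scales to large row counts.
km = MiniBatchKMeans(n_clusters=20, batch_size=1024, n_init=3, random_state=0)
labels = km.fit_predict(X)
print(labels.shape)
```

For a rough runtime estimate, one practical approach is to time a run on a random subsample (say 1% of rows) and extrapolate, since mini-batch k-means cost grows roughly linearly in the number of rows.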

what is the bootstrapped data in data mining?

℡╲_俬逩灬. submitted on 2019-11-30 18:36:12
Recently I came across this term, but I really have no idea what it refers to. I've searched online, but with little gain. Thanks. If you don't have enough data to train your algorithm, you can increase the size of your training set by (uniformly) randomly selecting items and duplicating them (with replacement). Take a sample of the times of day that you wake up on Saturdays. Some Friday nights you have a few too many drinks, so you wake up early (but go back to bed). Other days you wake up at a normal time. Other days you sleep in. Here are the results: [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9,
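The resampling-with-replacement idea described above can be sketched in a few lines; the wake-up times are the values shown in the answer (the truncated list is used as-is), and using the bootstrap to get a rough confidence interval for the mean is one standard application, offered here as an illustration.

```python
import random

random.seed(0)
sample = [3.1, 4.8, 6.3, 6.4, 6.6, 7.3, 7.5, 7.7, 7.9]  # wake-up times (hours)

def bootstrap_means(data, n_resamples=1000):
    """Resample with replacement and collect the mean of each resample."""
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(data) for _ in data]  # same size, with replacement
        means.append(sum(resample) / len(resample))
    return means

means = sorted(bootstrap_means(sample))
lo, hi = means[25], means[974]  # rough 95% percentile interval for the mean
print(lo, hi)
```

Each resample is the same size as the original data but drawn with replacement, so some observations repeat and others are left out; the spread of the resampled means estimates the uncertainty of the sample mean without collecting new data.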

How to select top 100 features(a subset) which are most relevant after pca?

邮差的信 submitted on 2019-11-30 18:10:50
Question: I performed PCA on a 63*2308 matrix and obtained a score matrix and a coefficient matrix. The score matrix is 63*2308 and the coefficient matrix is 2308*2308 in dimensions. How do I extract the column names of the top 100 features which are most important, so that I can perform regression on them? Answer 1: PCA should give you both a set of eigenvectors (your coefficient matrix) and a vector of eigenvalues (1*2308), often referred to as lambda. You might need to use a different PCA function in MATLAB
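The question is about MATLAB, but the same idea can be sketched in Python: compute PCA via the SVD, then score each original feature by its absolute loadings on the leading components, weighted by explained variance. Note that ranking original features by PCA loadings is a heuristic, not part of PCA itself, and the matrix shape and feature names below are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(63, 120))               # stand-in for the 63 x 2308 matrix
feature_names = [f"f{i}" for i in range(X.shape[1])]

# PCA via SVD of the centered data: the columns of Vt.T are the coefficient
# (loading) vectors, and S**2 / (n - 1) are the eigenvalues ("lambda").
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals = S**2 / (X.shape[0] - 1)

# Heuristic feature score: absolute loadings on the top-k components,
# weighted by the variance each component explains.
k = 10
weights = eigvals[:k] / eigvals[:k].sum()
importance = np.abs(Vt[:k].T) @ weights
top100 = [feature_names[i] for i in np.argsort(importance)[::-1][:100]]
print(top100[:5])
```

An alternative worth mentioning: if the end goal is regression, regressing on the top principal-component *scores* directly (principal component regression) avoids picking individual columns at all.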

What techniques/tools are there for discovering common phrases in chunks of text?

醉酒当歌 submitted on 2019-11-30 15:28:20
Question: Let's say I have 100000 email bodies and 2000 of them contain an arbitrary common string like "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet". What techniques could/should I use to "mine" these phrases? I'm not interested in mining single words or short phrases. Also, I need to filter out phrases that I already know occur in all mails. Example: string mailbody1 = "Welcome to the world of tomorrow! This is the first mail body. Lorem ipsum dolor sit AMET. Have a nice

What techniques/tools are there for discovering common phrases in chunks of text?

不羁的心 submitted on 2019-11-30 14:37:57
Let's say I have 100000 email bodies and 2000 of them contain an arbitrary common string like "the quick brown fox jumps over the lazy dog" or "lorem ipsum dolor sit amet". What techniques could/should I use to "mine" these phrases? I'm not interested in mining single words or short phrases. Also, I need to filter out phrases that I already know occur in all mails. Example: string mailbody1 = "Welcome to the world of tomorrow! This is the first mail body. Lorem ipsum dolor sit AMET. Have a nice day dude. Cya!"; string mailbody2 = "Welcome to the world of yesterday! Lorem ipsum dolor sit amet
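A brute-force baseline for this kind of phrase mining is to count word n-grams by document frequency and keep those that recur across documents; serious solutions use suffix arrays or frequent-sequence mining, but the sketch below shows the idea on the example mail bodies (the third mail body and the thresholds are illustrative additions).

```python
from collections import Counter

mails = [
    "Welcome to the world of tomorrow! This is the first mail body. "
    "Lorem ipsum dolor sit AMET. Have a nice day dude. Cya!",
    "Welcome to the world of yesterday! Lorem ipsum dolor sit amet. Bye!",
    "Totally unrelated message. Lorem ipsum dolor sit amet. See you.",
]

def word_ngrams(text, n):
    """All contiguous runs of n lowercase words in `text`."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def frequent_phrases(docs, n=4, min_docs=2):
    """Word n-grams occurring in at least `min_docs` distinct documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(word_ngrams(doc, n)))  # set(): count each doc once
    return {p for p, c in doc_freq.items() if c >= min_docs}

phrases = frequent_phrases(mails, n=4, min_docs=2)
print(phrases)
```

Filtering out phrases known to occur in all mails then amounts to dropping n-grams whose document frequency equals the corpus size. At the question's scale (100000 mails), hashing the n-grams or building a suffix array would replace the naive Counter.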

Trajectory Clustering: Which Clustering Method?

狂风中的少年 submitted on 2019-11-30 10:49:35
Question: As a newbie in machine learning, I have a set of trajectories that may be of different lengths. I wish to cluster them, because some of them are actually the same path and only SEEM different due to noise. In addition, not all of them are of the same length. So although Trajectory A may not be the same as Trajectory B, it may be part of Trajectory B. I wish to preserve this property after the clustering as well. I have only a little knowledge of K-means Clustering and Fuzzy N-means
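K-means needs fixed-length vectors, so a common alternative for variable-length trajectories is to define a pairwise distance that tolerates length differences, such as dynamic time warping, and feed the resulting distance matrix to a hierarchical or spectral clusterer. A minimal DTW sketch on toy trajectories (the three trajectories below are invented for illustration):

```python
import math

def dtw(a, b):
    """Dynamic-time-warping distance between two 2-D point sequences;
    it aligns sequences of different lengths before summing costs."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # skip a point of a
                                 D[i][j - 1],      # skip a point of b
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# Toy trajectories: A is a noisy prefix of B, C goes somewhere else entirely.
A = [(0, 0), (1, 0.1), (2, 0)]
B = [(0, 0), (1, 0), (2, 0.1), (3, 0), (4, 0)]
C = [(0, 5), (1, 6), (2, 7)]

print(dtw(A, B), dtw(A, C))
```

Because DTW stretches one sequence against the other, a trajectory that is a noisy sub-path of a longer one still comes out close, which matches the "A is part of B" requirement; the pairwise DTW matrix can then drive, e.g., agglomerative clustering.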

Can k-means clustering do classification?

試著忘記壹切 submitted on 2019-11-30 08:32:48
I want to know whether the k-means clustering algorithm can do classification. Suppose I have done a simple k-means clustering: I have a lot of data, I run k-means, and get 2 clusters A and B, with centroids computed using Euclidean distance. Cluster A is on the left side, cluster B on the right side. Now, if I have one new data point, what should I do? Run the k-means algorithm again to find out which cluster the new data belongs to? Record the last centroids and use Euclidean distance to decide which cluster the new data belongs to? Some other method? The simplest method of course is
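The second option the questioner lists, keeping the final centroids and assigning each new point to the nearest one, is the standard way to use a fitted k-means model as a classifier; a minimal sketch (the centroid coordinates are hypothetical stand-ins for a finished 2-cluster run):

```python
import math

# Final centroids from a (hypothetical) finished 2-cluster k-means run.
centroids = {"A": (-3.0, 0.0), "B": (3.0, 0.0)}

def assign(point):
    """Classify a new point by its nearest stored centroid; there is no
    need to re-run k-means on the whole data set."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))

print(assign((-2.5, 1.0)))
```

Re-running k-means from scratch for every new point would be wasteful and could even shift the clusters; nearest-centroid assignment is exactly what scikit-learn's `KMeans.predict` does after fitting.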

'Similarity' in Data Mining

安稳与你 submitted on 2019-11-30 07:09:15
In the field of data mining, is there a specific sub-discipline called 'Similarity'? If yes, what does it deal with? Any examples, links, or references would be helpful. Also, being new to the field, I would like the community's opinion on how closely related data mining and artificial intelligence are. Are they synonyms? Is one a subset of the other? Thanks in advance for sharing your knowledge. Yin Zhu: In the field of data mining, is there a specific sub-discipline called 'Similarity'? Yes. There is a specific subfield of data mining and machine learning called metric learning, which aims to

How would one use Kernel Density Estimation as a 1D clustering method in scikit learn?

余生长醉 submitted on 2019-11-30 06:20:25
Question: I need to cluster a simple univariate data set into a preset number of clusters. Technically it would be closer to binning or sorting the data since it is only 1D, but my boss is calling it clustering, so I'm going to stick with that name. The current method used by the system I'm on is k-means, but that seems like overkill. Is there a better way of performing this task? Answers to some other posts mention KDE (Kernel Density Estimation), but that is a density estimation method; how
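The usual way to turn KDE into a 1-D clusterer is to estimate the density, find its local minima, and use those minima as split points between clusters; a sketch with scikit-learn (the data values and bandwidth below are illustrative, and in practice the bandwidth would be tuned, e.g. by cross-validation):

```python
import numpy as np
from scipy.signal import argrelextrema
from sklearn.neighbors import KernelDensity

# Illustrative 1-D data with three visible groups.
x = np.array([1.0, 1.1, 1.3, 2.9, 3.0, 3.1, 3.2, 7.8, 8.0, 8.1])[:, None]

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(x)
grid = np.linspace(x.min() - 1, x.max() + 1, 500)[:, None]
density = np.exp(kde.score_samples(grid))  # score_samples returns log-density

# Local minima of the density are natural split points between clusters.
minima = grid[argrelextrema(density, np.less)[0], 0]
labels = np.searchsorted(minima, x[:, 0])  # bin each point by the splits
print(minima, labels)
```

Unlike k-means, this does not force a preset cluster count; the number of clusters falls out of the bandwidth, so if a fixed number is required one can adjust the bandwidth until the density has exactly that many modes.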