data-mining | 易学教程

How many principal components to take?

Question: I know that principal component analysis performs an SVD on a matrix and then generates an eigenvalue matrix. To select the principal components, we take only the first few eigenvalues. How do we decide how many eigenvalues to take from the eigenvalue matrix?

Answer 1: To decide how many eigenvalues/eigenvectors to keep, you should consider your reason for doing PCA in the first place. Are you doing it for reducing storage requirements, to reduce dimensionality for a
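The answer excerpt is cut off before it names a concrete rule. One common heuristic, shown here as a sketch and not as the answer's own method, is to keep the smallest number of components whose eigenvalues account for a chosen share of the total variance. The function name and the 0.95 default are assumptions made for illustration.

```python
def components_for_variance(eigenvalues, target=0.95):
    """Return the smallest k whose top-k eigenvalues reach `target`
    as a fraction of the total variance (hypothetical helper)."""
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)


# Illustrative eigenvalues only; real ones come from your own decomposition.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
k = components_for_variance(eigenvalues, target=0.9)
```

Which threshold is appropriate depends on the goal the answer mentions (storage, dimensionality reduction, and so on), so the target is best treated as a tunable parameter.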

How do I extract keywords used in text? [closed]

How do I data mine a pile of text to get keywords by usage ("Jacob Smith" or "fence")? Is there software that already does this, even semi-automatically? If it can filter out simple words like "the", "and", and "or", then I could get to the topics quicker.

The general algorithm is going to go like this:

- Obtain text
- Strip punctuation, special characters, etc.
- Strip "simple" words
- Split on spaces
- Loop over the split text
- Add the word to an array/hash table/etc. if it doesn't exist; if it does, increment the counter for that word

The end result is a frequency count of all words in the text. You can
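The steps listed in the answer can be sketched in a few lines of Python. This is a minimal illustration, assuming a small hand-picked stop-word set (`STOP_WORDS` is a made-up list, not a standard one) and counting single words only, so a two-word name like "Jacob Smith" is counted as two separate words.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "and", "or", "a", "an", "of", "to", "in", "is", "it"}


def keyword_counts(text, stop_words=STOP_WORDS):
    """Count word frequencies after removing punctuation and stop words."""
    # Replace punctuation and special characters with spaces
    # (note: this also splits contractions such as "don't").
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    # Split on whitespace, then drop the "simple" words.
    words = [w for w in cleaned.split() if w not in stop_words]
    # Counter does the "add or increment" step of the loop.
    return Counter(words)


sample = "The fence and the gate: Jacob Smith painted the fence."
counts = keyword_counts(sample)
top = counts.most_common(1)  # most frequent remaining word
```

Here the stop words are removed after splitting, which is easier than stripping them from the raw string and gives the same counts.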

How to find the optimal K in the K-Means algorithm [duplicate]

Possible Duplicate: How do I determine k when using k-means clustering?

How can I choose K initially if I do not know anything about the data? Can someone help me choose K? Thanks, Navin

The basic idea is to evaluate a cluster score on sample data; usually it is based on the distance inside clusters and the distance between clusters. The better this measure, the better the clustering, and based on this measure you can select the best clustering parameters. One such metric can be found here: http://alias-i.com/lingpipe/docs/api/com/aliasi/cluster/ClusterScore.html

Felix Kling: Seriously, what do you want to know? Do you want us to
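The cluster-scoring idea from the answer can be illustrated with a small ratio of between-cluster distance to within-cluster distance. This is a hand-rolled sketch, not the LingPipe `ClusterScore` class linked above; the function names and the choice of a simple ratio are assumptions.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def within_distance(clusters):
    """Mean distance from each point to the centroid of its own cluster."""
    total, count = 0.0, 0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(math.dist(p, c) for p in cluster)
        count += len(cluster)
    return total / count


def between_distance(clusters):
    """Mean distance between all pairs of cluster centroids."""
    centroids = [centroid(c) for c in clusters]
    pairs = list(combinations(centroids, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def separation_score(clusters):
    """Higher is better: far-apart clusters made of tightly packed points."""
    if len(clusters) < 2:
        raise ValueError("need at least two clusters to compare")
    within = within_distance(clusters)
    if within == 0:
        return math.inf  # every point sits exactly on its centroid
    return between_distance(clusters) / within


# Two candidate groupings of the same four points.
good = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
bad = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]
```

To pick K, one would run the clustering for several candidate values and compare the scores. A plain ratio like this tends to favor many tiny clusters, so comparisons across very different K need some care.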

Running DBSCAN in ELKI

Question: I am trying to cluster some geospatial data, and I previously tried the WEKA library. I found this benchmark and decided to try ELKI. Despite the advice not to use ELKI as a Java library (which is supposed to be less maintained than the UI), I incorporated it into my application, and I can say that I am quite happy with the results. The structures that it uses to store data are far more efficient than the ones used by Weka, and the fact that it has the option of using a spatial index is

How does clustering (especially String clustering) work?

Question: I have heard about clustering as a way to group similar data. I want to know how it works in the specific case of strings. I have a table with more than 100,000 different words. I want to identify the same word with some differences (e.g., house, house!!, hooouse, HoUse, @house, "house", etc.). What is needed to identify the similarity and group each word into a cluster? Which algorithm is recommended for this?

Answer 1: To understand what clustering is, imagine a geographical map. You can see many distinct
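For the specific variants listed in the question, a simple normalization key is often enough before any real clustering is attempted. The sketch below is one possible approach and not the one the truncated answer goes on to describe; the function names and the rule of collapsing runs of three or more repeated characters are assumptions.

```python
import re
from collections import defaultdict


def normalize(word):
    """Map spelling variants to a shared key (hypothetical rule set)."""
    key = word.lower()
    # Drop everything except ASCII letters and digits (assumes English text).
    key = re.sub(r"[^a-z0-9]", "", key)
    # Collapse runs of 3+ identical characters: "hooouse" -> "house".
    key = re.sub(r"(.)\1{2,}", r"\1", key)
    return key


def group_words(words):
    """Group the original spellings under their normalized key."""
    groups = defaultdict(list)
    for word in words:
        groups[normalize(word)].append(word)
    return dict(groups)


variants = ["house", "house!!", "hooouse", "HoUse", "@house", '"house"', "car"]
groups = group_words(variants)
```

Variants that differ by real misspellings (for example "huose") would not share a key; those need a string distance such as edit distance on top of this step.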

Kmeans without knowing the number of clusters? [duplicate]

This question already has an answer here: How do I determine k when using k-means clustering? 17 answers I am attempting to apply k-means on a set of high-dimensional data points (about 50 dimensions) and was wondering if there are any implementations that find the optimal number of clusters. I remember reading somewhere that the way an algorithm generally does this is such that the inter-cluster distance is maximized and intra-cluster distance is minimized but I don't remember where I saw that. It would be great if someone can point me to any resources that discuss this. I am using SciPy for

Why does one hot encoding improve machine learning performance?

I have noticed that when One Hot encoding is used on a particular data set (a matrix) and used as training data for learning algorithms, it gives significantly better results with respect to prediction accuracy, compared to using the original matrix itself as training data. How does this performance increase happen?

Many learning algorithms either learn a single weight per feature, or they use distances between samples. The former is the case for linear models such as logistic regression, which are easy to explain. Suppose you have a dataset having only a single categorical feature
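The point about distances can be made concrete with a small sketch. The category values below are made up for illustration and the `one_hot` helper is hypothetical, written from scratch and not taken from any library.

```python
import math


def one_hot(values):
    """Return (sorted categories, one 0/1 row per input value)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one position is "hot"
        encoded.append(row)
    return categories, encoded


# Hypothetical single categorical feature.
nationality = ["FR", "UK", "US", "UK"]
categories, encoded = one_hot(nationality)

# Integer codes impose an order: FR=0, UK=1, US=2.
integer_codes = {c: i for i, c in enumerate(categories)}
gap_fr_uk = abs(integer_codes["FR"] - integer_codes["UK"])  # 1
gap_fr_us = abs(integer_codes["FR"] - integer_codes["US"])  # 2

# One-hot rows put every pair of distinct categories equally far apart.
dist_fr_uk = math.dist(encoded[0], encoded[1])
dist_fr_us = math.dist(encoded[0], encoded[2])
```

With integer codes, a distance-based or single-weight model treats "US" as twice as far from "FR" as "UK" is, which is an artifact of the numbering. With one-hot rows, each category gets its own feature and all distinct categories are equidistant.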

clustering very large dataset in R

Question: I have a dataset consisting of 70,000 numeric values representing distances ranging from 0 to 50, and I want to cluster these numbers. However, if I try the classical clustering approach, I would have to build a 70,000 × 70,000 distance matrix representing the distances between every pair of numbers in my dataset, which won't fit in memory. So I was wondering if there is any smart way to solve this problem without the need to do stratified sampling? I also tried bigmemory and big
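A quick back-of-the-envelope calculation shows why the 70,000 by 70,000 matrix does not fit in memory. The sketch assumes 8-byte double-precision values; the helper name is made up for illustration.

```python
def distance_matrix_bytes(n, bytes_per_value=8):
    """Return (full matrix bytes, lower-triangle-only bytes) for n points.

    Assumption: each distance is stored as an 8-byte double.
    """
    full = n * n * bytes_per_value
    # Only the n*(n-1)/2 distinct pairs, as a lower-triangle layout stores.
    lower = n * (n - 1) // 2 * bytes_per_value
    return full, lower


full_bytes, lower_bytes = distance_matrix_bytes(70_000)
full_gb = full_bytes / 1e9    # about 39.2 GB
lower_gb = lower_bytes / 1e9  # about 19.6 GB
```

Even storing only the distinct pairs, which is how R's `dist` objects are laid out, the matrix needs roughly 19.6 GB, so approaches that avoid materializing all pairwise distances are the practical route.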

How to find out if a sentence is a question (interrogative)?

Question: Is there an open-source Java library or algorithm for determining whether a particular piece of text is a question? I am working on a question answering system that needs to analyze whether the text entered by the user is a question. I think the problem can probably be solved by using open-source NLP libraries, but it's obviously more complicated than simple part-of-speech tagging. So if someone can instead describe an algorithm for it that uses an existing open-source NLP library, that would be good too. Also let

java framework for image pattern recognition?

Question: I'm looking for a Java framework to help with some data mining specific to images. We have a set of historical images that I would like to categorize and classify. I was hoping to find something like weka http://www.cs.waikato.ac.nz/ml/weka/ or Marsyas http://marsyas.sness.net but more specific to sifting through image data to find patterns. Any suggestions?

Answer 1: What about using the OpenCV library for Processing? Technically, Processing is not Java, but it runs on the JVM and shouldn't be