data-mining

Weka simple K-means clustering assignments

∥☆過路亽.° submitted on 2019-12-03 07:24:27
Question: I have what feels like a simple problem, but I can't seem to find an answer. I'm fairly new to Weka, though I have done some research on this (at least read through the first couple of pages of Google results) and come up dry. I am using Weka to run clustering with SimpleKMeans. In the results list I have no problem visualizing my output ("Visualize cluster assignments"), and it is clear both from my understanding of the K-means algorithm and from Weka's output that each of my …
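For readers outside the Weka GUI: the per-instance assignments that "Visualize cluster assignments" shows are just one cluster index per row. A minimal scikit-learn analogue (the data here is hypothetical, and this is a sketch of the same idea, not Weka's own API):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data: two well-separated blobs
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
              [8.0, 8.1], [7.9, 8.0], [8.1, 7.9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# labels_ holds one cluster index per instance -- the same
# information Weka shows under "Visualize cluster assignments"
for row, label in zip(X, km.labels_):
    print(row, "-> cluster", label)
```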

Historical weather data from NOAA

被刻印的时光 ゝ submitted on 2019-12-03 07:15:24
I am working on a data mining project and would like to gather historical weather data. I can get historical data through the web interface NOAA provides at http://www.ncdc.noaa.gov/cdo-web/search, but I would like to access this data programmatically through an API. From what I have read on StackOverflow, this data is supposed to be public domain, yet the only places I have found it are non-free services like Wunderground. How can I access this data for free? For a list of all service APIs provided by the National Climatic Data Center: http://www.ncdc.noaa…
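NOAA's CDO Web Services v2 API is a free option (it requires a token requested from the NCDC site). The sketch below only builds the request; the endpoint path, `GHCND` dataset id, and station id are taken as assumptions to be checked against the current API documentation:

```python
from urllib.parse import urlencode

# Base endpoint of NOAA's CDO Web Services v2 API (assumed from the
# public docs -- verify before relying on it).
BASE = "https://www.ncdc.noaa.gov/cdo-web/api/v2/data"

def build_request(token, station, start, end, datasetid="GHCND"):
    """Return (url, headers) for a daily-summaries query."""
    params = urlencode({
        "datasetid": datasetid,   # GHCND = daily summaries
        "stationid": station,
        "startdate": start,
        "enddate": end,
        "limit": 1000,
    })
    # The API authenticates via a "token" request header.
    return BASE + "?" + params, {"token": token}

url, headers = build_request("MY_TOKEN", "GHCND:USW00014732",
                             "2010-01-01", "2010-01-31")
# Fetch with e.g. requests.get(url, headers=headers).json()
```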

PCA For categorical features?

筅森魡賤 submitted on 2019-12-03 06:57:42
Question: My understanding was that PCA can be performed only on continuous features. But while trying to understand the difference between one-hot encoding and label encoding, I came across a post at the following link: When to use One Hot Encoding vs LabelEncoder vs DictVectorizor? It states that one-hot encoding followed by PCA is a very good method, which basically means PCA is applied to categorical features. I'm confused; please advise. Answer 1: I disagree with the others. While …
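The pipeline the question describes (one-hot, then PCA) is mechanically straightforward; whether PCA on 0/1 indicators is statistically sound is exactly what the answers dispute, and MCA/CATPCA are the purpose-built alternatives. A minimal sketch with hypothetical data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical data: colour and size for four items
X = np.array([["red", "small"], ["blue", "large"],
              ["red", "large"], ["green", "small"]])

# One-hot encode: 3 colours + 2 sizes -> 5 binary columns
X_hot = OneHotEncoder().fit_transform(X).toarray()

# Project the indicator matrix onto 2 principal components
X_2d = PCA(n_components=2).fit_transform(X_hot)
print(X_2d.shape)  # (4, 2)
```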

What is Java Data Mining, JDM?

為{幸葍}努か submitted on 2019-12-03 06:55:27
I am looking at JDM. Is this simply an API for interacting with other tools that do the actual data mining, or is it a set of packages containing the actual mining algorithms? Ah, the wonders of the interweb: Java Data Mining (JDM) is a standard Java API for developing data mining applications and tools. JDM defines an object model and Java API for data mining objects and processes, enabling applications to integrate data mining technology into predictive analytics applications and tools. The JDM 1.0 standard was developed under the Java Community Process as JSR 73. As of …

Clustering Algorithm with discrete and continuous attributes?

三世轮回 submitted on 2019-12-03 06:41:30
Does anyone know a good algorithm for performing clustering on both discrete and continuous attributes? I am working on the problem of identifying groups of similar customers, and each customer has both discrete and continuous attributes (think type of customer, amount of revenue generated by that customer, geographic location, etc.). Traditionally, algorithms like K-means or EM work on continuous attributes; what if we have a mix of continuous and discrete ones? If I remember correctly, the COBWEB algorithm can work with discrete attributes, and you can also use different 'tricks' to …
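One common "trick" is a Gower-style distance that handles each attribute by type: absolute difference on range-scaled numeric attributes, 0/1 mismatch on discrete ones. This is a minimal sketch with hypothetical customer records, not a library implementation:

```python
def gower_distance(a, b, is_numeric, ranges):
    """Distance in [0, 1] between two records with mixed attributes.

    is_numeric[i] -- True if attribute i is continuous
    ranges[i]     -- max - min of attribute i over the data (numeric only)
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if is_numeric[i]:
            # Range-normalized absolute difference for continuous attrs
            total += abs(x - y) / ranges[i] if ranges[i] else 0.0
        else:
            # Simple mismatch indicator for discrete attrs
            total += 0.0 if x == y else 1.0
    return total / len(a)

# Hypothetical customers: (type, revenue, region)
c1 = ("retail", 1000.0, "east")
c2 = ("retail", 3000.0, "west")
is_num = (False, True, False)
ranges = (None, 4000.0, None)   # revenue spans 4000 in this toy data
print(gower_distance(c1, c2, is_num, ranges))  # 0.5
```

Feeding this distance into a hierarchical or medoid-based clusterer (neither of which needs centroids) then handles the mixed-type case.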

How would you group/cluster these three areas in arrays in python?

一个人想着一个人 submitted on 2019-12-03 05:55:20
Question: So you have an array

1 2 3 60 70 80 100 220 230 250

For a better understanding: how would you group/cluster the three areas into arrays in Python (v2.6), so that in this case you get three arrays containing [1 2 3], [60 70 80 100], and [220 230 250]? Background: the y-axis is frequency and the x-axis is number. These numbers are the ten highest amplitudes, represented by their frequencies. I want to create three discrete numbers from them for pattern recognition. There could be many more points, but all of them …
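Since the values are already sorted, the simplest approach needs no clustering library at all: start a new group whenever the gap to the previous value exceeds a threshold. The threshold of 30 below is an assumption tuned to this example:

```python
def group_by_gap(values, max_gap=30):
    """Split a sorted list into runs whenever consecutive values
    differ by more than max_gap (an assumed, data-dependent threshold)."""
    groups = [[values[0]]]
    for v in values[1:]:
        if v - groups[-1][-1] <= max_gap:
            groups[-1].append(v)   # close enough: extend current group
        else:
            groups.append([v])     # large gap: start a new group
    return groups

print(group_by_gap([1, 2, 3, 60, 70, 80, 100, 220, 230, 250]))
# [[1, 2, 3], [60, 70, 80, 100], [220, 230, 250]]
```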

What does dimensionality reduction mean?

China☆狼群 submitted on 2019-12-03 05:24:17
What does dimensionality reduction mean exactly? I searched for its meaning and only found that it means transforming raw data into a more useful form. So what is the benefit of having data in a useful form; I mean, how can I use it in a practical application? Dimensionality reduction is about converting data of very high dimensionality into data of much lower dimensionality, such that each of the lower dimensions conveys much more information. This is typically done while solving machine learning problems, to get better features for a classification or regression task. Here's a …
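A concrete way to see "fewer dimensions, almost the same information": hypothetical 3-D points that actually lie near a 2-D plane, reduced with PCA. Two components end up carrying nearly all the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 points in 3-D whose third coordinate is (almost) determined by
# the first two, so the data is intrinsically ~2-dimensional.
rng = np.random.RandomState(0)
xy = rng.rand(100, 2)
z = xy[:, 0] + xy[:, 1] + 0.01 * rng.randn(100)   # small noise
X = np.column_stack([xy, z])

pca = PCA(n_components=2)
X_low = pca.fit_transform(X)   # 100 x 2: the "more useful form"

# Fraction of the original variance the 2 components retain (close to 1)
print(sum(pca.explained_variance_ratio_))
```

The 2-D coordinates can then be fed to a classifier or regressor in place of the raw 3-D data.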

Hadoop Machine learning/Data mining project idea? [closed]

瘦欲@ submitted on 2019-12-03 05:11:13
Question: I am a graduate CS student (data mining and machine learning) and have good exposure to core Java (>4 years). I have read up a bunch …

What FFT descriptors should be used as feature to implement classification or clustering algorithm?

我的未来我决定 submitted on 2019-12-03 04:36:30
Question: I have some sampled geographical trajectories to analyze, and I computed a histogram of the data in the spatial and temporal dimensions, which yields a time-domain feature for each spatial element. I want to apply a discrete FFT to transform the time-domain feature into a frequency-domain feature (which I think may be more robust), and then run some classification or clustering algorithm. But I'm not sure which descriptor to use as the frequency-domain feature, since there are …
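A common choice of descriptor is the magnitudes of the first few non-DC FFT coefficients, normalized by the DC term so the feature is invariant to overall scale. A sketch (the 24-bin "daily" histogram and the choice of k=4 coefficients are assumptions):

```python
import numpy as np

def fft_descriptors(signal, k=4):
    """Magnitudes of the first k non-DC FFT coefficients, normalized
    by the DC term. k=4 is an assumed descriptor length."""
    spectrum = np.abs(np.fft.rfft(signal))
    dc = spectrum[0] if spectrum[0] else 1.0
    return spectrum[1:k + 1] / dc

# Hypothetical time-domain histogram for one spatial cell:
# a baseline of 5 with one sinusoidal cycle over 24 time bins.
t = np.arange(24)
feature = 5 + 2 * np.sin(2 * np.pi * t / 24)
print(fft_descriptors(feature))   # energy concentrates in bin 1
```

Magnitude-only descriptors discard phase, which makes them invariant to circular shifts of the signal; whether that is desirable depends on the clustering task.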

Use feedback or reinforcement in machine learning?

こ雲淡風輕ζ submitted on 2019-12-03 03:54:03
Question: I am trying to solve a classification problem. Many classical approaches seem to follow a similar paradigm: train a model on some training set, then use it to predict the class labels of new instances. I am wondering whether it is possible to introduce a feedback mechanism into this paradigm. In control theory, introducing a feedback loop is an effective way to improve system performance. Currently, a straightforward approach on my mind is: first we start with an initial set of …
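One standard way to realize such a feedback loop is incremental (online) learning: train on the initial set, predict on new instances, and fold corrected labels back in. A sketch with scikit-learn's `partial_fit` on hypothetical data (the "oracle" label stands in for user feedback):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical initial training set: two well-separated 2-D blobs
rng = np.random.RandomState(0)
X0 = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y0 = np.array([1] * 50 + [0] * 50)

clf = SGDClassifier(random_state=0)
for _ in range(5):                      # a few passes over the initial set
    clf.partial_fit(X0, y0, classes=np.array([0, 1]))

# Feedback step: a new instance arrives, the model predicts, a user
# corrects the label, and the correction becomes one more update.
x_new = np.array([[1.5, 1.8]])
print("model predicted:", clf.predict(x_new)[0])
true_label = np.array([1])              # feedback from the user
clf.partial_fit(x_new, true_label)      # model updated in place
```

Repeating the predict-correct-update cycle is essentially the feedback loop the question asks about; reinforcement learning generalizes it to delayed, scalar rewards instead of corrected labels.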