data-mining | 易学教程

Newbie: where to start given a problem to predict future success or not

Question: We have a production web-based product that lets users make predictions about the future value (or demand) of goods. The historical data contains about 100k examples, and each example has about 5 parameters. Consider a class of data called a prediction:

    prediction {
        id: int
        predictor: int
        predictionDate: date
        predictedProductId: int
        predictedDirection: byte (0 for decrease, 1 for increase)
        valueAtPrediciton: float
    }

and a paired result class that measures the result of the prediction:

retrieve information from a url

I want to write a program that retrieves some information from a URL. For example, given the URL below, from LibraryThing, how can I retrieve all the words below the "TAGS" tab, such as Black Library, fantasy, Thanquol & Boneripper, Thanquol and Bone Ripper, Warhammer? I am thinking of using Java and designing a data mining wrapper, but I am not sure how to start. Can anyone give me some advice? EDIT: You gave me excellent help, but I want to ask something else. For every tag, we can see how many times it has been used when we press the "number" button. How can I retrieve that number as well? You
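
The tag extraction asked about above could be sketched as follows. The question mentions Java; this sketch uses Python's standard library instead, and it does not reflect the real LibraryThing markup. It assumes, purely for illustration, that the tags are `<a>` elements inside a `<div class="tags">` container; the `TagCollector` class, the `extract_tags` helper, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect the text of <a> elements nested inside <div class="tags">.

    The container class name "tags" is an assumption made for this sketch.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0        # <div> nesting depth inside the container (0 = outside)
        self.in_link = False  # True while inside an <a> within the container
        self.tags = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif "tags" in (attrs.get("class") or "").split():
                self.depth = 1
        elif tag == "a" and self.depth > 0:
            self.in_link = True
            self.tags.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.tags[-1] += data


def extract_tags(html):
    """Return the stripped, non-empty link texts found in the tag container."""
    parser = TagCollector()
    parser.feed(html)
    return [text.strip() for text in parser.tags if text.strip()]


# Hypothetical page fragment; the real page layout may differ.
SAMPLE = """
<div class="header"><a href="/">Home</a></div>
<div class="tags">
  <a href="/tag/Black+Library">Black Library</a>
  <a href="/tag/fantasy">fantasy</a>
  <a href="/tag/Warhammer">Warhammer</a>
</div>
"""

print(extract_tags(SAMPLE))
```

For a live page, the HTML string would first be downloaded (for example with `urllib.request.urlopen`), and the container test in `handle_starttag` would need to be adapted to the markup actually served.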

How to score a linear model using PMML file and Augustus on Python

Question: I am new to Python, PMML, and Augustus, so this is something of a beginner's question. I have a PMML file that I want to use for scoring after every new iteration of data. I have to use Python with Augustus only to complete this exercise. I have read various articles, some of which are worth mentioning because they are good (http://augustusdocs.appspot.com/docs/v06/model_abstraction/augustus_and_pmml.html , http://augustus.googlecode.com/svn-history/r191/trunk/augustus/modellib/regression/producer/Producer.py). I have read

Python multinomial logit with statsmodels module: Change base value of mlogit regression

Question: I have a small problem that I am stuck on. I am building a multinomial logit model with Python statsmodels and wish to reproduce an example given in a textbook. So far so good, but I am struggling to set a different target value as the base value for the regression. Can somebody help?

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    #import data
    df = pd.read_excel('C:/.../diabetes.xlsx')

    #split the data in dependent and independent

how to get all terminal nodes - weight & response prediction 'ctree' in r

Question: Here's what I can use to list the weights for all terminal nodes, but how can I add some code to get the response prediction as well as the weight for each terminal node ID? Say I want my output to look like this -- Here is what I have so far to get the weights: nodes(airct, unique(where(airct))) Thank you. Answer 1: The BinaryTree is a big S4 object, so it is sometimes difficult to extract the data. But the plot method for BinaryTree objects has an optional panel function of the form function(node)

Find HEX patterns and number of occurrences

Question: I'd like to find patterns in a HEX file I have and sort them by number of occurrences. I am not looking for any specific pattern; I just want to gather statistics on the patterns that occur and sort them.
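
One way to approach the counting described above is to slide a fixed-width window over the bytes and tally every sequence seen. This is a minimal sketch, assuming the "HEX file" is raw binary data and that a "pattern" means a run of `length` consecutive bytes; the function names `count_byte_patterns` and `top_patterns` and the sample bytes are made up for illustration.

```python
from collections import Counter


def count_byte_patterns(data: bytes, length: int = 2) -> Counter:
    """Count every overlapping run of `length` consecutive bytes in `data`."""
    return Counter(data[i:i + length] for i in range(len(data) - length + 1))


def top_patterns(data: bytes, length: int = 2, limit: int = 10):
    """Return up to `limit` (hex string, count) pairs, most frequent first.

    Ties are broken by byte value so the ordering is deterministic.
    """
    counts = count_byte_patterns(data, length)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(pattern.hex(), count) for pattern, count in ranked[:limit]]


# Illustrative input only; a real file would be read with open(path, "rb").read().
sample = bytes.fromhex("deadbeefdeadbeef00dead")
print(top_patterns(sample, length=2, limit=3))
# [('dead', 3), ('adbe', 2), ('beef', 2)]
```

Running the same function for several values of `length` gives statistics for patterns of different sizes. For very large files, reading in chunks would be needed, and that is not handled here.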

Algorithm for clustering people with similar interests

Question: I want to cluster people into groups based on their interests. For example, people who like machine learning and graphs may be placed in one group, and people interested in mathematics and economics may be placed in a different group. The algorithm should be able to decide which people have the most closely matching interests and create clusters accordingly. It should also be able to report the other people in the group in which a particular person is placed. Answer 1: This does
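
As a rough illustration of the grouping the question describes (not the approach from the truncated answer), the sketch below treats each person's interests as a set, measures overlap with Jaccard similarity, and merges any two people whose similarity reaches a threshold. The threshold of 0.5, the function names, and the sample people are all assumptions chosen for the example.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_interests(interests: dict, threshold: float = 0.5) -> list:
    """Group people whose interest sets are linked by similarity >= threshold.

    Uses union-find, so clusters are the connected components of the
    "similar enough" graph (single-linkage behaviour).
    """
    people = sorted(interests)
    parent = {person: person for person in people}

    def find(person):
        # Follow parent links to the cluster representative, halving paths.
        while parent[person] != person:
            parent[person] = parent[parent[person]]
            person = parent[person]
        return person

    for i, first in enumerate(people):
        for second in people[i + 1:]:
            if jaccard(interests[first], interests[second]) >= threshold:
                parent[find(first)] = find(second)

    groups = {}
    for person in people:
        groups.setdefault(find(person), []).append(person)
    return sorted(groups.values())


def groupmates(person: str, clusters: list) -> list:
    """Return the other members of the cluster containing `person`."""
    for group in clusters:
        if person in group:
            return [other for other in group if other != person]
    return []


# Hypothetical sample data.
people = {
    "alice": {"machine learning", "graphs"},
    "bob": {"machine learning", "graphs", "statistics"},
    "carol": {"mathematics", "economics"},
    "dave": {"economics", "mathematics"},
}
clusters = cluster_by_interests(people)
print(clusters)                       # [['alice', 'bob'], ['carol', 'dave']]
print(groupmates("alice", clusters))  # ['bob']
```

The pairwise comparison is quadratic in the number of people, which is fine for small groups. Larger data sets would likely call for a dedicated clustering library.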

Web mining - classification algorithms

Question: My senior project is determining the dominant category of a web page. I crawled DMOZ, and now I am trying to build an ARFF file. After that I will use some feature extraction methods and classification algorithms. Do you know which feature extraction method performs well with any classification algorithm for web mining? Answer 1: uClassify uses Bayesian Networks and claims to be able to categorize web pages. uClassify is a free web service where you can easily create your own text classifiers. Examples: Spam

how to write output from rapidminer to a txt file?

I am using RapidMiner 5.3. I took a small document containing about three English sentences, tokenized it, and filtered the tokens by word length. I want to write the output to a different Word document. I tried the Write Document utility, but it does not work: it simply writes the original document to the new one. However, when I write the output to the console, I get the expected result, so something seems to be wrong with the Write Document utility. Here is my process: READ DOCUMENT --> TOKENIZE --> FILTER TOKENS --> WRITE DOCUMENT Try the following Cut Document (with (\S+)

Principal Component Analysis on Weka

Question: I have just computed PCA on a training set, and Weka returned the new attributes along with how they were selected and computed. Now I want to build a model using these data and then use the model on a test set. Do you know if there is a way to automatically transform the test set to match the new attributes? Answer 1: Do you need the principal components for analysis or just to feed into the classifier? If not, just use the Meta->FilteredClassifier classifier. Set the filter to