data-mining | 易学教程

Weka Clustering Results Differ for Same Settings

Question: I am using Weka to cluster some data and have run into a very odd problem. When I use the normal "Cluster" tool on a data set, I get this result:

Cluster 1: 87 instances
Cluster 2: 88 instances
Cluster 3: 181 instances

This is roughly what I expected from the data, so I consider it a good result. However, I want to add the cluster assignment as a class and save it as a new .arff file, so I am trying to use the "Add Cluster" filter that Weka provides. Now, in this filter, I select

How can I use the rule-based learning algorithms for this example

Question: I have the following data, which I want to use for predictive learning about which features people find attractive in a model when purchasing clothes online.

COLORofCLOTHING  MODELHAIR_COLOR  MODEL_BUILD  SELLER_CATEGORY
Red              Black            Lean         1
Blue             Brown            Lean         5
Black            Blonde           Healthy      10

The goal is to predict whether the clothing will sell well given a set of attributes. However, the seller category can be anything from 1 to 10 (1 being best and 10 being worst). I am not sure how to approach this

Clustering algorithm with different epsilons on different axes

Question: I am looking for a clustering algorithm such as DBSCAN to deal with 3D data, in which it is possible to set different epsilons depending on the axis. For instance, an epsilon of 10 m in the x-y plane and an epsilon of 0.2 m on the z axis. Essentially, I am looking for large but flat clusters. Note: I am an archaeologist; the algorithm will be used to look for potential correlations between objects scattered over large surfaces but in narrow vertical layers.

Answer 1: Solution 1: Scale your data set to
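The answer excerpt is cut off, so what follows is only one possible reading of the scaling idea, not the answerer's code. A minimal sketch, assuming each coordinate is divided by its per-axis epsilon so that a single radius of 1.0 can be used afterwards. The function names, the sample points, and `min_pts=3` are all hypothetical, and the small DBSCAN here is a plain textbook version written for illustration. Note that with Euclidean distance the scaled neighborhood is an ellipsoid in the original coordinates, not a box.

```python
from math import dist  # Python 3.8+

def scale_points(points, eps_per_axis):
    """Divide each coordinate by its axis epsilon (hypothetical helper)."""
    return [tuple(c / e for c, e in zip(p, eps_per_axis)) for p in points]

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, -1 meaning noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Made-up data: two flat layers one metre apart, plus one far-away point.
points = [
    (0, 0, 0.0), (5, 0, 0.05), (0, 6, 0.1), (4, 4, 0.0),
    (0, 0, 1.0), (5, 0, 1.05), (0, 6, 1.1), (4, 4, 1.0),
    (100, 100, 5.0),
]
scaled = scale_points(points, eps_per_axis=(10.0, 10.0, 0.2))
labels = dbscan(scaled, eps=1.0, min_pts=3)
unscaled_labels = dbscan(points, eps=10.0, min_pts=3)
```

On this sample the scaled run keeps the two layers apart, while running the same function on the unscaled points with `eps=10.0` merges them, which is the behaviour the question wants to avoid.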

How to skip 'die' in perl

Question: I am trying to extract data from a website using a Perl API. The input is a list of URIs, and I extract the related information for each URI from the website. If the information for a URI is not present, the program dies, as in the code below:

    my @tags = $c->posts_for(uri =>"$currentURI");
    die "No candidate related articles\n" unless @tags;

I don't want the program to stop if it doesn't get any tags. I want it to skip that particular URI and go on to the next available URI.

Integrating multiple dictionaries in python (big data)

Question: I am working on a research project in big data mining. I have written code that organizes my data into a dictionary. However, the amount of data is so large that my computer runs out of memory while building the dictionary. I need to periodically write my dictionary to disk and create multiple dictionaries this way. I then need to compare the resulting dictionaries, update the keys and values accordingly, and store the whole thing in one big dictionary on disk
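The excerpt does not say how conflicting values should be combined, so the following is only a minimal sketch under stated assumptions: each partial dictionary is pickled to its own file, and the parts are then merged one at a time into a disk-backed `shelve` store. The helper names `spill` and `merge_into_shelf`, the file naming, and the use of addition to combine values are hypothetical. `shelve` requires string keys.

```python
import os
import pickle
import shelve
import tempfile

def spill(partial, directory, index):
    """Write one in-memory dictionary to its own pickle file; return the path."""
    path = os.path.join(directory, f"part_{index}.pkl")
    with open(path, "wb") as fh:
        pickle.dump(partial, fh)
    return path

def merge_into_shelf(part_paths, shelf_path, combine):
    """Merge the spilled parts into one on-disk shelf, one part at a time.

    Only a single part is held in memory at once. `combine` decides what to
    do when a key already exists (an assumption; the question leaves it open).
    """
    with shelve.open(shelf_path) as shelf:
        for path in part_paths:
            with open(path, "rb") as fh:
                partial = pickle.load(fh)
            for key, value in partial.items():
                shelf[key] = combine(shelf[key], value) if key in shelf else value

# Made-up example: two small parts whose values are counts.
with tempfile.TemporaryDirectory() as tmp:
    parts = [
        spill({"a": 1, "b": 2}, tmp, 0),
        spill({"b": 3, "c": 4}, tmp, 1),
    ]
    shelf_path = os.path.join(tmp, "merged")
    merge_into_shelf(parts, shelf_path, combine=lambda old, new: old + new)
    with shelve.open(shelf_path) as shelf:
        merged = dict(shelf)  # loaded fully here only because the demo is tiny
```

For real data the final `dict(shelf)` step would be skipped and the shelf queried key by key, since loading it whole would bring back the original memory problem.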

Sql server and R, data mining

Question: I'm working in Microsoft SQL Server Management Studio 2016, using the feature that lets me embed an R script in the SQL code. My goal is to run an Apriori algorithm procedure that outputs the data in the form I want, i.e. a table with x as the first object and y as the second object. I am stuck here because I think there is some problem in my data. The error is this: A 'R' script error occurred during execution of 'sp_execute_external_script' with HRESULT 0x80004004. An external script error occurred:

how does postgres handle the bit data type?

Question: I have a table with a column vector of type bit(2000). How does the DB engine handle AND and OR operations over these values? Does it simply divide them into 32-bit (or 64-bit) chunks, compare each chunk separately, and concatenate the results at the end? Or does it simply handle them as two strings? My aim is to predict which use case would be faster. I have key-value data (user-item):

userID | itemID
U1     | I1
U1     | Ix
Un     | Ij

For each user I want to calculate a list of

combination of smote and undersampling on weka

Question: According to the paper by Chawla et al. (2002), the best performance when balancing data comes from combining under-sampling with SMOTE. I've tried to apply both under-sampling and SMOTE to my dataset, but I am a bit confused about the attribute for under-sampling. In Weka there is the Resample filter to decrease the majority class. Resample has an attribute biasToUniformClass -- Whether to use bias towards a uniform class. A value of 0 leaves the class distribution as-is, a value of 1 ensures the

facial expression classification using k-means

Question: My method for classifying facial expressions using k-means is:

1. Use OpenCV to detect the face in the image.
2. Use ASM and stasm to get the facial feature points.
3. Calculate the distances between facial features (as shown in the picture). There will be 5 distances.
4. Calculate the centroid for each distance for each facial expression (e.g. for distance D1 there are 7 centroids, one for each expression: 'happy', 'angry', ...).
5. Use 5 k-means, one per distance; each k-means will have as a result the

show volume in each node using ctree , plot in R

Question: Can anyone please show me how to add the volume to each of the nodes, instead of only the final node volume?

    t <- ctree(is_return ~ a + b + c)
    plot(t, type="simple")

How can I modify that plot so that it shows N= on every circle node, not only on the black (final) nodes? Thanks

Answer 1: The idea is to specify a panel function for plotting inner nodes. I generate some data and the tree:

    lls <- data.frame(N = gl(3, 50, labels = c("A", "B", "C")), a = rnorm(150) + rep(c