data-mining

k means clustering algorithm

白昼怎懂夜的黑 submitted on 2019-12-09 13:51:13
Question: I want to perform a k-means clustering analysis on a set of 10 data points, each of which has an array of 4 numeric values associated with it. I'm using the Pearson correlation coefficient as the distance metric. I did the first two steps of the k-means clustering algorithm, which were: 1) Select a set of initial centres of k clusters. [I selected two initial centres at random] 2) Assign each object to the cluster with the closest centre. [I used the Pearson correlation coefficient as the
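The two steps described in the question can be sketched in Python as follows. This is only an illustration under assumptions: the data are random placeholders (the question's 10 points are not shown), the names `pearson_r`, `pearson_distance`, and `assign_to_clusters` are hypothetical, and the coefficient is turned into a distance as d = 1 - r, which is one common convention and may not be the one the asker used.

```python
import math
import random

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def pearson_distance(a, b):
    """Correlation distance d = 1 - r (0 when perfectly correlated, 2 when anti-correlated)."""
    return 1.0 - pearson_r(a, b)

def assign_to_clusters(points, centres):
    """Step 2: give each point the index of the centre with the smallest distance."""
    return [
        min(range(len(centres)), key=lambda j: pearson_distance(p, centres[j]))
        for p in points
    ]

# Placeholder data: 10 points with 4 numeric values each (not from the question).
random.seed(0)
points = [[random.gauss(0, 1) for _ in range(4)] for _ in range(10)]

centres = random.sample(points, 2)            # step 1: two random initial centres
labels = assign_to_clusters(points, centres)  # step 2: nearest-centre assignment
print(labels)
```

Note that a larger correlation means a smaller distance here, so "closest centre" corresponds to the highest r, not the lowest.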

Use Absolute Pearson Correlation as Distance in K-Means Algorithm (MATLAB)

江枫思渺然 submitted on 2019-12-09 13:44:30
Question: I need to do some clustering using a correlation distance, but instead of the built-in 'distance' 'correlation' option, which is defined as d=1-r, I need the absolute Pearson distance. In my application, anti-correlated data should get the same cluster ID. Right now, when using the kmeans() function, I'm getting centroids that are highly anti-correlated, which I would like to avoid by combining them. I'm not that fluent in MATLAB yet and have some problems reading the kmeans function. Would it be
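To make the difference between d = 1 - r and an absolute Pearson distance concrete, here is a small sketch in Python rather than MATLAB. It assumes the absolute distance is defined as d = 1 - |r|, which is a natural reading of the question but is not stated in it; the function names and the example vectors are hypothetical.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def abs_pearson_distance(a, b):
    """Assumed definition d = 1 - |r|: correlated and anti-correlated vectors are both 'close'."""
    return 1.0 - abs(pearson_r(a, b))

def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest absolute Pearson distance to the point."""
    return min(range(len(centroids)),
               key=lambda j: abs_pearson_distance(point, centroids[j]))

# Hypothetical profiles: 'falling' is the exact mirror image of 'rising'.
rising = [1.0, 2.0, 3.0, 4.0]
falling = [4.0, 3.0, 2.0, 1.0]
zigzag = [1.0, 3.0, 1.0, 3.0]
centroids = [rising, zigzag]

print(nearest_centroid(rising, centroids), nearest_centroid(falling, centroids))
```

With this distance the rising and falling profiles land in the same cluster. A full k-means loop would also need a centroid update that does not average mirrored members to zero (for example, flipping the sign of members with negative r before averaging); that step is left out of this sketch.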

How to deal with missing attribute values in C4.5 (J48) decision tree?

你离开我真会死。 submitted on 2019-12-09 12:44:48
Question: What's the best way to handle missing feature attribute values with Weka's C4.5 (J48) decision tree? The problem of missing values occurs during both training and classification. If values are missing from training instances, am I correct in assuming that I place a '?' value for the feature? Suppose that I am able to successfully build the decision tree and then create my own tree code in C++ or Java from Weka's tree structure. At classification time, if I am trying to classify a new
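On the training-data side, Weka's ARFF format does mark a missing value with a single `?`. The sketch below, in Python, writes a tiny ARFF document that way. The relation, attribute names, and instances are invented for illustration, and the helper `to_arff_row` is hypothetical; it does not cover how J48 treats those values internally or how hand-written tree code should handle them at classification time.

```python
def to_arff_row(values):
    """Render one instance as an ARFF data row, writing '?' for missing (None) values."""
    return ",".join("?" if v is None else str(v) for v in values)

# Hypothetical schema and data, not taken from the question.
header = [
    "@relation weather",
    "@attribute outlook {sunny,overcast,rainy}",
    "@attribute temperature numeric",
    "@attribute play {yes,no}",
    "@data",
]
instances = [
    ("sunny", 85, "no"),
    ("overcast", None, "yes"),  # temperature is missing
    (None, 70, "yes"),          # outlook is missing
]

arff_text = "\n".join(header + [to_arff_row(row) for row in instances])
print(arff_text)
```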

Clustering Algorithm with discrete and continuous attributes?

帅比萌擦擦* submitted on 2019-12-09 05:27:55
Question: Does anyone know a good algorithm for clustering on both discrete and continuous attributes? I am working on a problem of identifying a group of similar customers, where each customer has both discrete and continuous attributes (think type of customer, amount of revenue generated by the customer, geographic location, etc.). Traditionally, algorithms like K-means or EM work for continuous attributes; what if we have a mix of continuous and discrete attributes? Answer 1: If I remember
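The truncated answer is not recoverable, so the following is not what it proposed. As one possible illustration of mixing attribute types, here is a Gower-style dissimilarity in Python: numeric attributes contribute a range-normalized absolute difference, and discrete attributes contribute 0 on a match and 1 otherwise. The customer records, the attribute layout, and the name `gower_distance` are all assumptions made for this sketch.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-attribute dissimilarity between two records.

    numeric_ranges maps an attribute index to (max - min) over the dataset;
    every attribute whose index is not in the map is treated as discrete.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            span = numeric_ranges[i]
            total += abs(x - y) / span if span else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)

# Hypothetical customers: (type, revenue, region).
customers = [
    ("retail", 1200.0, "north"),
    ("retail", 1500.0, "north"),
    ("wholesale", 9000.0, "south"),
]
revenues = [c[1] for c in customers]
numeric_ranges = {1: max(revenues) - min(revenues)}

print(gower_distance(customers[0], customers[1], numeric_ranges))
print(gower_distance(customers[0], customers[2], numeric_ranges))
```

A dissimilarity like this can be fed to clustering methods that accept a precomputed distance matrix (for example, hierarchical clustering or k-medoids); plain K-means does not apply directly because it relies on averaging coordinates.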

Dataset for data mining project [closed]

不问归期 submitted on 2019-12-08 15:18:37
Question: Closed. This question is off-topic. It is not currently accepting answers. Closed 6 years ago. I am searching for some datasets in the form of 0s and 1s. I can't find a dataset. I have found some with 10 to 12 records, but I want at least 100 records and 8 different records. This is one I found at this link, but it has very little data: http://searchbusinessanalytics.techtarget.com/feature/Simple-data-mining-examples-and

How To Call WekaSharp Commands From C#

浪尽此生 submitted on 2019-12-08 12:59:43
Question: Thank you all for your help on my F# and C# question; I am really beginning to enjoy the fruits of the learning. I have asked a question like this before, and I know these are at the edge of this forum's purpose, but I think this is useful and would provide some help to all of us F# data miners out there :-). I am using Yin Zhu's WekaSharp for my experiments and am interested in the rates of computation between F# and C#. I have written a snippet based on his example in F# and would

creating k-itemsets from 2-itemsets

血红的双手。 submitted on 2019-12-08 10:23:17
Question: I have written the following code to generate k-element itemsets from 2-element sets. The two-element sets are passed to candidateItemsetGen as clist1 and clist2. public static void candidateItemsetGen(ArrayList<Integer> clist1, ArrayList<Integer> clist2) { for(int i = 0; i < clist1.size(); i++) { for(int j = i+1; j < clist2.size(); j++) { for(int k = 0; k < clist1.size()-2; k++) { int r = clist1.get(k).compareTo(clist2.get(k)); if(r == 0 && clist1.get(k)-1 == clist2.get(k)-1) { **
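The asker's method is cut off, so it cannot be repaired here. For comparison, this is a sketch of the standard Apriori candidate generation (join, then prune) in Python: two sorted k-itemsets that share their first k-1 items are merged into a (k+1)-itemset, which is kept only if all of its k-subsets are frequent. The function name `generate_candidates` and the sample pairs are hypothetical, and this is not a reconstruction of `candidateItemsetGen`.

```python
from itertools import combinations

def generate_candidates(frequent_itemsets):
    """Build (k+1)-item candidates from frequent k-itemsets (Apriori join + prune)."""
    # Normalize every itemset to a sorted tuple and drop duplicates.
    frequent = sorted({tuple(sorted(s)) for s in frequent_itemsets})
    frequent_set = set(frequent)
    candidates = []
    for i in range(len(frequent)):
        for j in range(i + 1, len(frequent)):
            a, b = frequent[i], frequent[j]
            if a[:-1] != b[:-1]:
                break  # sorted order: no later itemset shares this prefix
            candidate = a + (b[-1],)
            # Prune: every k-subset of the candidate must itself be frequent.
            if all(sub in frequent_set
                   for sub in combinations(candidate, len(a))):
                candidates.append(candidate)
    return candidates

# Hypothetical frequent 2-itemsets.
pairs = [(1, 2), (1, 3), (2, 3), (2, 4)]
print(generate_candidates(pairs))
```

Calling the function again on its own output yields the 4-item candidates, and so on, which is how k-itemsets are reached starting from 2-itemsets.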

Web Scraping, data mining, data extraction

一个人想着一个人 submitted on 2019-12-08 06:52:59
Question: I am tasked with creating web scraping software, and I don't even know where to begin. Any help would be appreciated; even just telling me how this data is organized, or what "type" of data layout the website is using, would help, because I would be able to Google that term. http://utilsub.lbs.ubc.ca/ion/default.aspx?dgm=x-pml:/diagrams/ud/Default/7330_FAC-delta_V2.4.1/7330_FAC-delta_V2.4.1-pq.dgm&node=Buildings.Angus_addition&logServerName=QUERYSERVER.UTIL2SUB&logServerHandle=327952

How to find “equivalent” texts?

空扰寡人 submitted on 2019-12-08 06:47:56
Question: I want to find (not generate) 2 text strings such that, after removing all non-letters and upper-casing, one string can be translated to the other by simple substitution. The motivation for this comes from a project I know of that is testing methods for attacking cyphers via probability distributions. I'd like to find a large, coherent plain text that, once encrypted with a simple substitution cypher, can be decrypted to something else that is also coherent. This ends up as 2 parts, find the
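The equivalence test itself (though not the search for coherent pairs) can be sketched in Python. Assuming "simple substitution" means a one-to-one letter mapping, two texts are equivalent when, after normalization, a consistent bijection exists between their letters. The names `normalize`, `substitution_equivalent`, and `letter_pattern` are hypothetical.

```python
def normalize(text):
    """Drop every non-letter (per str.isalpha) and upper-case the rest."""
    return "".join(ch for ch in text if ch.isalpha()).upper()

def substitution_equivalent(text_a, text_b):
    """True if a one-to-one letter substitution turns text_a into text_b."""
    a, b = normalize(text_a), normalize(text_b)
    if len(a) != len(b):
        return False
    forward, backward = {}, {}
    for x, y in zip(a, b):
        # Each letter must map to exactly one letter, in both directions.
        if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
            return False
    return True

def letter_pattern(text):
    """Canonical pattern: each letter is replaced by the order of its first appearance."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, len(first_seen))
                 for ch in normalize(text))

print(substitution_equivalent("noon", "deed"))
print(substitution_equivalent("noon", "door"))
print(letter_pattern("Hello"))
```

Two texts are equivalent exactly when their patterns are equal, so indexing candidate texts by `letter_pattern` lets equivalent pairs be found by lookup instead of pairwise comparison.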

How to get several columns from BigQuery?

情到浓时终转凉″ submitted on 2019-12-08 05:13:18
Question: I am querying the GitHub public dataset on BigQuery. Currently, my best query for what I need looks like the following. SELECT type, created_at, repository_name FROM [githubarchive:github.timeline] WHERE (created_at CONTAINS '2012-') AND repository_owner="twitter" ORDER BY created_at, repository_name; This gives me all the events ("type") from the repository_owner twitter (or any other user) for all the repositories ("repository_name") that this user owns, but in a single column. However,