data-mining

Regarding RandomTree in Weka

一笑奈何 submitted on 2019-12-10 21:09:58
Question: I was playing around with Weka when I noticed a minNum field in the RandomTree configuration. Its description says "The minimum total weight of the instances in a leaf", but I couldn't really understand what that means. I experimented with the value and found that when I increase it, the generated tree gets smaller. I couldn't work out why this happens. Any help or references will be appreciated.

Answer 1: This has to do with the minimum number of
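The shrinking effect described in the question can be reproduced with a toy tree builder. This is a minimal Python sketch, not Weka's RandomTree code: the function count_nodes and its "cut the list in half" split rule are hypothetical, and every instance is assumed to have weight 1, so "total weight in a leaf" is simply the number of instances in it.

```python
def count_nodes(labels, min_num):
    """Count the nodes of a toy tree built over labels ordered by one feature.

    Each split cuts the ordered list in half. A split is only allowed when
    both resulting children would hold at least min_num instances (weight 1
    each), which mimics a minimum-leaf-weight constraint.
    """
    if len(set(labels)) <= 1:
        return 1  # pure node: nothing left to separate
    mid = len(labels) // 2
    left, right = labels[:mid], labels[mid:]
    if len(left) < min_num or len(right) < min_num:
        return 1  # a child would be too light, so this node stays a leaf
    return 1 + count_nodes(left, min_num) + count_nodes(right, min_num)


# Hypothetical data: eight instances with alternating class labels.
labels = [0, 1, 0, 1, 0, 1, 0, 1]
sizes = {m: count_nodes(labels, m) for m in (1, 2, 4, 5)}
print(sizes)
```

Raising min_num forbids more and more splits near the bottom of the tree, so the node count can only stay the same or drop.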

fast computation of Tomek link in R

≡放荡痞女 submitted on 2019-12-10 18:17:10
Question: I want to implement Tomek links for dealing with unbalanced data. This code is used for a binary classification problem, where class 1 is the majority class and class 0 is the minority. X is the input, Y the output. I've written the following code, but I'm looking for a way to speed up the computation. How can I improve my code?

    #########################
    #remove overlapping observation using tomek links
    #given observations i and j belonging to different classes
    #(i,j) is a Tomek link if there
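For reference, a pair (i, j) is a Tomek link when the two observations belong to different classes and each is the other's nearest neighbour. The following is a minimal Python sketch of that definition, not the asker's R code: the function tomek_links, the Euclidean distance, and the sample points are all assumptions made for illustration. It uses a plain O(n²) scan and makes no attempt at the speed-up the question asks about.

```python
import math


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links.

    A pair qualifies when the labels differ and each point is the
    other's nearest neighbour under Euclidean distance.
    """
    n = len(X)
    nearest = []
    for i in range(n):
        best_j, best_d = None, math.inf
        for j in range(n):
            if i == j:
                continue
            d = math.dist(X[i], X[j])
            if d < best_d:
                best_j, best_d = j, d
        nearest.append(best_j)
    return [(i, j) for i, j in enumerate(nearest)
            if j is not None and i < j and nearest[j] == i and y[i] != y[j]]


# Hypothetical data: class 1 is the majority, class 0 the minority.
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0), (9.0, 9.0)]
y = [1, 0, 1, 1, 0]
links = tomek_links(X, y)
print(links)
```

Computing each point's nearest neighbour once and then checking for mutual pairs avoids re-scanning the data for every candidate pair; a common undersampling variant then drops the majority-class member of each link.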

How to get the selected features in GridSearchCV in sklearn in python

只谈情不闲聊 submitted on 2019-12-10 17:03:49
Question: I am using recursive feature elimination with cross-validation (RFECV) as the feature selection technique with GridSearchCV. My code is as follows.

    X = df[my_features_all]
    y = df['gold_standard']
    x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
    k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    clf = RandomForestClassifier(random_state=42, class_weight="balanced")
    rfecv = RFECV(estimator=clf, step=1, cv=k_fold, scoring='roc_auc')
    param_grid = {

Finding a correlation between variable and class variable

别等时光非礼了梦想. submitted on 2019-12-10 11:44:12
Question: I have a dataset which contains 7 numerical attributes and one nominal attribute, which is the class variable. I was wondering how I can find the best attribute for predicting the class attribute. Would finding the attribute with the largest information gain be the solution?

Answer 1: So the problem you are asking about falls under the domain of feature selection, and more broadly, feature engineering. There is a lot of literature online regarding this, and there are definitely a lot of blogs
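The information-gain idea raised in the question can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a method taken from the answer: because the attributes are numerical, each one is scored by the best binary threshold split, and the names entropy, best_info_gain, attr_a and attr_b, as well as the sample data, are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_info_gain(values, labels):
    """Largest information gain over all binary thresholds on one numeric attribute."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels))
    best = 0.0
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold fits between two equal values
        left = [lab for _, lab in pairs[:k]]
        right = [lab for _, lab in pairs[k:]]
        remainder = (len(left) * entropy(left)
                     + len(right) * entropy(right)) / len(pairs)
        best = max(best, base - remainder)
    return best


# Hypothetical data: two numeric attributes and a nominal class.
data = {
    "attr_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "attr_b": [3.0, 1.0, 6.0, 2.0, 5.0, 4.0],
}
classes = ["no", "no", "no", "yes", "yes", "yes"]
gains = {name: best_info_gain(vals, classes) for name, vals in data.items()}
ranked = sorted(gains, key=gains.get, reverse=True)
print(ranked, gains)
```

Ranking attributes one at a time like this ignores interactions between attributes, which is one reason information gain is only a starting point for feature selection.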

How to improve the performance while operating with files in C

时光总嘲笑我的痴心妄想 submitted on 2019-12-10 11:29:27
Question: I have implemented the Naive Bayes algorithm on a large data set of 410k rows. All my records are now classified correctly, but the program takes almost an hour to write the records into the corresponding files. What is the best way to improve the performance of my code? The code below is the piece that writes the 410k records into the corresponding files. Thank you.

    fp=fopen("sales_ok_fraud.txt","r");
    while(fgets(line,80,fp)!=NULL) //Reading each line from file to

retrieve information from a url

倾然丶 夕夏残阳落幕 submitted on 2019-12-10 11:08:50
Question: I want to make a program that will retrieve some information from a URL. For example, given the URL below, from librarything, how can I retrieve all the words below the "TAGS" tab, like Black Library, fantasy, Thanquol & Boneripper, Thanquol and Bone Ripper, Warhammer? I am thinking of using Java and designing a data mining wrapper, but I am not sure how to start. Can anyone give me some advice? EDIT: You gave me excellent help, but I want to ask something else. For every tag we can see how many times

How to plot/visualize a C50 decision tree in R?

匆匆过客 submitted on 2019-12-10 03:04:12
Question: I am using the C50 decision tree algorithm. I am able to build the tree and get the summaries, but cannot figure out how to plot or visualize the tree. My C50 model is called credit_model. In other decision tree packages, I usually use something like plot(credit_model). In rpart it is rpart.plot(credit_model). What is the equivalent for plotting in the C50 algorithm?

Answer 1: Right now, there is none built in. I've been working on an adapter for the partykit package (e.g. as.party) but have not gotten very

What is Big Data & What classifies as Big data? [closed]

一曲冷凌霜 submitted on 2019-12-10 00:08:06
Question: I have gone through a lot of articles, but I don't seem to get a perfectly clear answer on what exactly big data is. On one page I saw "any data which is bigger for your usage, is big data i.e. 100 MB is considered big data for your mailbox but not your hard disc". Whereas

URL path similarity/string similarity algorithm

巧了我就是萌 submitted on 2019-12-09 18:55:51
Question: My problem is that I need to compare URL paths and deduce if they are similar. Below I provide example data to process:

    # GROUP 1
    /robots.txt
    # GROUP 2
    /bot.html
    # GROUP 3
    /phpMyAdmin-2.5.6-rc1/scripts/setup.php
    /phpMyAdmin-2.5.6-rc2/scripts/setup.php
    /phpMyAdmin-2.5.6/scripts/setup.php
    /phpMyAdmin-2.5.7-pl1/scripts/setup.php
    /phpMyAdmin-2.5.7/scripts/setup.php
    /phpMyAdmin-2.6.0-alpha/scripts/setup.php
    /phpMyAdmin-2.6.0-alpha2/scripts/setup.php
    # GROUP 4
    //phpMyAdmin/

I tried Levenshtein
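As a point of reference for the Levenshtein approach mentioned in the question, here is a minimal Python sketch of edit distance with a length-normalised similarity and a greedy grouping pass. The helper names similarity and group_paths, the 0.8 threshold, and the rule of comparing each path against the first member of a group are assumptions made for illustration; they are not from the original post and would need tuning against real data.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]: 1 minus distance divided by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def group_paths(paths, threshold=0.8):
    """Greedily put each path in the first group whose first member is similar enough."""
    groups = []
    for path in paths:
        for group in groups:
            if similarity(path, group[0]) >= threshold:
                group.append(path)
                break
        else:
            groups.append([path])
    return groups


paths = [
    "/robots.txt",
    "/bot.html",
    "/phpMyAdmin-2.5.6-rc1/scripts/setup.php",
    "/phpMyAdmin-2.5.6-rc2/scripts/setup.php",
    "/phpMyAdmin-2.5.6/scripts/setup.php",
]
groups = group_paths(paths)
print(groups)
```

Normalising by the longer string keeps short, unrelated paths such as /robots.txt and /bot.html apart, while long paths that differ only in a version suffix stay together.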

How to analyse a sparse adjacency matrix?

ⅰ亾dé卋堺 submitted on 2019-12-09 18:16:41
Question: I am researching sparse adjacency matrices, where most cells are zeros with some ones here and there. Each relationship between two cells has a polynomial description that can be very long, and analysing them manually is time-consuming. My instructor is suggesting a purely algebraic method in terms of Gröbner bases, but before proceeding I would like to know, from a purely computer science and programming perspective, how to analyse sparse adjacency matrices. Do there exist some data mining tools
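From a programming perspective, a common first step with a sparse adjacency matrix is to store only the nonzero cells and run graph algorithms on that representation. The Python sketch below illustrates the idea and is not an answer from the original thread: it assumes the matrix is symmetric (an undirected graph), and the edge list and the function names to_adjacency_lists and connected_components are hypothetical.

```python
from collections import defaultdict, deque


def to_adjacency_lists(edges):
    """Store only the nonzero cells: map each node to the set of its neighbours."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)  # assumes an undirected graph (symmetric matrix)
    return adj


def connected_components(adj):
    """Group nodes into connected components with breadth-first search."""
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        queue, comp = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nb in sorted(adj[node]):
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(sorted(comp))
    return components


# Hypothetical nonzero cells (row, column) of a sparse adjacency matrix.
edges = [(0, 1), (1, 2), (5, 6)]
adj = to_adjacency_lists(edges)
components = connected_components(adj)
degrees = {node: len(nbs) for node, nbs in adj.items()}
print(components, degrees)
```

Memory and running time then scale with the number of ones rather than with the full matrix size, and quantities such as node degrees or connected components give a quick structural summary before any algebraic analysis.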