data-mining | 易学教程

Regarding RandomTree in Weka

Question: I was playing around with Weka when I noticed a minNum field in the RandomTree configuration. Its description says "The minimum total weight of the instances in a leaf", but I couldn't understand what that means. When I increase the number, the generated tree gets smaller, and I can't see why this happens. Any help or references would be appreciated. Answer 1: This has to do with the minimum number of
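
The effect described in the question can be illustrated with a toy model. This is a minimal sketch and not Weka's RandomTree algorithm: it assumes every instance has weight 1 and that each split simply halves the instances. The function name `count_leaves` is hypothetical.

```python
def count_leaves(total_weight: int, min_num: int) -> int:
    """Count the leaves of a toy tree that halves its instances at each split.

    Assumption: each instance has weight 1, and a split is allowed only
    if both children would keep at least `min_num` instances.
    """
    if min_num < 1:
        raise ValueError("min_num must be at least 1")
    left = total_weight // 2
    right = total_weight - left
    if left < min_num or right < min_num:
        return 1  # a split would create a leaf that is too light, so stop here
    return count_leaves(left, min_num) + count_leaves(right, min_num)


if __name__ == "__main__":
    for min_num in (1, 10, 30, 51):
        print(min_num, count_leaves(100, min_num))
```

In this toy setting a larger minimum leaf weight blocks splits earlier, so the tree ends up with fewer leaves, which matches the shrinking tree size observed in the question.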

fast computation of Tomek link in R

Question: I want to implement Tomek links for dealing with unbalanced data. The code is for a binary classification problem where class 1 is the majority class and class 0 is the minority class; X is the input and Y is the output. I've written the following code, but I'm looking for a way to speed up the computation. How can I improve my code?
#########################
#remove overlapping observation using tomek links
#given observations i and j belonging to different classes
#(i,j) is a Tomek link if there
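
For reference, the usual definition of a Tomek link is a pair of observations from different classes that are each other's nearest neighbours. The sketch below shows that definition in Python rather than R, using a brute-force O(n²) search, so it illustrates the logic and is not the fast implementation the question asks for. The function name `tomek_links` and the sample data are hypothetical, and Euclidean distance is an assumption.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def tomek_links(X, y):
    """Return index pairs (i, j) with i < j that form Tomek links.

    Assumes at least two observations and Euclidean distance.
    """
    n = len(X)
    # nearest[i] holds the index of the closest other observation to X[i]
    nearest = []
    for i in range(n):
        others = (j for j in range(n) if j != i)
        nearest.append(min(others, key=lambda j: dist(X[i], X[j])))

    links = []
    for i in range(n):
        j = nearest[i]
        # mutual nearest neighbours with different class labels
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, j))
    return links


if __name__ == "__main__":
    X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0), (9.0, 9.0)]
    y = [1, 0, 1, 1, 0]
    links = tomek_links(X, y)
    # drop the majority-class (label 1) member of every link
    drop = {i if y[i] == 1 else j for i, j in links}
    print(links, sorted(drop))
```

Replacing the inner linear scan with a spatial index is the usual way to bring the cost down on larger data sets.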

How to get the selected features in GridSearchCV in sklearn in python

Question: I am using recursive feature elimination with cross-validation (RFECV) as the feature selection technique with GridSearchCV. My code is as follows. X = df[my_features_all] y = df['gold_standard'] x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0) k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0) clf = RandomForestClassifier(random_state = 42, class_weight="balanced") rfecv = RFECV(estimator=clf, step=1, cv=k_fold, scoring='roc_auc') param_grid = {

Finding a correlation between variable and class variable

Question: I have a dataset that contains 7 numerical attributes and one nominal attribute, which is the class variable. I was wondering how I can find the best attribute to use for predicting the class attribute. Would finding the attribute with the largest information gain be the solution? Answer 1: The problem you are asking about falls under the domain of feature selection and, more broadly, feature engineering. There is a lot of literature online regarding this, and there are definitely a lot of blogs
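
The information gain idea raised in the question can be sketched as follows. Because the attributes are numerical, this sketch assumes each attribute is scored by its best binary split of the form `value <= threshold`; that choice, the function names, and the sample columns are all assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split_gain(values, labels):
    """Largest information gain over all binary splits `value <= threshold`."""
    base = entropy(labels)
    n = len(labels)
    best = 0.0
    # the largest value is skipped so that both sides are always non-empty
    for t in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        gain = (base
                - len(left) / n * entropy(left)
                - len(right) / n * entropy(right))
        best = max(best, gain)
    return best


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    columns = {
        "informative": [1.0, 2.0, 3.0, 4.0],  # separates the classes cleanly
        "noisy": [1.0, 3.0, 2.0, 4.0],        # classes interleave
    }
    ranking = sorted(columns,
                     key=lambda name: best_split_gain(columns[name], labels),
                     reverse=True)
    print(ranking)
```

Ranking attributes by this score gives one candidate ordering. It looks at each attribute in isolation, so it says nothing about interactions between attributes.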

How to improve the performance while operating with files in C

Question: I have implemented the Naive Bayes algorithm on a large data set of 410k rows. All my records are now classified correctly, but the program takes almost an hour to write the records into the corresponding files. What is the best way to improve the performance of my code? Here is the code that writes the 410k records into the corresponding files. Thank you. fp=fopen("sales_ok_fraud.txt","r"); while(fgets(line,80,fp)!=NULL) //Reading each line from file to

retrieve information from a url

Question: I want to write a program that retrieves some information from a URL. For example, given the URL below, from LibraryThing, how can I retrieve all the words below the "TAGS" tab, like Black Library, fantasy, Thanquol & Boneripper, Thanquol and Bone Ripper, Warhammer? I am thinking of using Java and designing a data mining wrapper, but I am not sure how to start. Can anyone give me some advice? EDIT: You gave me excellent help, but I want to ask something else. For every tag we can see how many times

How to plot/visualize a C50 decision tree in R?

Question: I am using the C50 decision tree algorithm. I am able to build the tree and get the summaries, but I cannot figure out how to plot or visualize the tree. My C50 model is called credit_model. In other decision tree packages, I usually use something like plot(credit_model). In rpart it is rpart.plot(credit_model). What is the equivalent for plotting in C50? Answer 1: Right now, there are none built in. I've been working on an adapter for the partykit package (e.g. as.party) but have not gotten very

What is Big Data & What classifies as Big data? [closed]

Question: I have gone through a lot of articles, but I don't seem to get a perfectly clear answer on what exactly big data is. On one page I saw "any data which is bigger for your usage, is big data i.e. 100 MB is considered big data for your mailbox but not your hard disc". Whereas

URL path similarity/string similarity algorithm

Question: My problem is that I need to compare URL paths and deduce whether they are similar. Below I provide example data to process:
# GROUP 1
/robots.txt
# GROUP 2
/bot.html
# GROUP 3
/phpMyAdmin-2.5.6-rc1/scripts/setup.php
/phpMyAdmin-2.5.6-rc2/scripts/setup.php
/phpMyAdmin-2.5.6/scripts/setup.php
/phpMyAdmin-2.5.7-pl1/scripts/setup.php
/phpMyAdmin-2.5.7/scripts/setup.php
/phpMyAdmin-2.6.0-alpha/scripts/setup.php
/phpMyAdmin-2.6.0-alpha2/scripts/setup.php
# GROUP 4
//phpMyAdmin/
I tried Levenshtein
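
The Levenshtein approach mentioned in the question can be sketched as below. This is a minimal illustration, not the asker's code: the `similarity` helper and its normalisation by the longer string's length are assumptions, and any grouping threshold would have to be tuned on real data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    p1 = "/phpMyAdmin-2.5.6-rc1/scripts/setup.php"
    p2 = "/phpMyAdmin-2.5.6-rc2/scripts/setup.php"
    print(similarity(p1, p2))
    print(similarity("/robots.txt", "/bot.html"))
```

Paths from GROUP 3 differ in only a few characters and so score close to 1, while unrelated paths such as `/robots.txt` and `/bot.html` score much lower.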

How to analyse a sparse adjacency matrix?

Question: I am researching sparse adjacency matrices in which most cells are zeros, with a few ones here and there. Each relationship between two cells has a polynomial description that can be very long, and analysing these manually is time-consuming. My instructor is suggesting a purely algebraic method in terms of Gröbner bases, but before proceeding I would like to know, from a purely computer science and programming perspective, how to analyse sparse adjacency matrices. Are there any data mining tools