data-mining

importance of PCA or SVD in machine learning

对着背影说爱祢 submitted on 2019-11-28 15:07:49
Question: For a long time (especially during the Netflix contest), I have kept coming across blog posts and leaderboard forum threads where people mention that applying a simple SVD step to their data helped reduce its sparsity or, more generally, improved the performance of the algorithm at hand. I have been thinking about this for a long time, but I cannot work out why that is so. In general, the data I get is very noisy (which is also the fun part of big data), and I do know some basic feature scaling stuff like

What is the difference between linear regression and logistic regression?

我的梦境 submitted on 2019-11-28 14:55:36
When we have to predict the value of a categorical (or discrete) outcome, we use logistic regression. I believe we also use linear regression to predict the value of an outcome given the input values. Then what is the difference between the two methodologies? Linear regression output as probabilities: It's tempting to use the linear regression output as probabilities, but it's a mistake because the output can be negative or greater than 1, whereas a probability cannot. As regression might actually produce probabilities that could be less than 0, or even bigger than 1, logistic regression was
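
The point about out-of-range outputs can be illustrated with a minimal sketch. The coefficients `WEIGHT` and `BIAS` below are hypothetical values chosen for illustration, not fitted to any data. The sketch only shows that a raw linear score can fall outside [0, 1], while passing the same score through the logistic (sigmoid) function always yields a value strictly between 0 and 1.

```python
import math


def linear_score(x, weight, bias):
    """Raw linear-regression style output: unbounded in both directions."""
    return weight * x + bias


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_probability(x, weight, bias):
    """Logistic-regression style output: the linear score squashed into (0, 1)."""
    return sigmoid(linear_score(x, weight, bias))


# Hypothetical coefficients, for illustration only.
WEIGHT, BIAS = 0.5, -1.0

for x in (-4.0, 0.0, 2.0, 6.0):
    score = linear_score(x, WEIGHT, BIAS)
    prob = logistic_probability(x, WEIGHT, BIAS)
    print(f"x={x:5.1f}  linear={score:5.2f}  logistic={prob:.3f}")
```

With these made-up coefficients the linear score is -3.0 at x = -4 and 2.0 at x = 6, neither of which can be read as a probability, whereas the logistic output stays inside (0, 1) for every input.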

Data Mining open source tools [closed]

别来无恙 submitted on 2019-11-28 14:39:20
Question: I'm due to take up a data mining project. Before I jump in, I wanted to look around for different data mining tools (preferably open source) that allow web-based reporting. In my scenario the data would be provided to me, so I'm not supposed to crawl for it. In a nutshell, I am looking for a tool that offers data analysis, web-based reporting, some kind of dashboard, and mining features. I have worked with Microsoft Analysis Services and BOXI, and of late I have

Calculate AUC in R?

泪湿孤枕 submitted on 2019-11-28 14:11:18
Question: Given a vector of scores and a vector of actual class labels, how do you calculate a single-number AUC metric for a binary classifier, in the R language or in simple English? Page 9 of "AUC: a Better Measure..." seems to require knowing the class labels, and here is an example in MATLAB where I don't understand R(Actual == 1), because R (not to be confused with the R language) is defined as a vector but used as a function? Answer 1: As mentioned by others, you can compute the AUC using the ROCR
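
As a language-neutral illustration of the "simple English" version, here is a minimal sketch of the rank-based (Mann-Whitney) way to compute AUC from scores and labels. It is not the ROCR approach the answer goes on to describe; the function names `average_ranks` and `auc` are made up for this sketch. It assumes labels are coded 1 for the positive class and 0 for the negative class. In MATLAB, parentheses also index arrays, so an expression like R(Actual == 1) can be read as "the entries of R where the label is 1"; assuming R holds the ranks of the scores, that corresponds to the positive-rank sum below.

```python
def average_ranks(values):
    """Return 1-based ranks of `values`, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def auc(scores, labels):
    """AUC via the rank-sum formula; labels must be 1 (positive) or 0 (negative)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    ranks = average_ranks(scores)
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

The result is the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half, which is why the class labels are needed.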

NSL KDD Features from Raw Live Packets?

折月煮酒 submitted on 2019-11-28 11:42:49
Question: I want to extract raw data using pcap and WinPcap. Since I will be testing it against a neural network trained with the NSL-KDD dataset, I want to know how to get those 41 attributes from the raw data. Or, if that is not possible, is it possible to obtain features like src_bytes, dst_host_same_srv_rate, diff_srv_rate, count, dst_host_serror_rate, and wrong_fragment from raw live packets captured with pcap? Answer 1: If someone would like to experiment with KDD '99 features despite the bad reputation of the

Cluster center mean of DBSCAN in R?

此生再无相见时 submitted on 2019-11-28 08:28:28
Question: Using dbscan in the fpc package, I am able to get this output:

    dbscan Pts=322 MinPts=20 eps=0.005
             0   1
    seed     0 233
    border  87   2
    total   87 235

but I need to find the cluster center (the mean of the cluster with the most seeds). Can anyone show me how to proceed with this?

Answer 1: Just index back into the original data using the cluster ID of your choice. Then you can easily do whatever further processing you want on the subset. Here is an example:

    library(fpc)
    n = 100
    set.seed(12345)
    data = matrix(rnorm(n*3), nrow=n
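
The "index back into the original data" idea can be sketched independently of fpc. The following is a minimal Python illustration, not the answer's R code: `points`, `cluster_ids`, and both function names are hypothetical. It assumes cluster ID 0 marks noise points, as the 0 column in the output above suggests, and it picks the largest cluster by total member count as a simplification of "most seeds".

```python
from collections import Counter


def largest_cluster(cluster_ids, noise_id=0):
    """Return the ID of the cluster with the most members, ignoring noise."""
    counts = Counter(c for c in cluster_ids if c != noise_id)
    if not counts:
        raise ValueError("no non-noise clusters found")
    return counts.most_common(1)[0][0]


def cluster_mean(points, cluster_ids, cluster_id):
    """Mean of the rows of `points` whose cluster ID equals `cluster_id`."""
    members = [p for p, c in zip(points, cluster_ids) if c == cluster_id]
    if not members:
        raise ValueError(f"cluster {cluster_id} has no members")
    n_dims = len(members[0])
    return [sum(p[d] for p in members) / len(members) for d in range(n_dims)]


# Made-up data: four 2-D points, two of them labelled as noise (ID 0).
points = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
cluster_ids = [0, 1, 1, 0]

best = largest_cluster(cluster_ids)
print(best, cluster_mean(points, cluster_ids, best))
```

The same two steps (select the rows for one cluster ID, then average each column) are what the subset-and-process advice in the answer amounts to.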

Why is Spark Mllib KMeans algorithm extremely slow?

孤人 submitted on 2019-11-28 07:52:40
Question: I'm having the same problem as in this post, but I don't have enough points to add a comment there. My dataset has 1 million rows and 100 columns. I'm also using MLlib KMeans, and it is extremely slow. In fact, the job never finishes and I have to kill it. I am running this on Google Cloud (Dataproc). It runs if I ask for a smaller number of clusters (k=1000), but it still takes more than 35 minutes. I need it to run for k~5000. I have no idea why it is so slow. The data is properly partitioned given the

How to find out if a sentence is a question (interrogative)?

南笙酒味 submitted on 2019-11-28 05:23:05
Is there an open source Java library/algorithm for determining whether a particular piece of text is a question? I am working on a question answering system that needs to analyze whether the text input by the user is a question. I think the problem can probably be solved using open source NLP libraries, but it's obviously more complicated than simple part-of-speech tagging. So if someone can instead describe an algorithm for it that uses an existing open source NLP library, that would be good too. Also let me know if you know of a library/toolkit that uses data mining to solve this problem. Although it will be

How to deal with multiple class ROC analysis in R (pROC package)?

一曲冷凌霜 submitted on 2019-11-28 05:09:47
Question: I am using the multiclass.roc function in R (pROC package). For instance, I trained a random forest on a data set; here is my code:

    # randomForest & pROC packages should be installed:
    # install.packages(c('randomForest', 'pROC'))
    data(iris)
    library(randomForest)
    library(pROC)
    set.seed(1000)
    # 3-class in response variable
    rf = randomForest(Species~., data = iris, ntree = 100)
    # predict(.., type = 'prob') returns a probability matrix
    multiclass.roc(iris$Species, predict(rf, iris, type = 'prob'))

And

Java framework for image pattern recognition?

倖福魔咒の submitted on 2019-11-28 05:05:45
I'm looking for a Java framework to help with some data mining specific to images. We have a set of historical images that I would like to categorize and classify. I was hoping to find something like Weka http://www.cs.waikato.ac.nz/ml/weka/ or Marsyas http://marsyas.sness.net but more specific to sifting through image data to find patterns. Any suggestions? What about using the OpenCV library for Processing? Technically, Processing is not Java, but it runs on the JVM and shouldn't be difficult to get working. OpenCV is the standard choice for computer vision, which is what you're trying to