data-mining

Ways to calculate similarity

非 Y 不嫁゛ submitted on 2019-11-29 21:49:30
I am building a community website that requires me to calculate the similarity between any two users. Each user is described by the following attributes: age, skin type (oily, dry), hair type (long, short, medium), lifestyle (active outdoor lover, TV junkie), and others. Can anyone tell me how to go about this problem or point me to some resources? George Dontas: Another way of computing (in R) all the pairwise dissimilarities (distances) between observations in the data set. The original variables may be of mixed types. The handling of nominal, ordinal, and (a)symmetric binary data is achieved ...
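The answer excerpt above points at mixed-type dissimilarities in R; a minimal sketch using daisy() from the cluster package (which implements Gower's coefficient for mixed data) might look like the following. The example data frame and its column values are invented for illustration, not taken from the question.

library(cluster)  # provides daisy() for mixed-type dissimilarities

# Hypothetical user table; names and values are illustrative only
users <- data.frame(
  age       = c(25, 40, 31),
  skin      = factor(c("oily", "dry", "dry")),
  hair      = factor(c("long", "short", "medium"),
                     levels = c("short", "medium", "long"), ordered = TRUE),
  lifestyle = factor(c("outdoor", "tv", "outdoor"))
)

# Gower's coefficient handles numeric, nominal, and ordinal columns together
d <- daisy(users, metric = "gower")
as.matrix(d)  # pairwise dissimilarities in [0, 1]; similarity is 1 - d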

What are some good ways of estimating 'approximate' semantic similarity between sentences?

孤人 submitted on 2019-11-29 20:51:55
I have been looking at the nlp tag on SO for the past couple of hours and am confident I did not miss anything, but if I did, please do point me to the question. In the meantime, I will describe what I am trying to do. A common notion that I observed in many posts is that semantic similarity is difficult. For instance, from this post, the accepted solution suggests the following: First of all, neither from the perspective of computational linguistics nor of theoretical linguistics is it clear what the term 'semantic similarity' means exactly. ... Consider these examples: Pete and Rob ...

Information retrieval (IR) vs data mining vs Machine Learning (ML)

你说的曾经没有我的故事 submitted on 2019-11-29 20:28:54
People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them. From people with experience in these fields, what exactly draws the line between them? doug: This is just the view of one person (formally trained in ML); others might see things quite differently. Machine Learning is probably the most homogeneous of these three terms and the most consistently applied--it's limited to the pattern-extraction (or pattern-matching) algorithms themselves. Of the terms you mentioned, "Machine Learning" is the one most used by academic departments to ...

Data Mining open source tools [closed]

房东的猫 submitted on 2019-11-29 19:47:26
I'm due to take up a project that involves data mining. Before I jump in, I wanted to probe around for different data mining tools (preferably open source) that allow web-based reporting. In my scenario the data would be provided to me, so I'm not supposed to crawl for it. In a nutshell, I am looking for a tool that does data analysis and web-based reporting, provides some kind of a dashboard, and has mining features. I have worked on Microsoft Analysis Services and BOXI, and of late I have been looking at Pentaho, which seems to be a good option. Please share your experiences with any such tool ...

importance of PCA or SVD in machine learning

耗尽温柔 submitted on 2019-11-29 19:33:41
All this time (especially in the Netflix contest), I keep coming across blogs (or leaderboard forums) where they mention how applying a simple SVD step to the data helped them reduce sparsity or, in general, improved the performance of the algorithm at hand. I have been trying to work it out for a long time, but I am not able to see why this is so. In general, the data I get is very noisy (which is also the fun part of big data), and I do know some basic feature scaling such as log transformation and mean normalization. But how does something like SVD help? So let's say I have ...
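One concrete way to see the effect is to keep only the top k singular values of a ratings-style matrix: this gives the best rank-k approximation of the data and tends to smooth out noise. A minimal sketch in base R, with an invented random matrix standing in for real data:

set.seed(1)
A <- matrix(rnorm(100 * 20), nrow = 100)  # stand-in for a users x items matrix

s <- svd(A)          # A = U %*% diag(d) %*% t(V)
k <- 5               # keep only the top-k singular values
A_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# A_k is the closest rank-k matrix to A in the least-squares sense
# (Eckart-Young); the discarded components often carry mostly noise
norm(A - A_k, "F")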

Calculate AUC in R?

空扰寡人 submitted on 2019-11-29 18:57:15
Given a vector of scores and a vector of actual class labels, how do you calculate a single-number AUC metric for a binary classifier in the R language, or in simple English? Page 9 of "AUC: a Better Measure..." seems to require knowing the class labels, and here is an example in MATLAB where I don't understand R(Actual == 1), because R (not to be confused with the R language) is defined as a vector but used as a function. As mentioned by others, you can compute the AUC using the ROCR package. With the ROCR package you can also plot the ROC curve, lift curve, and other model selection measures. You ...
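For reference, a short sketch of the ROCR approach the answer alludes to; the scores and labels here are synthetic:

library(ROCR)  # install.packages("ROCR") if needed

# Synthetic example: predicted scores and true 0/1 labels
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(1,   1,   0,   1,   0,    0,   1,   0)

pred <- prediction(scores, labels)
auc  <- performance(pred, measure = "auc")@y.values[[1]]
auc  # single-number AUC

# The ROC curve itself, if you also want the plot
perf <- performance(pred, "tpr", "fpr")
plot(perf)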

What is an intuitive explanation of the Expectation Maximization technique? [closed]

蹲街弑〆低调 submitted on 2019-11-29 18:34:10
Expectation Maximization (EM) is a kind of probabilistic method to classify data. Please correct me if I am wrong and it is not a classifier. What is an intuitive explanation of this EM technique? What is "expectation" here, and what is being maximized? Note: the code behind this answer can be found here. Suppose we have some data sampled from two different groups, red and blue: Here, we can see which data point belongs to the red or blue group. This makes it easy to find the parameters that characterise each group. For example, the mean of the red group is around 3, the mean of the blue group ...
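To make the expectation/maximization split concrete, here is a compact sketch of EM for a two-component 1-D Gaussian mixture in R. This is not the code the answer links to; the data are simulated:

set.seed(42)
x <- c(rnorm(150, mean = 3, sd = 1), rnorm(150, mean = 7, sd = 1))

# Initial guesses for the two groups
mu <- c(2, 8); sigma <- c(1, 1); pi_k <- c(0.5, 0.5)

for (iter in 1:50) {
  # E-step: expected membership (responsibility) of each point in each group
  r1 <- pi_k[1] * dnorm(x, mu[1], sigma[1])
  r2 <- pi_k[2] * dnorm(x, mu[2], sigma[2])
  gamma <- r1 / (r1 + r2)   # P(point belongs to group 1 | current parameters)

  # M-step: maximize the expected log-likelihood by updating the parameters
  mu[1] <- sum(gamma * x) / sum(gamma)
  mu[2] <- sum((1 - gamma) * x) / sum(1 - gamma)
  sigma[1] <- sqrt(sum(gamma * (x - mu[1])^2) / sum(gamma))
  sigma[2] <- sqrt(sum((1 - gamma) * (x - mu[2])^2) / sum(1 - gamma))
  pi_k <- c(mean(gamma), 1 - mean(gamma))
}

round(c(mu, sigma), 2)  # should recover means near 3 and 7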

NSL-KDD Features from Raw Live Packets?

南楼画角 submitted on 2019-11-29 16:53:42
I want to extract raw data using pcap and WinPcap. Since I will be testing it against a neural network trained on the NSL-KDD dataset, I want to know how to get those 41 attributes from the raw data. Or, even if that is not possible, is it possible to obtain features like src_bytes, dst_host_same_srv_rate, diff_srv_rate, count, dst_host_serror_rate, and wrong_fragment from raw live captured packets from pcap? If someone would like to experiment with KDD '99 features despite the bad reputation of the dataset, I created a tool named kdd99extractor to extract a subset of KDD features from live traffic or .pcap ...
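Many of the KDD features are per-connection aggregates over a time window, so once packets are parsed into a connection table they reduce to group-wise counts. A rough illustration in R, with an invented data frame standing in for parsed pcap output (this is not how kdd99extractor works internally):

# Hypothetical table of recent connections parsed from a capture
conns <- data.frame(
  ts        = c(0.1, 0.5, 0.9, 1.2, 1.8),
  dst_host  = c("10.0.0.5", "10.0.0.5", "10.0.0.5", "10.0.0.9", "10.0.0.5"),
  service   = c("http", "http", "ssh", "http", "http"),
  src_bytes = c(300, 512, 40, 128, 700)
)

# KDD-style 'count': connections to the same host within the last 2 seconds
now <- 2.0
window <- conns[conns$ts >= now - 2, ]
count_feature <- sum(window$dst_host == "10.0.0.5")

# 'diff_srv_rate': fraction of those connections using a different service
same_host <- window[window$dst_host == "10.0.0.5", ]
diff_srv_rate <- mean(same_host$service != "http")

c(count = count_feature, diff_srv_rate = diff_srv_rate)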

Cluster center mean of DBSCAN in R?

我是研究僧i submitted on 2019-11-29 15:11:11
Using dbscan in package fpc I am able to get the following output:

dbscan Pts=322 MinPts=20 eps=0.005
         0   1
seed     0 233
border  87   2
total   87 235

But I need to find the cluster center (the mean of the cluster with the most seeds). Can anyone show me how to proceed with this? Just index back into the original data using the cluster ID of your choice. Then you can easily do whatever further processing you want on that subset. Here is an example:

library(fpc)
n = 100
set.seed(12345)
data = matrix(rnorm(n*3), nrow=n)
data.ds = dbscan(data, 0.5)
> data.ds
dbscan Pts=100 MinPts=5 eps=0.5
         0 1 2 3
seed     0 1 3 1
border  83 4 4 4
...
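Following that advice, a small sketch of the actual center computation, continuing the example's variables. fpc's dbscan stores assignments in $cluster with 0 meaning noise; cluster size is used here as a proxy for "most seeds":

# Pick the non-noise cluster with the most members
tab <- table(data.ds$cluster[data.ds$cluster != 0])
biggest <- as.integer(names(which.max(tab)))

# Column means over that cluster's members give the cluster "center"
center <- colMeans(data[data.ds$cluster == biggest, , drop = FALSE])
center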

Why is the Spark MLlib KMeans algorithm extremely slow?

拟墨画扇 submitted on 2019-11-29 14:20:47
I'm having the same problem as in this post, but I don't have enough points to add a comment there. My dataset has 1 million rows and 100 columns. I'm using MLlib KMeans as well, and it is extremely slow. In fact, the job never finishes and I have to kill it. I am running this on Google Cloud (Dataproc). It runs if I ask for a smaller number of clusters (k=1000), but it still takes more than 35 minutes. I need it to run for k~5000. I have no idea why it is so slow. The data is properly partitioned given the number of workers/nodes, and SVD on a 1 million x ~300,000 col matrix takes ~3 minutes, but when it ...
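A commonly reported culprit for this is the default k-means|| initialization, which becomes very expensive at large k; switching to random initialization is a frequent workaround. A hedged sketch via the sparklyr interface, keeping the example in R (the init_mode argument is an assumption about your sparklyr version; check ?ml_kmeans):

library(sparklyr)

sc  <- spark_connect(master = "yarn")               # e.g. on Dataproc
tbl <- sdf_copy_to(sc, my_data, overwrite = TRUE)   # my_data: hypothetical local frame

# init_mode = "random" skips the costly k-means|| seeding step, which is
# the commonly reported bottleneck when k is in the thousands
model <- ml_kmeans(tbl, formula = ~ ., k = 5000,
                   max_iter = 10, init_mode = "random")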