data-mining

Decision tree vs. Naive Bayes classifier [closed]

我只是一个虾纸丫 posted on 2019-12-02 13:52:39
I am doing some research on different data mining techniques and came across something I could not figure out. If anyone has any ideas, that would be great. In which cases is it better to use a decision tree, and in which a Naive Bayes classifier? Why use one of them in certain cases and the other in different cases? (Looking at functionality, not at the algorithm.) Does anyone have explanations or references on this? Decision trees are very flexible, easy to understand, and easy to debug. They work with both classification and regression problems. So if you are

Why do two different vectors give a similarity of 1?

只谈情不闲聊 posted on 2019-12-02 13:29:46
I'm using the cosine similarity formula to calculate the similarity between two vectors. I tried two different vectors like this: Vector1(-1237373741, 27, 1, 1, 331289590, 1818540802) Vector2(-1237373741, 49, 1, 1, 331289590, 1818540802) The two vectors differ slightly, but the result is 1. I don't know why. Can anyone explain this to me? Thanks so much. For the most part, those two vectors are pointing in the same direction (the larger coordinates dominate the small difference in the other coordinate). A cosine similarity of ~1 is expected (remember that cos(0) = 1).
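The answer's point can be checked directly. This is a minimal sketch in plain Python using the two vectors from the question; the helper name `cosine_similarity` is only illustrative and not taken from any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The two vectors from the question: they differ only in the second coordinate.
v1 = (-1237373741, 27, 1, 1, 331289590, 1818540802)
v2 = (-1237373741, 49, 1, 1, 331289590, 1818540802)

similarity = cosine_similarity(v1, v2)
print(similarity)
```

The 27 vs. 49 difference is negligible next to coordinates with magnitudes around 1e9, so the angle between the vectors is almost zero. If the small coordinates should matter, one option is to rescale each coordinate to a comparable range before computing the similarity.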

Extracting information from AJAX based sites using Python

♀尐吖头ヾ posted on 2019-12-02 10:43:43
I am trying to retrieve query results from AJAX-based sites like www.snapbird.org using Python. Since the results don't show in the page source, I am not sure how to proceed. I am a Python newbie, so it would be great if I could get a pointer in the right direction. I am also open to another approach to the task if that is easier. This is going to be complex, but as a start, open Firebug and find the URL that gets called when the AJAX request is handled. You can call that directly in your Python program and parse the output. You could use Selenium's Python client driver to parse the page
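Calling the AJAX URL directly might look like the sketch below, which uses only the standard library. The endpoint `AJAX_URL` and the `results` key are hypothetical placeholders; the real URL and response layout have to be read from the browser's network panel, and this assumes the endpoint returns JSON.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the URL seen in the browser's network panel.
AJAX_URL = "https://example.com/api/search?q=python"

def parse_results(body):
    """Decode a JSON response body and return its 'results' list (assumed key)."""
    payload = json.loads(body)
    return payload.get("results", [])

def fetch_results(url=AJAX_URL):
    """Request the AJAX URL directly, bypassing the page, and parse the response."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_results(response.read().decode("utf-8"))
```

`fetch_results()` is only defined here, not called, since the placeholder URL does not serve real data. Some sites also require cookies or extra headers copied from the original request.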

Information Gain Calculation for a text file?

|▌冷眼眸甩不掉的悲伤 posted on 2019-12-02 06:57:31
I'm working on "text categorization using Information Gain, PCA and Genetic Algorithm". But after performing preprocessing (stemming, stopword removal, TF-IDF) on the documents, I'm confused about how to move ahead with the information gain part. My output file contains words and their TF-IDF values, like: WORD - TFIDF VALUE together(word) - 0.235(tfidf value) come(word) - 0.2548(tfidf value) When using Weka for information gain ("InfoGainAttributeEval.java"), it requires the .arff file format as input. Is there any way to convert a text file into .arff format, or any other way to perform information gain other than Weka? Is
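Information gain can also be computed without Weka. The sketch below is not the asker's pipeline: it assumes each document has a category label and that a term is reduced to present/absent per document (for example, TF-IDF greater than zero). The function names and the toy data are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(term_present, labels):
    """Entropy reduction of the labels when splitting on term presence."""
    total = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [label for present, label in zip(term_present, labels)
                  if bool(present) == value]
        if subset:
            gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy data: four documents in two categories (hypothetical labels).
labels = ["sports", "sports", "politics", "politics"]
perfect_term = [True, True, False, False]   # appears only in "sports" documents
useless_term = [True, False, True, False]   # spread evenly over both categories

print(information_gain(perfect_term, labels))
print(information_gain(useless_term, labels))
```

A term that separates the categories perfectly gains the full label entropy, while a term spread evenly across categories gains nothing. Ranking terms by this score is one common way to select features.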

Using ELKI's Distance Function

折月煮酒 posted on 2019-12-02 06:35:56
Question: This is a follow-up to a previous question, where we commented that using Euclidean distances with lat,long coordinates does not yield correct results. I read in the documentation that ELKI supports geographic data, namely in its distance functions, which are available in the various clustering algorithms. In the ELKI user interface, I can see there are options to replace the default distance function (Euclidean) with a better-suited one. I also see that in that case, you need to provide a datum, which
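To see why Euclidean distance on raw lat,long pairs misleads, compare it with a great-circle distance. The sketch below uses the haversine formula on a sphere with an assumed mean Earth radius of 6371 km. It is independent of ELKI and is not ELKI's implementation, which the question says works with a datum.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth model

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# One degree of longitude at the equator vs. at 60 degrees north.
at_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
at_60_north = haversine_km(60.0, 0.0, 60.0, 1.0)

print(at_equator, at_60_north)
```

Both pairs are exactly one unit apart under Euclidean distance on the raw coordinates, yet the real distance at 60 degrees north is about half of that at the equator.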

Using a Geo Distance Function on ELKI

我的梦境 posted on 2019-12-02 01:20:33
I am using ELKI to mine some geospatial data (lat,long pairs) and I am quite concerned about using the right data types and algorithms. In the parameterizer of my algorithm, I tried to replace the default distance function with a geo function (LngLatDistanceFunction, as I am using x,y data) as below: params.addParameter (DISTANCE_FUNCTION_ID, geo.LngLatDistanceFunction.class); However, the results are quite surprising: it creates clusters of a repeated point, such as the example below: (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41

mlpy - Dynamic Time Warping depends on x?

我与影子孤独终老i posted on 2019-12-01 20:24:28
Question: I am trying to get the distance between the two arrays shown below using DTW. I am using the Python mlpy package, which offers dist, cost, path = mlpy.dtw_std(y1, y2, dist_only=False) I understand that DTW takes care of the "shifting". In addition, as can be seen from the above, mlpy.dtw_std() only takes two 1-D arrays. So I expect that no matter how I shift my curves left or right, the dist returned by the function should never change. However, after shifting my green curve a bit to the right,

mlpy - Dynamic Time Warping depends on x?

爱⌒轻易说出口 posted on 2019-12-01 19:52:20
I am trying to get the distance between the two arrays shown below using DTW. I am using the Python mlpy package, which offers dist, cost, path = mlpy.dtw_std(y1, y2, dist_only=False) I understand that DTW takes care of the "shifting". In addition, as can be seen from the above, mlpy.dtw_std() only takes two 1-D arrays. So I expect that no matter how I shift my curves left or right, the dist returned by the function should never change. However, after shifting my green curve a bit to the right, the dist returned by mlpy.dtw_std() changes! Before shifting: Python mlpy.dtw_std reports dist = 14
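One way a shift can change a DTW distance is shown in the sketch below. It is a plain textbook DTW with an absolute-difference local cost, not mlpy's implementation, and the toy arrays are made up for illustration. Because the warping path must match the first and last samples of both arrays, a shift that pushes part of a curve out of the sampled window changes the result.

```python
def dtw_distance(a, b):
    """Textbook DTW between two 1-D sequences with |x - y| as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]

bump = [0, 1, 2, 1, 0, 0, 0]
shifted_inside = [0, 0, 0, 1, 2, 1, 0]  # same bump, moved right, still fully visible
shifted_cut = [1, 2, 1, 0, 0, 0, 0]     # moved left so the bump's start is cut off

print(dtw_distance(bump, shifted_inside))
print(dtw_distance(bump, shifted_cut))
```

A shift that keeps the whole shape inside the window is absorbed by the warping. Once the shift changes which values sit at the ends of the array, the forced endpoint match adds cost.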

Java+Redis vs plain Java efficiency for data intensive applications?

南笙酒味 posted on 2019-12-01 18:33:12
Question: Does it help to use Redis with Java to develop data-intensive applications (e.g. data mining) in Java? Does it work faster or consume less memory compared to plain Java for similar operations on high volumes of data? Edit: My question is mostly about running on a single machine, for example working with a large number of lists/sets/maps and querying and sorting them. Answer 1: Redis will definitely not be faster than native Java on a single machine. It would allow you to distribute processing, but if the

Java+Redis vs plain Java efficiency for data intensive applications?

拥有回忆 posted on 2019-12-01 18:07:05
Does it help to use Redis with Java to develop data-intensive applications (e.g. data mining) in Java? Does it work faster or consume less memory compared to plain Java for similar operations on high volumes of data? Edit: My question is mostly about running on a single machine, for example working with a large number of lists/sets/maps and querying and sorting them. Redis will definitely not be faster than native Java on a single machine. It would allow you to distribute processing, but if the chunks of data really are large, they're not likely to fit into memory anyway. Without knowing more about