data-mining | 易学教程

URL path similarity/string similarity algorithm

My problem is that I need to compare URL paths and deduce whether they are similar. Below is example data to process:

# GROUP 1
/robots.txt

# GROUP 2
/bot.html

# GROUP 3
/phpMyAdmin-2.5.6-rc1/scripts/setup.php
/phpMyAdmin-2.5.6-rc2/scripts/setup.php
/phpMyAdmin-2.5.6/scripts/setup.php
/phpMyAdmin-2.5.7-pl1/scripts/setup.php
/phpMyAdmin-2.5.7/scripts/setup.php
/phpMyAdmin-2.6.0-alpha/scripts/setup.php
/phpMyAdmin-2.6.0-alpha2/scripts/setup.php

# GROUP 4
//phpMyAdmin/

I tried Levenshtein distance for the comparison, but it is not accurate enough for me. I do not need a 100% accurate algorithm, but I think
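
For reference, the Levenshtein comparison mentioned in this question can be sketched as below. This is a minimal illustration, not the asker's code: the function names `levenshtein` and `normalized_similarity` are hypothetical, and dividing by the longer string's length is just one common way to turn the distance into a score between 0 and 1.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Hypothetical helper: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


same_group = normalized_similarity(
    "/phpMyAdmin-2.5.6-rc1/scripts/setup.php",
    "/phpMyAdmin-2.5.6-rc2/scripts/setup.php",
)
different_group = normalized_similarity("/robots.txt", "/bot.html")
print(same_group, different_group)
```

The two GROUP 3 paths differ by a single character, so their score is close to 1.0, while the score for /robots.txt against /bot.html is clearly lower. Where to put the cut-off between "similar" and "not similar" is left open here.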

TFIDF calculating confusion

Question: I found the following code on the internet for calculating TFIDF: https://github.com/timtrueman/tf-idf/blob/master/tf-idf.py

I added "1+" in the function def idf(word, documentList) so that I won't get a division-by-zero error:

    return math.log(len(documentList) / (1 + float(numDocsContaining(word,documentList))))

But I am confused about two things: I get negative values in some cases; is this correct? I am also confused by lines 62, 63 and 64. Code:

    documentNumber = 0 for word in documentList[documentNumber]
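
The negative values follow directly from the modified formula: when a word occurs in every one of N documents, the argument of the logarithm is N / (1 + N), which is below 1. The sketch below reproduces this with a simplified stand-in for the linked script. The names `num_docs_containing` and `idf_smoothed`, and the use of pre-tokenized word lists, are assumptions made for illustration, not the original code.

```python
import math


def num_docs_containing(word, document_list):
    """Count the documents (lists of words) that contain the word."""
    return sum(1 for document in document_list if word in document)


def idf_smoothed(word, document_list):
    """Same shape as the formula in the question: log(N / (1 + df))."""
    df = num_docs_containing(word, document_list)
    return math.log(len(document_list) / (1 + df))


# Hypothetical toy corpus: each document is already split into words.
docs = [
    ["the", "cat"],
    ["the", "dog"],
    ["the", "bird", "cat"],
]

for word in ("the", "cat", "dog"):
    print(word, idf_smoothed(word, docs))
# "the" is in all 3 documents: log(3 / 4) is negative.
# "cat" is in 2 documents:     log(3 / 3) is exactly 0.
# "dog" is in 1 document:      log(3 / 2) is positive.
```

Whether a negative weight is acceptable depends on how the scores are used afterwards. Some variants add 1 to the numerator as well, or to the whole logarithm, to keep the result non-negative.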

how to get all terminal nodes - weight & response prediction 'ctree' in r

Here is what I can use to list the weights for all terminal nodes: but how can I add some code to get the response prediction as well as the weight for each terminal node ID? Say I want my output to look like this -- Below is what I have so far to get the weights: nodes(airct, unique(where(airct))) Thank you.

The BinaryTree is a big S4 object, so it is sometimes difficult to extract the data. But the plot method for a BinaryTree object has an optional panel function of the form function(node) for plotting the terminal nodes. So when you plot, you can get all the information about each node. Here I use the

Find HEX patterns and number of occurrences

I'd like to find patterns in a HEX file I have and sort them by number of occurrences. I am not looking for a specific pattern; I just want to gather statistics on the patterns that occur there and sort them.
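
One way to read this task is to count every fixed-length byte sequence in the file and rank the sequences by frequency. The sketch below assumes exactly that interpretation. The function names `count_byte_patterns` and `top_patterns`, the pattern length, and the sample bytes are all hypothetical.

```python
from collections import Counter


def count_byte_patterns(data: bytes, length: int) -> Counter:
    """Count every overlapping byte sequence of the given length."""
    return Counter(data[i:i + length] for i in range(len(data) - length + 1))


def top_patterns(data: bytes, length: int, limit: int = 10):
    """Return (hex string, occurrences) pairs, most frequent first."""
    counts = count_byte_patterns(data, length)
    return [(pattern.hex(), n) for pattern, n in counts.most_common(limit)]


# Hypothetical sample; for a real file use: data = open(path, "rb").read()
data = bytes.fromhex("deadbeefdeadbeef00")
for pattern, occurrences in top_patterns(data, length=2):
    print(pattern, occurrences)
```

Running the count for several lengths (for example 2, 4 and 8 bytes) gives a rough picture of which repeated structures the file contains. For very large files the counter can grow large, so reading in chunks or limiting the pattern length may be needed.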

Web mining -classification algorithms

My senior project is determining the dominant category of a web page. I crawled dmoz, and now I am trying to build an ARFF file. After that I will use some feature extraction methods and classification algorithms. Do you know which feature extraction method performs well with any classification algorithm for web mining?

uClassify uses Bayesian Networks and claims to be able to categorize web pages. uClassify is a free web service where you can easily create your own text classifiers. Examples: spam filter, web page categorization, automatic e-mail support, language detection, written text gender recognition

Hierarchical clusterization heuristics

Question: I want to explore relations between data items in a large array. Every data item is represented by a multidimensional vector. First of all, I've decided to use clustering. I'm interested in finding hierarchical relations between clusters (groups of data vectors). I'm able to calculate the distance between my vectors, so as a first step I'm finding the minimum spanning tree. After that I need to group data vectors according to the links in my spanning tree. But at this step I'm stuck - how to combine
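
One possible way to turn spanning-tree links into a hierarchy is single-linkage clustering: process the edges from shortest to longest (as Kruskal's algorithm does) and record each merge of two groups. The sketch below assumes Euclidean distance and small inputs. The name `single_linkage_merges` is hypothetical, and this is one candidate heuristic rather than the approach the asker settled on.

```python
import math
from itertools import combinations


def single_linkage_merges(points):
    """Return merges as (distance, members_a, members_b), shortest link first.

    Each accepted edge is a minimum-spanning-tree edge; the order of the
    merges describes the hierarchy (dendrogram) from bottom to top.
    """
    parent = list(range(len(points)))
    members = {i: [i] for i in range(len(points))}

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[a], points[b]), a, b)
        for a, b in combinations(range(len(points)), 2)
    )
    merges = []
    for distance, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # both ends already in one group: not a tree edge
        merges.append((distance, sorted(members[root_a]), sorted(members[root_b])))
        parent[root_b] = root_a
        members[root_a] = members[root_a] + members.pop(root_b)
    return merges


# Hypothetical sample vectors.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
for distance, left, right in single_linkage_merges(points):
    print(round(distance, 2), left, right)
```

Stopping after a chosen number of merges, or cutting at a distance threshold, yields flat clusters at that level of the hierarchy. Comparing all pairs is quadratic in the number of vectors, so a large array would need a precomputed spanning tree or a spatial index instead.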

How can HMMs be used for handwriting recognition?

Question: The problem is a bit different from traditional handwriting recognition. I have a dataset consisting of thousands of the following: for one drawn character, I have several sequential (x, y) coordinates where the pen was pressed down. So, this is a sequential (temporal) problem. I want to be able to classify handwritten characters based on this data, and would love to implement HMMs for learning purposes. But is this the right approach? How can they be used to do this?

Answer 1: I think HMM can be used

Algorithm for clustering people with similar interests

I want to cluster people into groups based on their interests. For example, people who like machine learning and graphs may be placed in one group, and people who are interested in mathematics and economics may be placed in a different group. The algorithm should be able to decide which people have the most closely matching interests and create clusters from that. It should also be able to list the other people in the group in which a particular person is placed.

This does not sound like a particularly difficult clustering problem, and any of the off-the-shelf clustering
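
As a rough illustration of the idea, each person's interests can be treated as a set and compared with Jaccard similarity (shared interests divided by all interests of the pair). The sketch below then groups people greedily. The names `jaccard` and `group_by_interests`, the 0.3 threshold, and the sample data are assumptions, and an off-the-shelf clustering algorithm could be fed the same similarity instead.

```python
def jaccard(a: set, b: set) -> float:
    """Shared interests divided by all interests of the two people."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_by_interests(people: dict, threshold: float = 0.3) -> list:
    """Greedy grouping: join the first group containing a similar person."""
    groups = []
    for name, interests in people.items():
        for group in groups:
            if any(jaccard(interests, people[other]) >= threshold for other in group):
                group.append(name)
                break
        else:
            groups.append([name])  # nobody similar enough: start a new group
    return groups


def others_in_group(groups: list, name: str) -> list:
    """List the other people placed in the same group as the given person."""
    for group in groups:
        if name in group:
            return [other for other in group if other != name]
    return []


# Hypothetical sample data.
people = {
    "ann": {"machine learning", "graphs"},
    "bob": {"graphs", "machine learning", "statistics"},
    "cid": {"mathematics", "economics"},
    "dee": {"economics", "mathematics", "finance"},
}
groups = group_by_interests(people)
print(groups)
print(others_in_group(groups, "ann"))
```

The greedy pass depends on the order in which people are processed, which is why it is only a sketch; hierarchical or k-medoids clustering on the same Jaccard distances would be less order-sensitive.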

Historical weather data from NOAA

Question: I am working on a data mining project and I would like to gather historical weather data. I am able to get historical data through the web interface that NOAA provides at http://www.ncdc.noaa.gov/cdo-web/search, but I would like to access this data programmatically through an API. From what I have been reading on StackOverflow, this data is supposed to be in the public domain, but the only place I have been able to find it is on non-free services like Wunderground. How can I access this data for free?

What does dimensionality reduction mean?

Question: What does dimensionality reduction mean exactly? I searched for its meaning and only found that it means the transformation of raw data into a more useful form. So what is the benefit of having data in a useful form? I mean, how can I use it in practice (in an application)?

Answer 1: Dimensionality reduction is about converting data of very high dimensionality into data of much lower dimensionality such that each of the lower dimensions conveys much more information. This is typically done while