data-mining

Architecture for database analytics

Submitted by 三世轮回 on 2019-12-02 23:06:34
We have an architecture where we provide each customer with Business Intelligence-like services for their website (an internet merchant). Now I need to analyze that data internally (for algorithmic improvement, performance tracking, etc.), and it is potentially quite heavy: we have up to millions of rows per customer per day, and I may want to know how many queries we had in the last month, compared week by week, and so on; that is on the order of billions of entries, if not more. The way it is currently done is quite standard: daily scripts which scan the databases and generate big CSV files. I don't like this …
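A minimal sketch of the usual alternative, pre-aggregation: roll raw events up into one row per customer per day, so month or week reports scan thousands of rows instead of billions. The `events`/`daily_counts` schema below is hypothetical, and SQLite stands in for whatever warehouse actually holds the data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
conn.executescript("""
    CREATE TABLE events (customer_id INTEGER, ts TEXT);   -- raw rows, billions
    CREATE TABLE daily_counts (customer_id INTEGER, day TEXT, n_queries INTEGER);
    INSERT INTO events VALUES (1, '2019-12-01 10:00:00'),
                              (1, '2019-12-01 11:00:00'),
                              (2, '2019-12-01 12:00:00');
""")
# The daily job writes one aggregate row per customer per day instead of
# dumping every raw row to CSV; reporting queries then hit only this table.
conn.execute("""
    INSERT INTO daily_counts
    SELECT customer_id, date(ts), count(*)
    FROM events
    GROUP BY customer_id, date(ts)
""")
print(conn.execute("SELECT * FROM daily_counts").fetchall())
# [(1, '2019-12-01', 2), (2, '2019-12-01', 1)]
```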

Web scraping, screen scraping, data mining tips? [closed]

Submitted by 会有一股神秘感。 on 2019-12-02 21:22:00
I'm working on a project where I need to do a lot of screen scraping to get a lot of data as fast as possible. I'm wondering if anyone knows of any good APIs or resources to help me out. I'm using Java, by the way. Here's what my workflow has been so far (a sketch of this loop follows below):
1. Connect to a website (using HttpComponents from Apache).
2. The website contains a section with a bunch of links that I need to visit (using the built-in Java HTML parsers to figure out what all the links I need to visit are; this is annoying and messy code).
3. Visit all the links that I found.
4. For each link that I visit, there's more data that I need to …
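The question is about Java, but since the rest of this digest uses Python, here is the same connect → collect links → visit loop sketched with only the Python standard library. The entry URL is a placeholder, and the per-page extraction step is left as a stub:

```python
import urllib.request
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag -- the 'annoying and messy'
    link-extraction step, done with the standard-library parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

START_URL = "https://example.com/listing"  # placeholder for the real site

html = urllib.request.urlopen(START_URL).read().decode("utf-8", "replace")
collector = LinkCollector()
collector.feed(html)

for link in collector.links:
    # urljoin resolves relative hrefs against the page they came from
    page = urllib.request.urlopen(urljoin(START_URL, link)).read()
    # ... extract the per-page data out of `page` here ...
```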

How would you group/cluster these three areas in arrays in Python?

Submitted by 自作多情 on 2019-12-02 20:35:01
So you have an array: 1 2 3 60 70 80 100 220 230 250. For a better understanding: how would you group/cluster the three areas into arrays in Python (v2.6), so that in this case you get three arrays containing [1 2 3], [60 70 80 100], [220 230 250]? Background: the y-axis is frequency and the x-axis is number; these numbers are the ten highest amplitudes, each represented by its frequency. I want to create three discrete numbers from them for pattern recognition. There could be many more points, but all of them are separated by a relatively big difference, as you can see in this example between about 50 and …
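For data this cleanly separated, no clustering library is needed: sort the values and start a new group whenever the gap to the previous value exceeds a threshold. A minimal sketch, with the gap threshold (20 here) chosen by eye from the example data:

```python
def split_by_gaps(values, gap=20):
    """Split sorted values into groups wherever consecutive values
    differ by more than `gap`."""
    values = sorted(values)
    groups = [[values[0]]]
    for v in values[1:]:
        if v - groups[-1][-1] > gap:
            groups.append([v])      # big jump: start a new cluster
        else:
            groups[-1].append(v)    # small step: extend the current cluster
    return groups

print(split_by_gaps([1, 2, 3, 60, 70, 80, 100, 220, 230, 250]))
# [[1, 2, 3], [60, 70, 80, 100], [220, 230, 250]]
```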

Better text document clustering than tf/idf and cosine similarity?

Submitted by ≯℡__Kan透↙ on 2019-12-02 19:21:49
I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf-idf and cosine similarity, but I found that the results are quite bad. The main disadvantage of tf-idf is that it clusters documents that are keyword-similar, so it's only good for identifying near-identical documents. For example, consider the following sentences: 1. The website Stackoverflow is a nice place. 2. Stackoverflow is a website. The previous two sentences will likely be clustered together with a …
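One common step up from raw tf-idf is to compare documents in a reduced "semantic" space (latent semantic analysis), so documents can match on co-occurring terms rather than only on shared keywords. A sketch assuming scikit-learn is available; a real Twitter stream would need far more components and documents than this toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "The weather in Chicago is nice today.",
    "Chicago weather turns cold in winter.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
# Project into a low-rank topic space; with real data use ~100-300 components.
topics = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)
# Documents about the same topic land close together in this space.
print(cosine_similarity(topics).round(2))
```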

Extracting information from AJAX based sites using Python

Submitted by 帅比萌擦擦* on 2019-12-02 18:28:16
Question: I am trying to retrieve query results from AJAX-based sites like www.snapbird.org using Python. Since the results don't show up in the page source, I am not sure how to proceed. I am a Python newbie, so it would be great if I could get a pointer in the right direction. I am also open to some other approach to the task if that is easier. Answer 1: This is going to be complex, but as a start, open Firebug and find the URL that gets called when the AJAX request is handled. You can call that directly in …
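The answer's approach, calling the discovered endpoint directly, looks roughly like this in Python. The URL, parameters, and JSON shape below are hypothetical; the real ones come from watching Firebug (or the browser's network tab) while the page loads. Assumes the third-party `requests` package:

```python
import requests

# Hypothetical endpoint spotted in Firebug / the network tab.
API_URL = "https://example.com/api/search"

resp = requests.get(
    API_URL,
    params={"q": "some query", "page": 1},
    headers={"X-Requested-With": "XMLHttpRequest"},  # some sites check this
)
resp.raise_for_status()
data = resp.json()  # AJAX endpoints usually return JSON, not rendered HTML
for item in data.get("results", []):  # key name is hypothetical
    print(item)
```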

What FFT descriptors should be used as feature to implement classification or clustering algorithm?

Submitted by 谁都会走 on 2019-12-02 17:46:23
I have some sampled geographical trajectories to analyze, and I calculated the histogram of the data in the spatial and temporal dimensions, which yielded a time-domain feature for each spatial element. I want to perform a discrete FFT to transform the time-domain feature into a frequency-domain feature (which I think may be more robust), and then run some classification or clustering algorithms. But I'm not sure which descriptor to use as the frequency-domain feature, since a signal has an amplitude spectrum, a power spectrum, and a phase spectrum, and I've read some references but am still …
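A common first choice is the amplitude spectrum, because unlike the phase spectrum it is invariant to time shifts of the signal; the power spectrum is simply its square, so the two carry the same ranking information. A sketch with NumPy; the hourly histogram is an invented stand-in for one spatial element's time-domain feature:

```python
import numpy as np

def fft_features(signal, n_bins=16):
    """Return the low-frequency amplitude spectrum as a feature vector.

    rfft is used because the input histogram is real-valued; np.abs gives
    the amplitude spectrum (squaring it would give the power spectrum).
    """
    spectrum = np.abs(np.fft.rfft(signal))
    return spectrum[:n_bins]  # keep the lowest-frequency bins as the feature

# Example: a 24-bin "hits per hour" histogram for one spatial element.
hourly = np.array([0, 0, 1, 3, 8, 9, 7, 4, 2, 1, 0, 0,
                   0, 1, 2, 5, 9, 8, 6, 3, 1, 0, 0, 0], dtype=float)
print(fft_features(hourly, n_bins=6))
```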

Hierarchical clustering of 1 million objects

Submitted by 坚强是说给别人听的谎言 on 2019-12-02 15:55:34
Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange. hcluster had trouble with 18k objects; Orange was able to cluster 18k objects in seconds, but failed with 100k objects (it saturated memory and eventually crashed). I am running on a 64-bit Xeon CPU (2.53 GHz) with 8 GB of RAM + 3 GB of swap on Ubuntu 11.10. denis answers: To beat O(n²), you'll have to first reduce your 1M points (documents) to e.g. 1000 piles of 1000 points each, or 100 piles of 10k each, or … Two possible approaches: build a hierarchical …
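A sketch of the answer's "piles" idea: a cheap O(n) pass with mini-batch k-means reduces the 1M points to ~1000 representative centers, and full hierarchical clustering then only has to handle those centers. Assumes scikit-learn and SciPy; the random matrix stands in for the real data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import MiniBatchKMeans

X = np.random.rand(1_000_000, 16)  # stand-in for the real 1M objects

# Stage 1: O(n) reduction -- 1M points down to 1000 representative centers.
kmeans = MiniBatchKMeans(n_clusters=1000, batch_size=10_000, n_init=3)
kmeans.fit(X)

# Stage 2: O(k^2) hierarchical clustering is now feasible on 1000 centers.
tree = linkage(kmeans.cluster_centers_, method="ward")
print(tree.shape)  # (999, 4): the full merge tree over the centers
```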

How can I find the center of a cluster of data points?

Submitted by 拈花ヽ惹草 on 2019-12-02 15:10:51
Let's say I plotted the position of a helicopter every day for the past year and came up with the following map: [map image] Any human looking at this would be able to tell me that this helicopter is based out of Chicago. How can I find the same result in code? I'm looking for something like this: $geoCodeArray = array([GET=http://pastebin.com/grVsbgL9]); function findHome($geoCodeArray) { // magic return $geoCode; } Ultimately generating something like this: [map image] UPDATE: Sample dataset. Here's a map with a sample dataset: http://batchgeo.com/map/c3676fe29985f00e1605cd4f86920179 Here's a pastebin of 150 …
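One robust candidate for the "magic" in `findHome` is the geometric median: the point minimizing the total distance to all observations, which, unlike a plain average of coordinates, is barely moved by a few long-range trips. A sketch of Weiszfeld's algorithm in NumPy, treating lat/lng as planar coordinates (fine for a city-scale spread, not for global data):

```python
import numpy as np

def find_home(points, iterations=100):
    """Geometric median via Weiszfeld's algorithm.

    Starts from the centroid and repeatedly re-weights each point by the
    inverse of its distance to the current guess, so far-away outliers
    contribute less and less.
    """
    guess = points.mean(axis=0)
    for _ in range(iterations):
        dist = np.linalg.norm(points - guess, axis=1)
        dist = np.maximum(dist, 1e-12)          # avoid division by zero
        weights = 1.0 / dist
        guess = (points * weights[:, None]).sum(axis=0) / weights.sum()
    return guess

# Mostly Chicago-area points plus one far-away trip:
pts = np.array([[41.88, -87.63], [41.90, -87.65], [41.87, -87.62],
                [41.89, -87.64], [34.05, -118.24]])   # last row is the outlier
print(find_home(pts))  # stays near Chicago (~41.9, -87.6)
```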

What makes the distance measure in k-medoid “better” than k-means?

Submitted by 不羁的心 on 2019-12-02 14:41:47
I am reading about the difference between k-means clustering and k-medoid clustering. Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the more familiar sum of squared Euclidean distances used to evaluate variance in k-means. And apparently this different distance metric somehow reduces the effect of noise and outliers. I have seen this claim, but I have yet to see any good reasoning about the mathematics behind it. What makes the pairwise distance measure commonly used in k-medoid better? More exactly, how does the …
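A tiny worked example of the noise/outlier claim: the mean (k-means' center) is dragged toward a single outlier, while the medoid, which must be an actual data point minimizing total pairwise distance, stays inside the mass of the data:

```python
import numpy as np

pts = np.array([1.0, 2.0, 3.0, 100.0])        # three inliers, one outlier

mean = pts.mean()                              # 26.5 -- pulled far from the data
costs = [np.abs(pts - p).sum() for p in pts]   # total pairwise distance per point
medoid = pts[int(np.argmin(costs))]            # 2.0 -- stays with the inliers
print(mean, medoid)
```

Note too that squared Euclidean error squares the outlier's contribution, so k-means is pulled even harder than this unsquared comparison suggests.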

R Random Forests Variable Importance

Submitted by 烂漫一生 on 2019-12-02 13:53:45
I am trying to use the random forests package for classification in R. The variable importance measures listed are: mean raw importance score of variable x for class 0; mean raw importance score of variable x for class 1; MeanDecreaseAccuracy; MeanDecreaseGini. Now, I know what these "mean", as in I know their definitions. What I want to know is how to use them. What I really want to know is what these values mean in terms of accuracy: what is a good value, what is a bad value, what are the maximums and minimums, etc.? If a variable has a high MeanDecreaseAccuracy or …
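The short answer to "what is a good value" is that these scores have no absolute scale: compare variables against each other by ranking, not against a fixed threshold. The question uses R, but as a hedged analog in Python (this digest's example language), scikit-learn's permutation importance captures the same idea as MeanDecreaseAccuracy: how much accuracy drops when one variable is scrambled:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
names = load_iris().feature_names

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Like MeanDecreaseAccuracy: permute one feature, measure the accuracy drop.
result = permutation_importance(rf, X, y, n_repeats=10, random_state=0)

# Interpret by ranking -- the absolute numbers depend on the dataset and model.
for name, score in sorted(zip(names, result.importances_mean),
                          key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```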