data-mining

Fuzzy c-means tcp dump clustering in MATLAB

↘锁芯ラ submitted on 2019-12-08 04:28:21
Question: Hi, I have some data that's represented like this: 0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal. It's from the KDD Cup 1999, which was based on the DARPA set. The text file I have has rows and rows of data like this. In MATLAB there is the generic clustering tool you can use by typing findcluster, but it only accepts .dat files. I'm also not very sure if it will accept a format like this. I'm also
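
A rough sketch only (not MATLAB's findcluster tool): parse the numeric fields of KDD Cup 1999 rows and run a small fuzzy c-means written in NumPy. The file name and the choice of columns to keep are assumptions; the categorical fields (protocol, service, flag) and the label would need encoding or dropping before clustering.

```python
# Minimal fuzzy c-means in NumPy; a sketch, not a drop-in replacement for
# the MATLAB tool mentioned in the question.
import numpy as np

def fuzzy_cmeans(X, c=5, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Return cluster centers and the fuzzy membership matrix U (n x c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)          # memberships sum to 1 per point
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # squared Euclidean distance of every point to every center
        D = np.maximum(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), 1e-12)
        # standard FCM membership update: u_ij = 1 / sum_k (D_ij/D_ik)^(1/(m-1))
        ratio = (D[:, :, None] / D[:, None, :]) ** (1.0 / (m - 1))
        U_new = 1.0 / ratio.sum(axis=2)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centers, U

if __name__ == "__main__":
    # Hypothetical load of the KDD file keeping only numeric columns:
    # X = np.genfromtxt("kddcup.data", delimiter=",",
    #                   usecols=[0] + list(range(4, 41)))
    X = np.random.rand(200, 5)                 # stand-in data for the demo
    centers, U = fuzzy_cmeans(X, c=3)
    print(U.argmax(axis=1)[:10])               # hard labels for the first 10 rows
```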

Intelligently grab first paragraph/starting text

ⅰ亾dé卋堺 submitted on 2019-12-07 16:17:52
Question: I'd like to have a script where I can input a URL and it will intelligently grab the first paragraph of the article... I'm not sure where to begin other than just pulling text from within <p> tags. Do you know of any tips/tutorials on how to do this kind of thing? Update: For further clarification, I'm building a section of my site where users can submit links, like on Facebook; it'll grab an image from their site as well as text to go with the link. I'm using PHP and trying to determine the
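
The question targets PHP; as a language-neutral sketch of the idea (fetch the page, parse it, return the first <p> that looks like article body rather than navigation), here it is in Python with requests and BeautifulSoup. The 80-character threshold and the URL are arbitrary assumptions.

```python
# Grab the first "real" paragraph: the first <p> whose text is long enough
# to plausibly be article body text.
import requests
from bs4 import BeautifulSoup

def first_paragraph(url, min_chars=80):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        text = p.get_text(strip=True)
        if len(text) >= min_chars:
            return text
    return None

print(first_paragraph("https://example.com/some-article"))  # hypothetical URL
```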

How to handle huge sparse matrix construction using Scipy?

♀尐吖头ヾ submitted on 2019-12-07 15:29:34
Question: So, I am working on a Wikipedia dump to compute the PageRanks of around 5,700,000 pages, give or take. The files are preprocessed and hence are not in XML. They are taken from http://haselgrove.id.au/wikipedia.htm and the format is: from_page(1): to(12) to(13) to(14).. from_page(2): to(21) to(22).. . . . from_page(5,700,000): to(xy) to(xz) and so on. So basically it's the construction of a [5,700,000*5,700,000] matrix, which would just break my 4 GB of RAM. Since it is very, very sparse, that
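
A sketch of building the link matrix without ever allocating the dense 5,700,000 x 5,700,000 array: collect (from, to) index pairs and hand them to scipy.sparse.coo_matrix, then convert to CSR for row operations. The parsing of "from_page(1): to(12) to(13) ..." is an assumption based on the format described in the question.

```python
import re
import numpy as np
from scipy.sparse import coo_matrix

def build_link_matrix(path, n_pages):
    rows, cols = [], []
    with open(path) as f:
        for line in f:
            # strip thousands separators, then pull out all integers on the line
            nums = [int(x) for x in re.findall(r"\d+", line.replace(",", ""))]
            if not nums:
                continue
            src, targets = nums[0], nums[1:]
            rows.extend([src - 1] * len(targets))   # assume 1-based page ids
            cols.extend(t - 1 for t in targets)
    data = np.ones(len(rows), dtype=np.float32)
    return coo_matrix((data, (rows, cols)), shape=(n_pages, n_pages)).tocsr()

# In CSR form, a 5.7M x 5.7M matrix with ~100M nonzero links takes on the
# order of a gigabyte, not the petabytes a dense array would need.
```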

cspade() R Error

孤街浪徒 submitted on 2019-12-07 13:51:29
Question: I am trying to mine rules from the events of cable modems. Linked is one file of thousands. When I try to run the cspade algorithm on the merged file of all devices (12 million rows), it spends hours chewing through RAM until it uses all 64 GB I have available. So I attempted to run the algorithm on the linked file for just one device. I see the exact same thing happen. Since this subsample is only 2,190 rows, I thought this was strange. Can someone explain why I'm not seeing results in a

Techniques to display related content or articles

这一生的挚爱 submitted on 2019-12-07 13:02:32
Question: I've been trying to learn text mining and other related things in the Collective Intelligence field. I am interested in making an app which will scan through a document and show related posts/articles on the page. What algorithm(s) would be helpful to retrieve the required info? Thanks /A Answer 1: A simple method is to count the non-common words and their instances on the page. The more often a word shows up, the better it is at describing the content of the post. You can then use it to look up other articles/posts.
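
A minimal sketch of the answer's idea: weight uncommon words heavily, then rank other documents by how much vocabulary they share. TF-IDF plus cosine similarity (scikit-learn) is the usual off-the-shelf version; the example posts below are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "How to cluster text documents with k-means",       # made-up example posts
    "Clustering news articles using TF-IDF vectors",
    "Best chocolate cake recipe for beginners",
]
query = "Grouping similar documents by their text content"

vec = TfidfVectorizer(stop_words="english")
doc_matrix = vec.fit_transform(docs)
query_vec = vec.transform([query])

# rank existing posts by cosine similarity to the new one
scores = cosine_similarity(query_vec, doc_matrix).ravel()
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.2f}  {doc}")
```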

Persisting data in sklearn

橙三吉。 submitted on 2019-12-07 06:39:40
Question: I'm using scikit-learn to cluster text documents. I'm using the classes CountVectorizer, TfidfTransformer and MiniBatchKMeans to help me do that. New text documents are added to the system all the time, which means that I need to use the classes above to transform the text and predict a cluster. My question is: how should I store the data on disk? Should I simply pickle the vectorizer, transformer and kmeans objects? Should I just save the data? If so, how do I add it back to the vectorizer,
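
One common approach (a sketch, not necessarily the accepted answer) is to persist the fitted objects themselves with joblib and reload them to transform and predict on new documents. The file name and toy documents below are arbitrary.

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.cluster import MiniBatchKMeans

docs = ["first document", "second document about text", "third one"]

# fit the pipeline pieces once on the current corpus
vectorizer = CountVectorizer().fit(docs)
transformer = TfidfTransformer().fit(vectorizer.transform(docs))
kmeans = MiniBatchKMeans(n_clusters=2, n_init=3).fit(
    transformer.transform(vectorizer.transform(docs)))

# persist all three fitted objects together
joblib.dump((vectorizer, transformer, kmeans), "text_cluster_model.joblib")

# later, in another process: reload and predict for a new document
vectorizer, transformer, kmeans = joblib.load("text_cluster_model.joblib")
new_doc = ["a brand new document"]
label = kmeans.predict(transformer.transform(vectorizer.transform(new_doc)))
print(label)
```

MiniBatchKMeans also supports partial_fit, so the reloaded model can keep learning from new batches rather than being refit from scratch.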

Search twitter and obtain tweets by hashtag, maximizing number of returned search results

拜拜、爱过 submitted on 2019-12-07 06:33:23
Question: I am attempting to compile a corpus of all Tweets related to the World Cup on Twitter from their API using the twitteR package in R. I am using the following code for a single hashtag (for example). However, my problem is that it appears I am only 'authorized' to access a limited set of the tweets (in this case, only the 32 most recent). library(twitteR) reqURL <- "https://api.twitter.com/oauth/request_token" accessURL <- "https://api.twitter.com/oauth/access_token" authURL <- "http://api

How to get several columns from BigQuery?

≯℡__Kan透↙ submitted on 2019-12-07 03:13:19
I am querying the github public dataset on BigQuery. Currently, my best query for what I need looks like the following. SELECT type, created_at, repository_name FROM [githubarchive:github.timeline] WHERE (created_at CONTAINS '2012-') AND repository_owner="twitter" ORDER BY created_at, repository_name; This gives me all the events ("type") from the repository_owner twitter (or any other user) for all the repositories ("repository_name") that this user owns, but in a single column. However, what I really want is to have all the events ("type") in columns, one column for each repository (
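
Pivoting one column's values into separate columns is awkward in the legacy BigQuery SQL shown above; as a language-neutral sketch of the reshaping being asked for, the long-format result (type, created_at, repository_name) can be pivoted client-side with pandas. The sample rows below are made up; in practice they would come from the BigQuery result set.

```python
import pandas as pd

rows = pd.DataFrame({
    "created_at":      ["2012-01-01", "2012-01-01", "2012-01-02"],
    "repository_name": ["bootstrap",  "finagle",     "bootstrap"],
    "type":            ["PushEvent",  "IssuesEvent", "WatchEvent"],
})

# one column per repository, events joined into a string per date
wide = rows.pivot_table(index="created_at",
                        columns="repository_name",
                        values="type",
                        aggfunc=lambda s: ", ".join(s))
print(wide)
```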

Calculate similarity between list of words

走远了吗. submitted on 2019-12-06 16:24:08
Question: I want to calculate the similarity between two lists of words, for example: ['email','user','this','email','address','customer'] is similar to this list: ['email','mail','address','netmail']. I want it to have a higher percentage of similarity than another list, for example: ['address','ip','network'], even though 'address' exists in that list. Answer 1: Since you haven't really been able to demonstrate a crystal-clear expected output, here is my best shot: list_A = ['email','user','this','email','address','customer'] list
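
One possible take (not necessarily what the truncated answer goes on to do): expand each word with its WordNet synonyms so that 'mail' counts as a match for 'email', then score the two lists with Jaccard overlap on the expanded sets. Requires nltk.download('wordnet') to have been run once; the exact scores will depend on WordNet's coverage.

```python
from nltk.corpus import wordnet as wn

def expand(words):
    """Return the word set plus all WordNet synonyms of each word."""
    out = set()
    for w in words:
        out.add(w)
        for syn in wn.synsets(w):
            out.update(l.lower() for l in syn.lemma_names())
    return out

def similarity(list_a, list_b):
    a, b = expand(list_a), expand(list_b)
    return len(a & b) / len(a | b)              # Jaccard index on expanded sets

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
print(similarity(list_A, ['email', 'mail', 'address', 'netmail']))  # expected higher
print(similarity(list_A, ['address', 'ip', 'network']))             # expected lower
```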

Data Sets For Data Mining Tasks [closed]

断了今生、忘了曾经 submitted on 2019-12-06 14:06:46
Question: Closed. This question is off-topic and is not currently accepting answers. Closed 6 years ago. I am relatively new to the field of data mining. I am currently implementing some data preprocessing algorithms such as PCA and min-max normalization. Our professor said we could download the data sets available over the web. But at the initial level I want a simple data set with a relatively small number of attributes for