data-mining

Finding 2 & 3 word Phrases Using R TM Package

风流意气都作罢 submitted on 2019-11-27 00:10:17
I am trying to find code that actually works to extract the most frequently used two- and three-word phrases with R's text mining (tm) package (maybe there is another package for this that I do not know about). I have been trying to use the tokenizer, but seem to have no luck. If you have worked on a similar problem in the past, could you post code that is tested and actually works? Thank you so much! Answer: You can pass a custom tokenizing function to tm's DocumentTermMatrix function, so if you have the tau package installed it's fairly straightforward. library(tm); library(tau); tokenize_ngrams <- function(x, n=3)
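The R snippet in the excerpt is cut off, so as a rough sketch of the same task (counting the most frequent two- and three-word phrases) here is a Python version using scikit-learn's CountVectorizer; this is a swap from the R tm/tau approach in the excerpt, and the sample documents are invented for illustration.

```python
# Sketch: counting 2- and 3-word phrases with scikit-learn's CountVectorizer.
# This swaps the excerpt's R tm/tau approach for a Python equivalent; the
# sample documents below are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox is quick",
]

# ngram_range=(2, 3) extracts every 2- and 3-word phrase.
vectorizer = CountVectorizer(ngram_range=(2, 3))
counts = vectorizer.fit_transform(docs)

# Sum counts across documents and sort descending.
totals = counts.sum(axis=0).A1
phrases = sorted(zip(vectorizer.get_feature_names_out(), totals),
                 key=lambda pair: -pair[1])
for phrase, total in phrases[:5]:
    print(total, phrase)
```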

How does the Amazon Recommendation feature work?

我的梦境 submitted on 2019-11-26 23:44:29
Question: What technology goes on behind the scenes of Amazon's recommendation engine? I believe Amazon's recommendations are currently the best on the market, but how do they produce such relevant suggestions? We have recently been involved in a similar recommendation project, but would certainly like to know the ins and outs of Amazon's recommendation technology from a technical standpoint. Any input would be highly appreciated. Update: This patent explains how
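Amazon has publicly described its approach as item-to-item collaborative filtering (Linden, Smith & York, 2003): compute similarities between items from user purchase/rating behavior, then recommend items similar to those a user already liked. A minimal sketch of that idea, with an invented ratings matrix:

```python
# Minimal sketch of item-to-item collaborative filtering, the approach
# Amazon described in Linden, Smith & York (2003). The ratings matrix
# below is invented for illustration (rows = users, columns = items).
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

# Cosine similarity between item column vectors.
norms = np.linalg.norm(ratings, axis=0)
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

# To recommend for a user: score unseen items by their similarity to
# items the user already rated highly.
user = ratings[0]
scores = item_sim @ user
scores[user > 0] = -np.inf          # don't re-recommend seen items
print("recommend item", int(np.argmax(scores)))
```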

Scraping a webpage with C# and HTMLAgility

﹥>﹥吖頭↗ submitted on 2019-11-26 21:27:05
Question: I have read that HTMLAgility 1.4 is a great solution for scraping a webpage. Being a new programmer, I am hoping I can get some input on this project. I am doing this as a C# Windows Forms application. The page I am working with is fairly straightforward: the information I need sits between just two tags. My goal is to pull the data for Part-Num, Manu-Number, Description, Manu-Country, Last Modified, and Last Modified By out of the page and send it to a SQL table. One twist is that there is
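The question itself is about C#'s HtmlAgilityPack, but the underlying pattern (fetch page, select nodes, pull field text, hand off to SQL) looks the same in any language. A sketch in Python, the language used for the other examples here; the URL and the label/value table layout are hypothetical stand-ins for the asker's page:

```python
# Sketch of the scrape-and-extract pattern in Python (the question itself
# is about C#'s HtmlAgilityPack). The URL and table layout are hypothetical.
import requests
from bs4 import BeautifulSoup

html = requests.get("http://example.com/part/12345").text
soup = BeautifulSoup(html, "html.parser")

fields = ["Part-Num", "Manu-Number", "Description",
          "Manu-Country", "Last Modified", "Last Modified By"]
record = {}
for row in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    # Assumes label/value pairs per row, e.g. <td>Part-Num</td><td>ABC-1</td>
    if len(cells) == 2 and cells[0] in fields:
        record[cells[0]] = cells[1]

print(record)  # insert into SQL from here, e.g. with pyodbc
```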

Can someone give an example of cosine similarity, in a very simple, graphical way?

梦想与她 submitted on 2019-11-26 19:16:19
Cosine Similarity article on Wikipedia: can you show the vectors here (in a list or something) and then do the math, and let us see how it works? I'm a beginner. Answer: Here are two very short texts to compare:

Julie loves me more than Linda loves me
Jane likes me more than Julie loves me

We want to know how similar these texts are, purely in terms of word counts (and ignoring word order). We begin by making a list of the words from both texts:

me, Julie, loves, Linda, than, more, likes, Jane

Now we count the number of times each of these words appears in each text:

me: 2 2
Jane: 0 1
Julie: 1 1
Linda: 1 0
likes
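The excerpt's count table is cut off, so here is a short sketch that finishes the arithmetic: build the two count vectors over the shared word list and apply cos θ = a·b / (|a| |b|).

```python
# Completes the worked example: build count vectors for the two sentences
# over the shared word list, then apply cos(theta) = a.b / (|a| |b|).
import math
from collections import Counter

t1 = "Julie loves me more than Linda loves me".lower().split()
t2 = "Jane likes me more than Julie loves me".lower().split()

words = sorted(set(t1) | set(t2))
c1, c2 = Counter(t1), Counter(t2)
a = [c1[w] for w in words]
b = [c2[w] for w in words]

dot = sum(x * y for x, y in zip(a, b))          # 9 for these two texts
cos = dot / (math.hypot(*a) * math.hypot(*b))   # 9 / (sqrt(12) * sqrt(10))
print(round(cos, 3))                            # ~0.822: similar, not identical
```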

scikit-learn DBSCAN memory usage

£可爱£侵袭症+ submitted on 2019-11-26 18:39:37
UPDATE: In the end, the solution I opted to use for clustering my large dataset was the one suggested by Anony-Mousse below: using ELKI's DBSCAN implementation to do my clustering rather than scikit-learn's. It can be run from the command line and, with proper indexing, performs this task within a few hours. Use the GUI and small sample datasets to work out the options you want to use, then go to town. Worth looking into. Anyhow, read on for a description of my original problem and some interesting discussion. I have a dataset with ~2.5 million samples, each with 35 features (floating
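Besides switching to ELKI, one memory-saving route that stays within scikit-learn (a sketch of a common workaround, not the answerer's exact method) is to precompute a sparse radius-neighbors graph and run DBSCAN on it with metric='precomputed', so no dense pairwise-distance matrix is ever built. The eps and min_samples values below are placeholders:

```python
# Sketch: cut scikit-learn DBSCAN's memory use by precomputing a *sparse*
# radius-neighbors graph instead of letting DBSCAN materialize a dense
# pairwise-distance matrix. eps/min_samples here are placeholder values.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

X = np.random.rand(10000, 35)          # stand-in for the ~2.5M x 35 data
eps = 0.5

nn = NearestNeighbors(radius=eps).fit(X)
graph = nn.radius_neighbors_graph(X, mode="distance")  # sparse CSR matrix

labels = DBSCAN(eps=eps, min_samples=10,
                metric="precomputed").fit_predict(graph)
print(np.unique(labels))               # -1 marks noise points
```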

Speed-efficient classification in Matlab

走远了吗. submitted on 2019-11-26 17:51:43
I have an RGB uint8(576,720,3) image in which I want to classify each pixel to a set of colors. I have converted from RGB to LAB space using rgb2lab and then removed the L layer, so the image is now a double(576,720,2) consisting of AB. Now I want to classify these pixels against some colors that I have trained on another image, whose respective AB representations I calculated as:

Cluster 1: -17.7903 -13.1170
Cluster 2: -30.1957 40.3520
Cluster 3: -4.4608 47.2543
Cluster 4: 46.3738 36.5225
Cluster 5: 43.3134 -17.6443
Cluster 6: -0.9003 1.4042
Cluster 7: 7.3884 11.5584

Now, in order to classify/label
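The question is about Matlab, but the underlying operation is a per-pixel nearest-centroid lookup, which is easy to see in a NumPy sketch (the language used for the other examples here) using the cluster centers quoted above; the image data is a random placeholder:

```python
# Sketch of per-pixel nearest-centroid labeling in NumPy (the question
# itself is about Matlab). `ab` stands in for the 576x720x2 AB image.
import numpy as np

centers = np.array([
    [-17.7903, -13.1170],
    [-30.1957,  40.3520],
    [ -4.4608,  47.2543],
    [ 46.3738,  36.5225],
    [ 43.3134, -17.6443],
    [ -0.9003,   1.4042],
    [  7.3884,  11.5584],
])

ab = np.random.randn(576, 720, 2) * 20          # placeholder image data
pixels = ab.reshape(-1, 2)                      # (N, 2)

# Squared Euclidean distance from every pixel to every center, then argmin.
d2 = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
labels = d2.argmin(axis=1).reshape(576, 720)    # label map, values 0..6
print(np.bincount(labels.ravel()))              # pixels per cluster
```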

Making very specific time requests (to the second) on Twitter API, using Python Tweepy?

天大地大妈咪最大 submitted on 2019-11-26 17:07:32
Question: I would like to request tweets on a specific topic (for example, "cancer") using Python Tweepy, but a search's time range can usually only be specified to the day, for example:

startSince = '2014-10-01'
endUntil = '2014-10-02'
for tweet in tweepy.Cursor(api.search, q="cancer", since=startSince, until=endUntil).items(999999999):

Is there a way to specify the time so I can collect "cancer" tweets between 2014-10-01 00:00:00 and 2014-10-02 12:00:00? This is for my academic research: I was able to
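The search endpoint's since/until parameters only accept whole dates, so a common workaround (a sketch, not an official API feature) is to request the covering days and filter each tweet's created_at timestamp client-side; `api` below is assumed to be an already-authenticated tweepy.API object, as in the question's own code:

```python
# Sketch: since/until only accept whole dates, so request the covering days
# and filter on each tweet's created_at timestamp client-side. `api` is an
# already-authenticated tweepy.API object, as in the question.
from datetime import datetime
import tweepy

start = datetime(2014, 10, 1, 0, 0, 0)
end = datetime(2014, 10, 2, 12, 0, 0)

kept = []
for tweet in tweepy.Cursor(api.search, q="cancer",
                           since="2014-10-01", until="2014-10-03").items():
    if start <= tweet.created_at <= end:
        kept.append(tweet)
    elif tweet.created_at < start:
        break   # search results arrive newest-first, so we can stop early
print(len(kept), "tweets in the window")
```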

1D Number Array Clustering [duplicate]

≡放荡痞女 submitted on 2019-11-26 15:48:01
Possible duplicate: Cluster one-dimensional data optimally? So let's say I have an array like this: [1,1,2,3,10,11,13,67,71]. Is there a convenient way to partition it into something like [[1,1,2,3],[10,11,13],[67,71]]? I looked through similar questions, yet most people suggested using k-means to cluster the points (with scipy, for example), which is quite confusing to use for a beginner like me. Also, I think k-means is more suitable for clustering in two or more dimensions, right? Are there any ways to partition an array of N numbers into partitions/clusters depending on the numbers? Some
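For strictly one-dimensional data, a simple alternative to k-means (a sketch, with a gap threshold chosen by eye for this example) is to sort the values and start a new cluster wherever the gap between neighbors exceeds a threshold:

```python
# Sketch: for 1-D data you can often skip k-means entirely. Sort, then
# start a new cluster wherever the gap to the previous value exceeds a
# threshold (here 5, chosen by eye for this example).
def cluster_1d(values, max_gap=5):
    values = sorted(values)
    clusters = [[values[0]]]
    for v in values[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)   # close enough: same cluster
        else:
            clusters.append([v])     # big jump: open a new cluster
    return clusters

print(cluster_1d([1, 1, 2, 3, 10, 11, 13, 67, 71]))
# -> [[1, 1, 2, 3], [10, 11, 13], [67, 71]]
```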

Mixing categorical and continuous data in Naive Bayes classifier using scikit-learn

a 夏天 submitted on 2019-11-26 15:11:49
Question: I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Among other approaches, I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications") and continuous data (e.g. "Age", "Length of membership"). I haven't used scikit much before, but I gather that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can
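One commonly suggested workaround (a sketch of the idea, not an off-the-shelf scikit-learn feature) is to fit GaussianNB on the continuous columns and BernoulliNB on the binary categorical columns, then combine the two per-class posteriors; multiplying the two predict_proba outputs double-counts the class prior, so it must be divided out once. The data below is invented for illustration:

```python
# Sketch: fit GaussianNB on continuous columns and BernoulliNB on binary
# categorical columns, then combine per-class probabilities. Multiplying
# the two predict_proba outputs double-counts the class prior, so divide
# it out once before renormalizing. Data is invented for illustration.
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB

rng = np.random.default_rng(0)
X_cont = rng.normal(size=(200, 2))            # e.g. age, membership length
X_cat = rng.integers(0, 2, size=(200, 2))     # e.g. registered, accepts email
y = rng.integers(0, 2, size=200)              # gender label

g = GaussianNB().fit(X_cont, y)
b = BernoulliNB().fit(X_cat, y)

# posterior ∝ P(y) * P(x_cont | y) * P(x_cat | y)
prior = g.class_prior_
joint = g.predict_proba(X_cont) * b.predict_proba(X_cat) / prior
proba = joint / joint.sum(axis=1, keepdims=True)
pred = proba.argmax(axis=1)
```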

Kmeans without knowing the number of clusters? [duplicate]

这一生的挚爱 submitted on 2019-11-26 10:25:21
Question: This question already has an answer here: How do I determine k when using k-means clustering? (17 answers). I am attempting to apply k-means to a set of high-dimensional data points (about 50 dimensions) and was wondering whether there are any implementations that find the optimal number of clusters. I remember reading somewhere that the way an algorithm generally does this is to maximize the inter-cluster distance while minimizing the intra-cluster distance, but I don't remember where I saw
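The usual answer (a sketch of one of several standard heuristics, not a built-in k-means feature) is to sweep k and score each clustering, for example with the silhouette coefficient, which rewards exactly the trade-off described above: small intra-cluster and large inter-cluster distances. The data below is a random stand-in for the ~50-dimensional points:

```python
# Sketch: sweep k and keep the clustering with the best silhouette score,
# one standard heuristic for picking k. The silhouette rewards exactly the
# trade-off described above: tight clusters that are far apart.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(500, 50)    # stand-in for the ~50-dimensional points

best_k, best_score = None, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))
```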