data-mining | 易学教程

How does clustering (especially String clustering) work?

I have heard that clustering can group similar data, and I want to know how it works in the specific case of strings. I have a table with more than 100,000 different words. I want to identify the same word written with small differences (e.g. house, house!!, hooouse, HoUse, @house, "house"). What is needed to measure the similarity and group each word into a cluster? Which algorithm is recommended for this? To understand what clustering is, imagine a geographical map. You can see many distinct objects (such as houses). Some of them are close to each other, and others are far apart. Based on this, you can
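For the variants listed in the question, a cheap first pass is to normalize each word to a canonical key and group words that share a key. The sketch below is only an illustration under assumed rules (lowercase, keep ASCII letters only, collapse repeated letters); `normalize` and `group_variants` are hypothetical names, and this is a pre-processing step rather than a full clustering algorithm. Collapsing repeated letters also merges genuinely different words such as "good" and "god", so a real pipeline would likely follow it with an edit-distance comparison.

```python
import re
from collections import defaultdict


def normalize(word: str) -> str:
    """Reduce a noisy token to a canonical key (illustrative rules only)."""
    key = word.lower()
    key = re.sub(r"[^a-z]", "", key)     # drop punctuation, digits, symbols
    key = re.sub(r"(.)\1+", r"\1", key)  # collapse runs of a repeated letter
    return key


def group_variants(words):
    """Map each canonical key to the original spellings that produced it."""
    groups = defaultdict(list)
    for word in words:
        groups[normalize(word)].append(word)
    return dict(groups)


words = ["house", "house!!", "hooouse", "HoUse", "@house", '"house"', "mouse"]
groups = group_variants(words)
for key, variants in groups.items():
    print(key, variants)
```

Each key acts as a cluster label, so all six spellings of "house" end up in one group while "mouse" stays separate.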

How does the Amazon Recommendation feature work?

What technology works behind the scenes of Amazon's recommendation feature? I believe that Amazon's recommendations are currently the best in the market, but how do they provide such relevant recommendations? We have recently been involved in a similar kind of recommendation project, and would like to know the ins and outs of the Amazon recommendation technology from a technical standpoint. Any input would be highly appreciated. Update: This patent explains how personalized recommendations are done, but it is not very technical, and so it would be really nice if some

hclust size limit?

Question: I'm new to R. I'm trying to run hclust() on about 50K items. I have 10 columns to compare and 50K rows of data. When I tried to assign the distance matrix, I got: "Cannot allocate vector of 5GB". Is there a size limit? If so, how do I go about clustering something this large? EDIT: I ended up increasing the max.limit and increased the machine's memory to 8GB, and that seems to have fixed it. Answer 1: Classic hierarchical clustering approaches are O(n^3) in runtime and O(n^2) in
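The allocation error follows from the quadratic growth of the pairwise distance matrix. As a rough sketch, assuming the distances are stored as one 8-byte double per unordered pair (the lower triangle only), the required memory can be estimated as below; `dist_matrix_bytes` is a hypothetical helper, and the exact figure depends on the real row count and storage format.

```python
def dist_matrix_bytes(n_items: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for the n*(n-1)/2 pairwise distances of n items."""
    return n_items * (n_items - 1) // 2 * bytes_per_value


for n in (1_000, 10_000, 50_000):
    gigabytes = dist_matrix_bytes(n) / 1e9
    print(f"{n:>6} items -> {gigabytes:.2f} GB")
```

Doubling the number of rows roughly quadruples the memory, which is why sampling or a non-hierarchical method is usually considered for data of this size.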

K-Means Algorithm [duplicate]

Question: This question already has answers here. Closed 8 years ago. Possible duplicates: "How to optimal K in K - Means Algorithm" and "How do I determine k when using k-means clustering?" Can we decide on K from statistical measures such as standard deviation, mean, or variance? Or is there a simple method for choosing K in the k-means algorithm? Thanks in advance, Navin. Answer 1: If you explicitly want to use k-means, you could study the article describing x-means. When using an implementation of

find all two word phrases that appear in more than one row in a dataset

Question: We would like to run a query that returns two-word phrases that appear in more than one row. For example, take the string "Data Ninja". Since it appears in more than one row in our dataset, the query should return it. The query should find all such phrases across all rows in our dataset by checking every combination of two adjacent words (forming a phrase) in each row. These two-word combinations should come from the dataset we loaded into BigQuery. How can we
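The question asks for a BigQuery query; the sketch below is not that query, only a small in-memory Python illustration of the same logic: split each row into adjacent word pairs, record which rows each pair occurs in, and keep the pairs seen in more than one row. The function name `repeated_bigrams`, the sample rows, and the choice to lowercase and split on whitespace are all assumptions made for the example.

```python
from collections import defaultdict


def repeated_bigrams(rows):
    """Return two-word phrases that occur in more than one row."""
    rows_by_phrase = defaultdict(set)
    for row_id, text in enumerate(rows):
        tokens = text.lower().split()
        # Pair each token with its right-hand neighbour.
        for first, second in zip(tokens, tokens[1:]):
            rows_by_phrase[f"{first} {second}"].add(row_id)
    return sorted(p for p, ids in rows_by_phrase.items() if len(ids) > 1)


rows = [
    "Data Ninja joins the team",
    "Meet our new data ninja",
    "Ninja data is not a phrase here",
]
print(repeated_bigrams(rows))
```

Tracking row identifiers in a set, rather than counting occurrences, keeps a phrase that repeats within a single row from being reported.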

Scraping a webpage with C# and HTMLAgility

I have read that HTMLAgility 1.4 is a great solution for scraping a webpage. As a new programmer, I am hoping to get some input on this project. I am building it as a C# forms application. The page I am working with is fairly straightforward. The information I need is stuck between just 2 tags and . My goal is to pull the data for Part-Num, Manu-Number, Description, Manu-Country, Last Modified, and Last Modified By out of the page and send the data to a SQL table. One twist is that there is also a small PNG image that needs to be grabbed from the src="/partcode/number. I do not have any

What information can we access from the client? [closed]

Question: I'm trying to compile a list of information that is accessible via JavaScript, such as: geo-location, IP address, browser software, exit location, and entrance location. I understand that a user can alter any of this information and that its reliability is purely a matter of trust, but I am still interested in what other information can be mined from the client. Answer 1: Don't forget about: screen size, allowed cookies, allowed Java, mobile or desktop, and language. And here is a useful link with a data-mining demo: http:/

Clustering values by their proximity in python (machine learning?) [duplicate]

This question already has an answer here: "Cluster one-dimensional data optimally? [closed]" (1 answer) and "1D Number Array Clustering [duplicate]" (2 answers). I have an algorithm that runs on a set of objects. This algorithm produces a score value that dictates the differences between the elements in the set. The sorted output is something like this: [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]. If you lay these values out on a spreadsheet, you see that they make up groups: [1,1,5,6,1,5], [10,22,23,23], [50,51,51,52], [100,112,130], [500,512,600], [12000,12230]. Is there a way
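One simple way to group one-dimensional values by proximity is to sort them and start a new group whenever the gap to the previous value exceeds a threshold. The sketch below assumes a fixed absolute threshold; `cluster_by_gap` and `max_gap` are hypothetical names. Because the scores in the question span several orders of magnitude, a single fixed threshold will not reproduce the groups listed there, so a relative gap or a dedicated method such as Jenks natural breaks or kernel density estimation may fit better.

```python
def cluster_by_gap(values, max_gap):
    """Split sorted values into groups wherever a gap exceeds max_gap."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > max_gap:
            clusters.append([cur])    # gap too large: open a new group
        else:
            clusters[-1].append(cur)  # close enough: extend the current group
    return clusters


print(cluster_by_gap([1, 2, 3, 10, 11, 50], max_gap=3))
```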

Difference between classification and clustering in data mining? [closed]

Can someone explain the difference between classification and clustering in data mining? If you can, please give examples of both to convey the main idea. In general, in classification you have a set of predefined classes and want to know which class a new object belongs to. Clustering tries to group a set of objects and find whether there is some relationship between the objects. In the context of machine learning, classification is supervised learning and clustering is unsupervised learning. Also have a look at Classification and Clustering at Wikipedia. Please read the

How to approach a number guessing game (with a twist) algorithm?

Question: I am learning programming (Python and algorithms) and was trying to work on a project that I find interesting. I have created a few basic Python scripts, but I’m not sure how to approach a solution to a game I am trying to build. Here’s how the game will work: Users will be given items with a value. For example: Apple = 1, Pears = 2, Oranges = 3. They will then get a chance to choose any combination of them they like (e.g. 100 apples, 20 pears, and one orange). The only output the computer gets is the