How to read Mahout clustering output

扶醉桌前 提交于 2021-02-18 15:09:34

问题


I have run the k-Means clustering algorithm on the synthetic control data from the Mahout tutorial, and was wondering if someone could explain how to interpret the output. I ran clusterdump and received output that looks something like this (truncated to save space):

CL-592{n=57 c=30.726, 29.813...] r=[3.528, 3.597...]}
Weight : [props - optional]: Point:
1.0 : [distance=27.453962995925863]: [24.672, 35.261, 30.486...]
1.0 : [distance=27.675053294846002]: [25.592, 29.951, 34.188...]
1.0 : [distance=28.97727289419493]: [30.696, 32.667, 34.223...]
1.0 : [distance=21.999685652862784]: [32.702, 35.219, 30.143...]
...
CL-598{n=50 c=[29.611, 29.769...] r=[3.166, 3.561...]}
Weight : [props - optional]:  Point:
1.0 : [distance=27.266203490250472]: [27.679, 33.506, 23.594...]
1.0 : [distance=28.749781351838173]: [34.727, 28.325, 30.331...]
1.0 : [distance=32.635136046420186]: [27.758, 33.859, 29.879...]
1.0 : [distance=29.328974057024624]: [29.356, 26.793, 25.575...]

Could someone explain to me how to read this? From what I understand, CL-__ is a cluster ID, followed by n=number of points in the cluster, c=centroid as a vector, r=radius as a vector, and then each point in the cluster. Is this correct? Furthermore, how do I know which clustered point matches up with which input point? i.e. are the points described as a key-value pair where the key is some kind of ID for the point and the value is the vector? If not is there some way I can set it up so it is?


回答1:


I believe your interpretation of the data is correct (I've only been working with Mahout for ~3 weeks, so someone more seasoned should probably weigh in on this).

As far as linking points back to the input that created them I've used NamedVector, where the name is the key for the vector. When you read one of the generated points files (clusteredPoints) you can convert each row (point vector) back into a NamedVector and retrieve the name using .getName().

Update in response to comment

When you initially read your data into Mahout, you convert it into a collection of vectors with which you then write to a file (points) for use in the clustering algorithms later. Mahout gives you several Vector types which you can use, but they also give you access to a Vector wrapper class called NamedVector which will allow you to identify each vector.

For example, you could create each NamedVector as follows:

NamedVector nVec = new NamedVector(
    new SequentialAccessSparseVector(vectorDimensions), 
    vectorName
    );

Then you write your collection of NamedVectors to file with something like:

SequenceFile.Writer writer = new SequenceFile.Writer(...);
VectorWritable writable = new VectorWritable();

// the next two lines will be in a loop, but I'm omitting it for clarity
writable.set(nVec);
writer.append(new Text(nVec.getName()), nVec);

You can now use this file as input to one of the clustering algorithms.

After having run one of the clustering algorithms with your points file, it will have generated yet another points file, but it will be in a directory named clusteredPoints.

You can then read in this points file and extract the name you associated to each vector. It'll look something like this:

IntWritable clusterId = new IntWritable();
WeightedPropertyVectorWritable vector = new WeightedPropertyVectorWritable();

while (reader.next(clusterId, vector))
{
    NamedVector nVec = (NamedVector)vector.getVector();
    // you now have access to the original name using nVec.getName()
}



回答2:


Try to add the option -of CSV in clusterdump, you will have a more exploitable result for further treatment.




回答3:


I have the same problem,(using mahout 0.6).I am also a beginner. I need to display the documents in the form of clusters to the users. So i will need document names rather that words corresponding to clusters. I have been clustering the text documents from shell script.



来源:https://stackoverflow.com/questions/11848038/how-to-read-mahout-clustering-output

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!