Does Mahout provide a way to determine similarity between content?
I would like to produce content-based recommendations as part of a web application. I know Mahout
That is not entirely true. Mahout does not ship a ready-made content-based recommender, but it does have algorithms for computing similarities between items based on their content. One of the most popular approaches is TF-IDF weighting combined with cosine similarity. However, the computation is not done on the fly; it is done offline. You need Hadoop to compute the pairwise content-based similarities faster. The steps below are for Mahout 0.8; I am not sure whether they changed in 0.9.
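To make concrete what the offline pipeline computes, here is a small plain-Java sketch of TF-IDF weighting and cosine similarity (no Mahout or Hadoop involved; the class name and the tiny corpus are purely illustrative):

```java
import java.util.*;

public class TfIdfCosineDemo {

    // TF-IDF vector for one document relative to a corpus:
    // term frequency times log(corpusSize / documentFrequency).
    static Map<String, Double> tfidf(String[] doc, List<String[]> corpus) {
        Map<String, Double> tf = new HashMap<>();
        for (String term : doc) tf.merge(term, 1.0, Double::sum);
        Map<String, Double> vec = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            int df = 0;
            for (String[] d : corpus)
                if (Arrays.asList(d).contains(e.getKey())) df++;
            double idf = Math.log((double) corpus.size() / df);
            vec.put(e.getKey(), e.getValue() * idf);
        }
        return vec;
    }

    // Cosine similarity between two sparse vectors: dot product
    // divided by the product of the Euclidean norms.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<String[]> corpus = Arrays.asList(
                new String[]{"mahout", "hadoop", "similarity"},
                new String[]{"mahout", "recommender", "similarity"},
                new String[]{"cooking", "recipes"});
        double s = cosine(tfidf(corpus.get(0), corpus),
                          tfidf(corpus.get(1), corpus));
        System.out.println(s > 0 && s <= 1.0);  // the two Mahout docs overlap
    }
}
```

Mahout's pipeline below does the same thing, except the vectors are built and compared in parallel over HDFS instead of in memory.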
Step 1. You need to convert your text documents into sequence files. I lost the exact command for Mahout 0.8, but in 0.9 it is something like this (please check it for your version of Mahout):
$MAHOUT_HOME/bin/mahout seqdirectory
--input --output
Step 2. You need to convert your sequence files into sparse vectors like this:
$MAHOUT_HOME/bin/mahout seq2sparse \
-i \
-o \
-ow -chunk 100 \
-wt tfidf \
-x 90 \
-seq \
-ml 50 \
-md 3 \
-n 2 \
-nv \
-Dmapred.map.tasks=1000 -Dmapred.reduce.tasks=1000
where:
-ow overwrites the output directory if it already exists,
-chunk 100 is the dictionary chunk size in MB,
-wt tfidf selects TF-IDF weighting,
-x 90 discards terms that appear in more than 90% of the documents,
-seq outputs sequential-access sparse vectors,
-ml 50 is the minimum log-likelihood ratio for n-grams,
-md 3 ignores terms that appear in fewer than 3 documents,
-n 2 normalizes the vectors with the 2-norm (needed for cosine similarity),
-nv keeps named vectors.
Step 3. Create a matrix from the vectors:
$MAHOUT_HOME/bin/mahout rowid -i /tfidf-vectors/part-r-00000 -o
Step 4. Create a collection of similar docs for each row of the matrix above. This will generate the 50 most similar docs for each doc in the collection:
$MAHOUT_HOME/bin/mahout rowsimilarity -i /matrix -o -r --similarityClassname SIMILARITY_COSINE -m 50 -ess -Dmapred.map.tasks=1000 -Dmapred.reduce.tasks=1000
This will produce a file with the similarities between each item and its top 50 most similar items, based on content.
Now, to use this in your recommendation process, you need to read the file or load it into a database, depending on how many resources you have. I loaded it into main memory as a Collection. Here are two simple functions that did the job for me:
public static Collection<GenericItemSimilarity.ItemItemSimilarity> correlationMatrix(
        final File folder, TIntLongHashMap docIndex) throws IOException {
    Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix =
            new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    int n = 0;
    for (final File fileEntry : folder.listFiles()) {
        if (fileEntry.isFile() && fileEntry.getName().startsWith("part-r")) {
            SequenceFile.Reader reader =
                    new SequenceFile.Reader(fs, new Path(fileEntry.getAbsolutePath()), conf);
            IntWritable key = new IntWritable();
            VectorWritable value = new VectorWritable();
            while (reader.next(key, value)) {
                long itemID1 = docIndex.get(key.get());
                Iterator<Vector.Element> it = value.get().nonZeroes().iterator();
                while (it.hasNext()) {
                    Vector.Element next = it.next();
                    long itemID2 = docIndex.get(next.index());
                    double similarity = next.get();
                    // Clamp to [-1, 1] to guard against rounding drift
                    if (similarity < -1.0) {
                        similarity = -1.0;
                    } else if (similarity > 1.0) {
                        similarity = 1.0;
                    }
                    corrMatrix.add(new GenericItemSimilarity.ItemItemSimilarity(
                            itemID1, itemID2, similarity));
                }
            }
            reader.close();
            n++;
            logger.info("File " + fileEntry.getName()
                    + " read (" + n + "/" + folder.listFiles().length + ")");
        }
    }
    return corrMatrix;
}
public static TIntLongHashMap getDocIndex(String docIndex) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    TIntLongHashMap map = new TIntLongHashMap();
    SequenceFile.Reader docIndexReader =
            new SequenceFile.Reader(fs, new Path(docIndex), conf);
    IntWritable key = new IntWritable();
    Text value = new Text();
    while (docIndexReader.next(key, value)) {
        map.put(key.get(), Long.parseLong(value.toString()));
    }
    docIndexReader.close();
    return map;
}
At the end, in your recommendation class you call this:
TIntLongHashMap docIndex = ItemPairwiseSimilarityUtil.getDocIndex(filename);
Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix = ItemPairwiseSimilarityUtil.correlationMatrix(folder, docIndex);
Where filename is your docIndex filename and folder is the folder containing the item-similarity files. In the end, this is nothing more than item-item based recommendation.
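As a sketch of that last step (assuming you also have a Taste DataModel of user preferences; the "ratings.csv" filename and the parameters 10 and 5 are illustrative, not from the pipeline above), the loaded similarities can be plugged into Mahout's item-based recommender:

```java
// Sketch only: requires the Mahout Taste jars on the classpath.
DataModel model = new FileDataModel(new File("ratings.csv"));
// corrMatrix is the Collection<ItemItemSimilarity> built above
ItemSimilarity similarity = new GenericItemSimilarity(corrMatrix);
ItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

// Preference-based recommendations for one user
List<RecommendedItem> recs = recommender.recommend(userID, 10);

// Or, purely content-based: the items most similar to a given item
List<RecommendedItem> similar = recommender.mostSimilarItems(itemID, 5);
```

GenericItemSimilarity accepts the precomputed pairs directly, so no similarity is recomputed at request time; the only online cost is the lookup and scoring.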
Hope this helps.