Does Mahout provide a way to determine similarity between content?
I would like to produce content-based recommendations as part of a web application. I know Mahout
That is not entirely true. Mahout does not ship a ready-made content-based recommender, but it does have algorithms for computing similarities between items based on their content. One of the most popular approaches is TF-IDF weighting combined with cosine similarity. However, the computation is not done on the fly; it is done offline. You need Hadoop to compute the pairwise content-based similarities faster. The steps below are for Mahout 0.8; I am not sure whether they changed in 0.9.
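To make concrete what the offline pipeline computes, here is a small plain-Java sketch of TF-IDF weighting and cosine similarity (no Mahout or Hadoop involved; the class name and the tiny corpus are purely illustrative):

```java
import java.util.*;

public class TfIdfCosineDemo {

    // TF-IDF vector for one document relative to a corpus:
    // term frequency times log(corpusSize / documentFrequency).
    static Map<String, Double> tfidf(String[] doc, List<String[]> corpus) {
        Map<String, Double> tf = new HashMap<>();
        for (String term : doc) tf.merge(term, 1.0, Double::sum);
        Map<String, Double> vec = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            int df = 0;
            for (String[] d : corpus)
                if (Arrays.asList(d).contains(e.getKey())) df++;
            double idf = Math.log((double) corpus.size() / df);
            vec.put(e.getKey(), e.getValue() * idf);
        }
        return vec;
    }

    // Cosine similarity between two sparse vectors: dot product
    // divided by the product of the Euclidean norms.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<String[]> corpus = Arrays.asList(
                new String[]{"mahout", "hadoop", "similarity"},
                new String[]{"mahout", "recommender", "similarity"},
                new String[]{"cooking", "recipes"});
        double s = cosine(tfidf(corpus.get(0), corpus),
                          tfidf(corpus.get(1), corpus));
        System.out.println(s > 0 && s <= 1.0);  // the two Mahout docs overlap
    }
}
```

Mahout's pipeline below does the same thing, except the vectors are built and compared in parallel over HDFS instead of in memory.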
Step 1. You need to convert your text documents into sequence files. I lost the exact command for Mahout 0.8, but in 0.9 it is something like this (please check it for your version of Mahout):
$MAHOUT_HOME/bin/mahout seqdirectory
--input --output
Step 2. You need to convert your sequence files into sparse vectors like this:
$MAHOUT_HOME/bin/mahout seq2sparse \
-i \
-o \
-ow -chunk 100 \
-wt tfidf \
-x 90 \
-seq \
-ml 50 \
-md 3 \
-n 2 \
-nv \
-Dmapred.map.tasks=1000 -Dmapred.reduce.tasks=1000
where:
-ow overwrites the output directory if it already exists,
-chunk 100 is the dictionary chunk size in MB,
-wt tfidf selects TF-IDF weighting,
-x 90 discards terms that appear in more than 90% of the documents,
-seq outputs sequential-access sparse vectors,
-ml 50 is the minimum log-likelihood ratio for n-grams,
-md 3 ignores terms that appear in fewer than 3 documents,
-n 2 normalizes the vectors with the 2-norm (needed for cosine similarity),
-nv keeps named vectors.
Step 3. Create a matrix from the vectors:
$MAHOUT_HOME/bin/mahout rowid -i /tfidf-vectors/part-r-00000 -o
Step 4. Create a collection of similar docs for each row of the matrix above. This will generate the 50 most similar docs for each doc in the collection:
$MAHOUT_HOME/bin/mahout rowsimilarity -i /matrix -o -r --similarityClassname SIMILARITY_COSINE -m 50 -ess -Dmapred.map.tasks=1000 -Dmapred.reduce.tasks=1000
This will produce a file with the similarities between each item and its top 50 most similar items, based on content.
Now, to use this in your recommendation process, you need to read the file or load it into a database, depending on how many resources you have. I loaded it into main memory as a Collection. Here are two simple functions that did the job for me:
public static Collection<GenericItemSimilarity.ItemItemSimilarity> correlationMatrix(
        final File folder, TIntLongHashMap docIndex) throws IOException {
    Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix =
            new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    int n = 0;
    for (final File fileEntry : folder.listFiles()) {
        if (fileEntry.isFile() && fileEntry.getName().startsWith("part-r")) {
            SequenceFile.Reader reader =
                    new SequenceFile.Reader(fs, new Path(fileEntry.getAbsolutePath()), conf);
            IntWritable key = new IntWritable();
            VectorWritable value = new VectorWritable();
            while (reader.next(key, value)) {
                long itemID1 = docIndex.get(key.get());
                Iterator<Vector.Element> it = value.get().nonZeroes().iterator();
                while (it.hasNext()) {
                    Vector.Element next = it.next();
                    long itemID2 = docIndex.get(next.index());
                    double similarity = next.get();
                    // Clamp to [-1, 1] to guard against rounding drift
                    if (similarity < -1.0) {
                        similarity = -1.0;
                    } else if (similarity > 1.0) {
                        similarity = 1.0;
                    }
                    corrMatrix.add(new GenericItemSimilarity.ItemItemSimilarity(
                            itemID1, itemID2, similarity));
                }
            }
            reader.close();
            n++;
            logger.info("File " + fileEntry.getName()
                    + " read (" + n + "/" + folder.listFiles().length + ")");
        }
    }
    return corrMatrix;
}
public static TIntLongHashMap getDocIndex(String docIndex) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    TIntLongHashMap map = new TIntLongHashMap();
    SequenceFile.Reader docIndexReader =
            new SequenceFile.Reader(fs, new Path(docIndex), conf);
    IntWritable key = new IntWritable();
    Text value = new Text();
    while (docIndexReader.next(key, value)) {
        map.put(key.get(), Long.parseLong(value.toString()));
    }
    docIndexReader.close();
    return map;
}
At the end, in your recommendation class you call this:
TIntLongHashMap docIndex = ItemPairwiseSimilarityUtil.getDocIndex(filename);
Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix = ItemPairwiseSimilarityUtil.correlationMatrix(folder, docIndex);
Where filename is your docIndex filename and folder is the folder containing the item-similarity files. In the end, this is nothing more than item-item based recommendation.
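As a sketch of that last step (assuming you also have a Taste DataModel of user preferences; the "ratings.csv" filename and the parameters 10 and 5 are illustrative, not from the pipeline above), the loaded similarities can be plugged into Mahout's item-based recommender:

```java
// Sketch only: requires the Mahout Taste jars on the classpath.
DataModel model = new FileDataModel(new File("ratings.csv"));
// corrMatrix is the Collection<ItemItemSimilarity> built above
ItemSimilarity similarity = new GenericItemSimilarity(corrMatrix);
ItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

// Preference-based recommendations for one user
List<RecommendedItem> recs = recommender.recommend(userID, 10);

// Or, purely content-based: the items most similar to a given item
List<RecommendedItem> similar = recommender.mostSimilarItems(itemID, 5);
```

GenericItemSimilarity accepts the precomputed pairs directly, so no similarity is recomputed at request time; the only online cost is the lookup and scoring.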
Hope this helps.