mahout | 易学教程

Converting CSV to SequenceFile

I have a CSV file that I would like to convert to a SequenceFile, which I would ultimately use to create NamedVectors for a clustering job. I've been using the seqdirectory command to make a SequenceFile, and then feeding that output into seq2sparse with the -nv option to create NamedVectors. This seems to produce one big vector as output, but I want each line of my CSV to become its own NamedVector. Where am I going wrong?
Julian Ortega: The seqdirectory command treats every file as a document, so in reality you have only one document, and hence you get only one vector. To
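The point of the answer above is that the unit of input has to be one CSV line, not one file. A minimal sketch of that per-line step in plain Java follows. It deliberately uses only the JDK: the class names CsvRows and NamedRow are hypothetical, and the sketch assumes the first column holds the row name, the remaining columns are numeric, and no field contains a quoted comma. Wrapping each row in a Mahout NamedVector and writing it to a SequenceFile would be a separate step that is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CsvRows {

    /** Hypothetical holder for one CSV line: a name plus its numeric features. */
    static final class NamedRow {
        final String name;
        final double[] values;

        NamedRow(String name, double[] values) {
            this.name = name;
            this.values = values;
        }
    }

    /** Parses "name,v1,v2,..." (assumption: first column is the name, no quoted commas). */
    static NamedRow parseLine(String line) {
        String[] parts = line.split(",");
        double[] values = new double[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            values[i - 1] = Double.parseDouble(parts[i].trim());
        }
        return new NamedRow(parts[0].trim(), values);
    }

    /** Produces one NamedRow per non-blank line, so each line becomes its own record. */
    static List<NamedRow> parseAll(List<String> lines) {
        List<NamedRow> rows = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                rows.add(parseLine(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<NamedRow> rows = parseAll(Arrays.asList("a,1.0,2.0", "b,3.5,4.5"));
        for (NamedRow row : rows) {
            System.out.println(row.name + " has " + row.values.length + " features");
        }
    }
}
```

Each NamedRow here corresponds to what the question wants as a single named vector; the name would become the vector's name and the values its elements.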

Apache Mahout: A Scalable Machine Learning Framework for Everyone

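In the software world, two years can feel like an eternity. Over the past two years we have seen the meteoric rise of social media, the commoditization of large-scale cluster computing (thanks to players like Amazon and RackSpace), and rapid growth in data along with a marked improvement in our ability to make sense of it. It has also been two years since "Introducing Apache Mahout" was first published on developerWorks. Since then, the Mahout community (and the project's code base and capabilities) has grown considerably. Mahout has also been actively adopted by companies of all sizes around the world. In Introducing Apache Mahout, I covered many machine learning concepts and the basics of using the suite of algorithms Mahout provides. The concepts I presented there remain valid, but the algorithm suite has changed significantly. Rather than revisit the basics, this article focuses on Mahout's current state and on how to scale Mahout across a compute cluster using Amazon's EC2 service and a data set of 7 million email documents. For a refresher on the basics, see the Resources section, in particular the book Mahout in Action. I also assume the reader has basic knowledge of Apache Hadoop and the Map-Reduce paradigm. (For more on Hadoop, see the Resources section.)
Mahout's current state
Mahout has come a long way in a very short time. The project's focus can still be summed up in what I call the "3 key points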

Problems Encountered Installing Mahout on Linux

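1. Mahout 0.5 includes taste-web; later versions do not.
2. Compiling Mahout (0.5 through 0.7) with JDK 1.7 hits a bug and the build fails. The error appears to concern an unimplemented interface method, but the interface merely shares its name with one bundled in the JDK. (It was too long ago; I have forgotten the exact error.) To fix it, download the patch (https://issues.apache.org/jira/browse/MAHOUT-782) and, on Linux, run patch with the source file and the patch file.
Source: oschina. Link: https://my.oschina.net/u/268089/blog/138817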

Is Hadoop 2.2.0 compatible with Mahout 0.8?

I have a Hadoop 2.2.0 cluster running with Mahout 0.8. Are they compatible? Whenever I run this command:
    bin/mahout recommenditembased --input mydata.dat --usersFile user.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
I get this error:
    Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
        at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
        at org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614)
        at

An Introduction to the Top 10 Open-Source Recommender Systems

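Recommender systems have been especially popular over the last two years. This article collects some of the better open-source recommender systems, both lightweight ones suited to research, such as SVDFeature, LibMF, and LibFM, and heavyweight ones suited to industrial systems, such as Mahout, Oryx, and EasyRecd, for your reference. PS: the "top 10" here reflects only my personal opinion.
#1. SVDFeature
Home page: http://svdfeature.apexlab.org/wiki/Main_Page
Language: C++
A feature-based collaborative filtering and ranking tool developed by the Apex Lab at Shanghai Jiao Tong University, with fairly high code quality. It took first place in KDD Cup 2012 and third place in KDD Cup 2011, and the related paper was published in JMLR in 2012, which speaks to its standing. SVDFeature includes a very flexible matrix factorization recommendation framework that makes it easy to implement methods such as SVD and SVD++, and it is among the most accurate single-model recommendation algorithms. Its code is concise and can run fairly large single-machine matrix factorization with relatively little memory. It also includes a logistic regression model, which is convenient for ensembling.
#2. LibMF
Home page: http://www.csie.ntu.edu.tw/~cjlin/libmf/
Language: C++
The author, Chih-Jen Lin, is from the well-known National Taiwan University; his group is renowned in the machine learning field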

How to directly send the output of a mapper-reducer to another mapper-reducer without saving the output to HDFS

Question: Problem solved eventually; see my solution at the bottom. Recently I have been trying to run the recommender example in chapter 6 (listings 6.1 to 6.4) of Mahout in Action. I ran into a problem and have searched around, but I can't find a solution. Here is the problem: I have a mapper-reducer pair:
    public final class WikipediaToItemPrefsMapper extends Mapper<LongWritable, Text, VarLongWritable, VarLongWritable> {
      private static final Pattern NUMBERS = Pattern.compile("(\\d+)");

Is it possible to use Apache Mahout without a Hadoop dependency?

Is it possible to use Apache Mahout without any dependency on Hadoop? I would like to use the Mahout algorithms on a single computer by including only the Mahout library in my Java project. I don't want to use Hadoop at all, since I will be running on a single node anyway. Is that possible?
Yes. Not all of Mahout depends on Hadoop, though much of it does. If you use a piece that depends on Hadoop, then of course you need Hadoop. But there is, for example, a substantial recommender engine code base that does not use Hadoop. You can also embed a local Hadoop cluster/worker in a Java program.
Definitely, yes.

Clustering — Sparse vector and Dense Vector

Question: For clustering, Mahout input needs to be in vector form. There are two types of vector implementation: sparse vectors and dense vectors. What is the difference between the two? What are the usage scenarios for each?
Answer 1: Conceptually, most of the values in a sparse vector are zero, while in a dense vector they are not. The same holds for dense and sparse matrices. The terms sparse and dense describe these properties in general, not only in Mahout. In Mahout the DenseVector assumes not too many zero
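The storage difference described in the answer can be illustrated without Mahout at all. The sketch below is plain JDK Java, not Mahout's DenseVector or sparse vector classes: the class SparseVsDense and its methods are hypothetical names. It stores a dense vector as an array holding every element, and a sparse vector as a map holding only the non-zero elements, then computes the same dot product both ways.

```java
import java.util.HashMap;
import java.util.Map;

final class SparseVsDense {

    /** Dense representation: every index is stored, zeros included. */
    static double dotDense(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Sparse representation: only non-zero entries are kept, keyed by index. */
    static Map<Integer, Double> toSparse(double[] dense) {
        Map<Integer, Double> sparse = new HashMap<>();
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) {
                sparse.put(i, dense[i]);
            }
        }
        return sparse;
    }

    /** Sparse dot product: visits only the non-zero entries of the smaller map. */
    static double dotSparse(Map<Integer, Double> a, Map<Integer, Double> b) {
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> entry : small.entrySet()) {
            Double other = large.get(entry.getKey());
            if (other != null) {
                sum += entry.getValue() * other;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] x = {0, 0, 3, 0, 5};
        double[] y = {1, 0, 2, 0, 0};
        System.out.println("dense dot:  " + dotDense(x, y));
        System.out.println("sparse dot: " + dotSparse(toSparse(x), toSparse(y)));
        System.out.println("stored entries in sparse x: " + toSparse(x).size());
    }
}
```

The trade-off is the usual one: the array costs memory proportional to the full dimension but has cheap indexed access, while the map costs memory proportional to the number of non-zeros, which tends to pay off for data such as term vectors where most entries are zero.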

Classify data using Apache Mahout

Question: I am trying to solve a simple classification problem. The problem: I have a set of texts and I have to categorize them based on their content. Solution using Mahout: I understand that I have to convert the input to a sequence file to generate the model, and I was able to do this. Now, how do I categorize my test data? The 20News example only tests for correctness, but I want to do the actual classification. I am not sure if I need to write code or use some existing classes available to
