mahout

Does Mahout provide a way to determine similarity between content (for content-based recommendations)?

安稳与你 submitted on 2019-11-30 16:31:49
Does Mahout provide a way to determine similarity between content? I would like to produce content-based recommendations as part of a web application. I know Mahout is good at taking user-ratings matrices and producing recommendations based on them, but I am not interested in collaborative (ratings-based) recommendations. I want to score how well two pieces of text match, and then recommend items that match most closely to text that I store for users in their user profile... I've read Mahout's documentation, and it looks like it facilitates mainly the collaborative (ratings-based)…
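
One way to compute such a score with Mahout's own classes is to build term vectors for the two texts and compare them with CosineDistanceMeasure from the mahout-math/mahout-core modules. A minimal sketch, assuming a toy in-memory dictionary and raw term counts rather than TF-IDF weighting (the class name, cardinality, and sample texts are illustrative):

import java.util.HashMap;
import java.util.Map;
import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Sketch: score two texts by cosine similarity over raw term counts.
public class TextSimilarity {
  private static final int CARDINALITY = 1 << 16;          // feature space size (assumption)
  private static final Map<String, Integer> DICT = new HashMap<>();

  // Map each distinct term to a stable index in the vector.
  private static int index(String term) {
    return DICT.computeIfAbsent(term, t -> DICT.size());
  }

  private static Vector toVector(String text) {
    Vector v = new RandomAccessSparseVector(CARDINALITY);
    for (String term : text.toLowerCase().split("\\W+")) {
      int i = index(term);
      v.set(i, v.get(i) + 1);                              // raw term frequency
    }
    return v;
  }

  public static void main(String[] args) {
    Vector a = toVector("mahout builds scalable machine learning libraries");
    Vector b = toVector("scalable machine learning with mahout");
    // CosineDistanceMeasure returns a distance; similarity = 1 - distance.
    double distance = new CosineDistanceMeasure().distance(a, b);
    System.out.println("cosine similarity = " + (1.0 - distance));
  }
}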

Vectorization in Apache Mahout

跟風遠走 submitted on 2019-11-30 15:38:30
I am new to Mahout. I have a requirement to convert a text file to a vector for classification at a later stage. Could anybody shed some light on the questions below? How do I convert a text file to a vector in Mahout? The file format is like "username|comment about item|rating". The data will be a few TB, so which algorithm implementation can I use for classification with the vectors I am supposed to create? Thanks, Arun. Julian Ortega: You can check these two examples, which also show how to use the Sequence File API: here and here. And you should definitely read this intro to text…
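
One way to sidestep building a dictionary over a few TB of text is Mahout's hashed feature encoders, which map tokens straight into a fixed-width sparse vector. A minimal sketch for the "username|comment|rating" format, with the class name, field names, and cardinality chosen for illustration:

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

// Sketch: turn one "username|comment about item|rating" line into a sparse vector.
public class LineVectorizer {
  private static final int CARDINALITY = 1 << 14;  // hashed feature space (assumption)
  private final StaticWordValueEncoder commentWords = new StaticWordValueEncoder("comment");
  private final ConstantValueEncoder ratingField = new ConstantValueEncoder("rating");

  public Vector vectorize(String line) {
    String[] parts = line.split("\\|");            // username | comment | rating
    Vector v = new RandomAccessSparseVector(CARDINALITY);
    for (String token : parts[1].toLowerCase().split("\\s+")) {
      commentWords.addToVector(token, v);          // hashed bag-of-words feature
    }
    // encode the numeric rating as a single weighted feature
    ratingField.addToVector((String) null, Double.parseDouble(parts[2]), v);
    return v;
  }
}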

K-means with really large matrix

 ̄綄美尐妖づ submitted on 2019-11-30 13:09:25
I have to perform k-means clustering on a really huge matrix (about 300,000 x 100,000 values, which is more than 100 GB). I want to know whether I can use R or Weka to do this. My computer is a multiprocessor with 8 GB of RAM and hundreds of GB of free disk space. I have enough space for the calculations, but loading such a matrix seems to be a problem with R (I don't think the bigmemory package would help me, and a big.matrix automatically uses all my RAM and then my swap file if there is not enough space). So my question is: what software should I use (possibly in association with some other packages…
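
For scale, note that k-means never needs the whole matrix in memory at once, only the k centroids, which is why out-of-core tools can handle data far larger than RAM. A minimal single-machine sketch of that idea using online MacQueen-style updates (the file name, sizes, and whitespace-separated row format are assumptions, not anything from the question):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch: one online k-means pass over a file of whitespace-separated rows,
// holding only the k centroids (k x dims doubles) in memory, never the matrix.
public class OnlineKMeans {
  public static void main(String[] args) throws IOException {
    int k = 10, dims = 100_000;                    // illustrative sizes
    double[][] centroids = new double[k][dims];
    long[] counts = new long[k];

    try (BufferedReader in = new BufferedReader(new FileReader("matrix.txt"))) {
      String line;
      int seeded = 0;
      while ((line = in.readLine()) != null) {
        String[] cols = line.trim().split("\\s+");
        double[] row = new double[dims];
        for (int j = 0; j < dims && j < cols.length; j++) {
          row[j] = Double.parseDouble(cols[j]);
        }
        if (seeded < k) {                          // use the first k rows as seeds
          centroids[seeded] = row;
          counts[seeded++] = 1;
          continue;
        }
        int best = 0;                              // find nearest centroid (squared distance)
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < k; c++) {
          double d = 0;
          for (int j = 0; j < dims; j++) {
            double diff = row[j] - centroids[c][j];
            d += diff * diff;
          }
          if (d < bestDist) { bestDist = d; best = c; }
        }
        counts[best]++;                            // MacQueen update: move the winning
        double eta = 1.0 / counts[best];           // centroid toward the point by 1/n
        for (int j = 0; j < dims; j++) {
          centroids[best][j] += eta * (row[j] - centroids[best][j]);
        }
      }
    }
    System.out.println("done; centroid 0 starts with " + centroids[0][0]);
  }
}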

Full utilization of all cores in Hadoop pseudo-distributed mode

女生的网名这么多〃 submitted on 2019-11-30 07:30:26
Question: I am running a task in pseudo-distributed mode on my 4-core laptop. How can I ensure that all cores are effectively used? Currently my job tracker shows that only one job is executing at a time. Does that mean only one core is used? The following are my configuration files. conf/core-site.xml: <configuration> <property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> </property> </configuration> conf/hdfs-site.xml: <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> </configuration> conf/mapred-site.xml: <configuration> <property> <name>mapred…
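
For context: one running job can still use all cores, because parallelism comes from concurrent tasks, not jobs. In classic Hadoop 1.x the per-TaskTracker slot limits default to 2 concurrent map and 2 concurrent reduce tasks, so raising them is the usual fix. A sketch of the relevant conf/mapred-site.xml properties (the value 4 assumes one slot per core on the 4-core laptop above, and that the input splits into enough map tasks):

<configuration>
  <!-- allow up to 4 concurrent map tasks on this TaskTracker (one per core) -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- allow up to 4 concurrent reduce tasks as well -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>
  </property>
</configuration>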

Mahout: To read a custom input file

自闭症网瘾萝莉.ら submitted on 2019-11-30 05:08:49
Question: I was playing with Mahout and found that FileDataModel accepts data in the format userId,itemId,pref (long,long,double). I have some data in the format String,long,double. What is the best/easiest way to work with this dataset in Mahout? Answer 1: One way to do this is by creating an extension of FileDataModel. You'll need to override the readUserIDFromString(String value) method to use some kind of resolver to do the conversion. You can use one of the implementations of IDMigrator, as…
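
A minimal sketch of that approach, assuming the hypothetical class name StringUserDataModel and Mahout's MemoryIDMigrator as the resolver (the migrator is static because FileDataModel's constructor parses the file before instance fields would be initialized):

import java.io.File;
import java.io.IOException;
import org.apache.mahout.cf.taste.impl.model.MemoryIDMigrator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;

// Sketch: a FileDataModel that accepts String user IDs by hashing them to longs.
public class StringUserDataModel extends FileDataModel {
  // static so it exists before the superclass constructor reads the data file
  private static final MemoryIDMigrator MIGRATOR = new MemoryIDMigrator();

  public StringUserDataModel(File dataFile) throws IOException {
    super(dataFile);
  }

  @Override
  protected long readUserIDFromString(String value) {
    long longID = MIGRATOR.toLongID(value);   // 64-bit hash of the String ID
    MIGRATOR.storeMapping(longID, value);     // keep the reverse mapping
    return longID;
  }

  // recover the original String ID when presenting recommendations
  public static String originalUserID(long longID) {
    return MIGRATOR.toStringID(longID);
  }
}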

Java's Mahout equivalent in Python

佐手、 submitted on 2019-11-29 20:42:53
Java-based Mahout's goal is to build scalable machine learning libraries. Are there any equivalent libraries in Python? scikit-learn is highly recommended: http://scikit-learn.sourceforge.net/ sunan: Spark MLlib is recommended. It is a scalable machine learning library that can read data from HDFS and, of course, runs on top of Spark. You can access it via PySpark (see the Programming Guide's Python examples). Orange is supposedly pretty decent, from what I've heard, but I've never used it personally. PyML might be worth taking a look at as well. Also, Monte. pysuggest is a Python wrapper for SUGGEST…

Entity Extraction/Recognition with free tools while feeding Lucene Index

时光总嘲笑我的痴心妄想 submitted on 2019-11-29 18:40:24
I'm currently investigating options to extract person names, locations, tech words, and categories from text (a lot of articles from the web), which will then be fed into a Lucene/Elasticsearch index. The extracted information is added as metadata and should increase the precision of search. E.g., when someone queries 'wicket' they should be able to decide whether they mean the cricket term or the Apache project. I tried to implement this on my own with minor success so far. Now I have found a lot of tools, but I'm not sure whether they are suited to this task and which of them integrates well with…
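
As one concrete option among such tools, Apache OpenNLP ships pre-trained name-finder models and slots naturally into an indexing pipeline: extract the entities, then store them as metadata fields on the Lucene/Elasticsearch document. A minimal sketch, assuming the en-ner-person.bin model file has been downloaded locally (class name and sample sentence are illustrative):

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.Span;

// Sketch: extract person names with OpenNLP's pre-trained model.
public class PersonExtractor {
  public static void main(String[] args) throws Exception {
    // Pre-trained model file from the OpenNLP site (local path is an assumption).
    try (InputStream in = new FileInputStream("en-ner-person.bin")) {
      TokenNameFinderModel model = new TokenNameFinderModel(in);
      NameFinderME finder = new NameFinderME(model);

      String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
          "Doug Cutting created Lucene while working on web search.");
      Span[] spans = finder.find(tokens);

      for (Span s : spans) {
        // Join the tokens covered by the span into one name string; in a real
        // pipeline this would become a metadata field (e.g. a Lucene StringField)
        // on the indexed document.
        StringBuilder name = new StringBuilder();
        for (int i = s.getStart(); i < s.getEnd(); i++) {
          if (name.length() > 0) name.append(' ');
          name.append(tokens[i]);
        }
        System.out.println("PERSON: " + name);
      }
    }
  }
}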

How Mahout's recommendation evaluator works

别等时光非礼了梦想. submitted on 2019-11-29 15:52:17
Question: Can anyone tell me how Mahout's RecommenderIRStatsEvaluator works? More specifically, how does it randomly split training and testing data, and what is the result compared against? Based on my understanding, you need some sort of ideal/expected result to compare against the actual result from the recommendation algorithm in order to identify true and false positives and thus compute precision or recall. But it looks like Mahout produces a precision/recall score without that ideal result. Answer 1: The data is…
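
For context, the evaluator constructs the "ideal" answer set itself: roughly, for each user it treats that user's most-preferred items (those at or above a relevance threshold) as relevant, withholds them from the training data, and then counts how many reappear among the top-N recommendations. A minimal usage sketch, assuming a ratings.csv file and a simple user-based recommender (both assumptions, not from the question):

import java.io.File;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class EvaluatorExample {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // path is an assumption

    // The evaluator rebuilds the recommender on each training split it creates.
    RecommenderBuilder builder = m -> {
      UserSimilarity sim = new PearsonCorrelationSimilarity(m);
      return new GenericUserBasedRecommender(m, new NearestNUserNeighborhood(10, sim, m), sim);
    };

    RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
    // at = 5: hold out each user's most-preferred "relevant" items, then check
    // how many of them come back in the top-5 recommendations
    IRStatistics stats = evaluator.evaluate(builder, null, model, null, 5,
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
    System.out.println("precision = " + stats.getPrecision());
    System.out.println("recall    = " + stats.getRecall());
  }
}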
