apache-spark-mllib

How to run multi-threaded jobs in Apache Spark using Scala or Python?

让人想犯罪 __ submitted on 2020-01-14 12:34:57
Question: I am facing a concurrency-related problem in Spark that is stopping me from using it in production, but I know there is a way out of it. I am trying to run Spark ALS for 7 million users over a billion products using order history. First I take the list of distinct users and then loop over these users to get recommendations, which is a pretty slow process and will take days to produce recommendations for all users. I tried taking the Cartesian product of users and products to get recommendations for …
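A minimal sketch of how one might replace the per-user loop with a single distributed call, assuming a pyspark.mllib MatrixFactorizationModel and the usual SparkContext sc (the toy ratings below are hypothetical):

from pyspark.mllib.recommendation import ALS, Rating

# Hypothetical toy ratings; in practice this would be the order-history RDD.
ratings = sc.parallelize([
    Rating(1, 10, 5.0),
    Rating(1, 20, 3.0),
    Rating(2, 10, 4.0),
])

# trainImplicit suits implicit feedback such as order history.
model = ALS.trainImplicit(ratings, rank=10, iterations=5)

# recommendProductsForUsers runs as one distributed job over all users,
# avoiding a driver-side loop (available in Spark 1.4+).
top5 = model.recommendProductsForUsers(5)  # RDD of (userId, [Rating, ...])
print(top5.take(1))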

How to use CrossValidator to choose between different models

做~自己de王妃 submitted on 2020-01-13 19:25:49
Question: I know that I can use a CrossValidator to tune a single model, but what is the suggested approach for evaluating different models against each other? For example, say I wanted to evaluate a LogisticRegression classifier against a LinearSVC classifier using CrossValidator. Answer 1: After familiarizing myself a bit with the API, I solved this problem by implementing a custom Estimator that wraps two or more estimators it can delegate to, where the selected estimator is controlled by a single …
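As a simpler alternative to the custom wrapping Estimator described in the answer, one could also run a separate CrossValidator per model family with the same evaluator and compare avgMetrics; a sketch with hypothetical toy data:

from pyspark.ml.classification import LinearSVC, LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data with the usual 'label' / 'features' columns.
train_df = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.0])), (1.0, Vectors.dense([1.0, 0.0]))] * 10,
    ["label", "features"])

evaluator = BinaryClassificationEvaluator()

def best_cv_metric(estimator, grid):
    # Cross-validate one model family and return its best average metric.
    # A fixed seed keeps the folds comparable across the two runs.
    cv = CrossValidator(estimator=estimator, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3, seed=42)
    return max(cv.fit(train_df).avgMetrics)

lr, svc = LogisticRegression(), LinearSVC()
lr_score = best_cv_metric(lr, ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build())
svc_score = best_cv_metric(svc, ParamGridBuilder().addGrid(svc.regParam, [0.01, 0.1]).build())
print("LogisticRegression:", lr_score, "LinearSVC:", svc_score)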

Spark LDA woes - prediction and OOM questions

无人久伴 submitted on 2020-01-13 13:05:29
Question: I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA. Starting small, following the Java examples, I built a 100K-doc / 600K-feature / 250-topic / 100-iteration model using the distributed model / EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809), which …
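For reference, a minimal sketch of building a distributed (EM-optimized) LDA model with the RDD-based API, assuming the usual SparkContext sc and a toy bag-of-words corpus (the real corpus in the question is of course far larger):

from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

# Toy corpus: (docId, term-count vector) pairs.
corpus = sc.parallelize([
    (0, Vectors.dense([1.0, 2.0, 6.0, 0.0])),
    (1, Vectors.dense([1.0, 3.0, 0.0, 1.0])),
    (2, Vectors.dense([0.0, 4.0, 1.0, 3.0])),
])

# optimizer="em" produces a DistributedLDAModel, as in the question.
model = LDA.train(corpus, k=2, maxIterations=20, optimizer="em")

# topicsMatrix() returns a vocabSize x k matrix of term weights per topic.
print(model.vocabSize())
print(model.topicsMatrix())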

Spark RDD: How to calculate statistics most efficiently?

懵懂的女人 submitted on 2020-01-11 10:38:28
Question: Assuming the existence of an RDD of tuples similar to the following:

(key1, 1) (key3, 9) (key2, 3) (key1, 4) (key1, 5) (key3, 2) (key2, 7) ...

What is the most efficient (and, ideally, distributed) way to compute statistics corresponding to each key? (At the moment, I am looking to calculate standard deviation / variance in particular.) As I understand it, my options amount to: use the colStats function in MLlib. This approach has the advantage of being easily adaptable to use other mllib.stat …
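One distributed way to get per-key variance and standard deviation in a single pass (a sketch, not necessarily the colStats route mentioned above) is to aggregate a StatCounter per key:

from pyspark.statcounter import StatCounter

# sc is the usual SparkContext; the tuples mirror the example above.
rdd = sc.parallelize([("key1", 1), ("key3", 9), ("key2", 3),
                      ("key1", 4), ("key1", 5), ("key3", 2), ("key2", 7)])

# Build one StatCounter per key; merge() adds a value, mergeStats() combines
# partial results from different partitions.
stats_by_key = rdd.aggregateByKey(StatCounter(),
                                  lambda acc, v: acc.merge(v),
                                  lambda a, b: a.mergeStats(b))

for key, stats in stats_by_key.collect():
    print(key, stats.mean(), stats.variance(), stats.stdev())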

Join two Spark mllib pipelines together

♀尐吖头ヾ submitted on 2020-01-11 03:31:12
Question: I have two separate DataFrames, each with several differing processing stages that I handle with mllib transformers in a pipeline. I now want to join these two pipelines together, keeping the features (columns) from each DataFrame. Scikit-learn has the FeatureUnion class for handling this, and I can't seem to find an equivalent for mllib. I can add a custom transformer stage at the end of one pipeline that takes the DataFrame produced by the other pipeline as an attribute and joins …
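A rough sketch of one workaround, assuming both DataFrames share an id column (the pipelines and column names below are hypothetical stand-ins): run each pipeline on its own DataFrame, join the outputs on the id, and combine the two feature columns with a VectorAssembler, which plays roughly the role of scikit-learn's FeatureUnion:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the two DataFrames and their pipelines.
df_a = spark.createDataFrame([(1, 2.0, 3.0), (2, 4.0, 5.0)], ["id", "a1", "a2"])
df_b = spark.createDataFrame([(1, 7.0), (2, 8.0)], ["id", "b1"])
pipeline_a = Pipeline(stages=[VectorAssembler(inputCols=["a1", "a2"], outputCol="features_a")])
pipeline_b = Pipeline(stages=[VectorAssembler(inputCols=["b1"], outputCol="features_b")])

# Run each pipeline separately, then join the transformed outputs on the id.
out_a = pipeline_a.fit(df_a).transform(df_a).select("id", "features_a")
out_b = pipeline_b.fit(df_b).transform(df_b).select("id", "features_b")
joined = out_a.join(out_b, on="id")

# Assemble the per-pipeline feature vectors into a single 'features' column.
union = VectorAssembler(inputCols=["features_a", "features_b"], outputCol="features")
union.transform(joined).show(truncate=False)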

ALS model - predicted full_u * v^t * v ratings are very high

本小妞迷上赌 submitted on 2020-01-10 14:15:31
Question: I'm predicting ratings in between processes that batch-train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v?

! rm -rf ml-1m.zip ml-1m
! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .

from pyspark.mllib.recommendation import Rating

ratingsRDD = sc.textFile('ratings.dat') \
    .map(lambda l: l.split("::")) \
    .map(lambda p: Rating(user = int(p[0]), product = int(p[1]), rating = float(p[2 …
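For context, a sketch of the fold-in computation the title refers to, assuming the ratingsRDD construction above is completed and that sc is the usual SparkContext (the rank, the rated product index, and the rating value below are hypothetical); whether and how the resulting scores should be rescaled is exactly what the question is about:

import numpy as np
from pyspark.mllib.recommendation import ALS

model = ALS.train(ratingsRDD, rank=10, iterations=5)

# Product-factor matrix, one row per product (num_products x rank).
product_features = sorted(model.productFeatures().collect())
V = np.array([np.array(f) for _, f in product_features])

# full_u: the new user's ratings over all products (zeros where unrated).
full_u = np.zeros(V.shape[0])
full_u[0] = 5.0  # hypothetical: the new user rated the first product 5

# full_u.dot(V) folds the user into factor space; multiplying by V.T projects
# back to one predicted score per product (the full_u * v^t * v of the title).
predicted = full_u.dot(V).dot(V.T)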

Spark SQL removing white spaces

守給你的承諾、 submitted on 2020-01-10 06:08:51
Question: I have a simple Spark program which reads a JSON file and emits a CSV file. In the JSON data the values contain leading and trailing white spaces; when I emit the CSV, the leading and trailing white spaces are gone. Is there a way I can retain the spaces? I tried many options like ignoreTrailingWhiteSpace and ignoreLeadingWhiteSpace, but no luck.

input.json
{"key" : "k1", "value1": "Good String", "value2": "Good String"}
{"key" : "k1", "value1": "With Spaces ", "value2": "With Spaces "}
{"key" : …
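A sketch of one common fix, assuming Spark 2.2 or later, where the CSV writer itself trims values unless these data source options are turned off on write (file names here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("input.json")

# By default the CSV writer strips leading/trailing spaces; disabling the
# options on the writer keeps the spaces in the emitted CSV.
(df.write
   .option("ignoreLeadingWhiteSpace", False)
   .option("ignoreTrailingWhiteSpace", False)
   .csv("output_csv"))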

How to get the probability per instance in classifications models in spark.mllib

雨燕双飞 submitted on 2020-01-09 11:56:32
Question: I'm using spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD} and spark.mllib.tree.RandomForest for classification. Using these packages I produce classification models, but these models only predict a specific class per instance. In Weka, we can get the exact probability for each instance to be of each class. How can we do that using these packages? In LogisticRegressionModel we can set the threshold, so I've created a function that checks the results for each point on a …
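For the logistic-regression part, a minimal sketch with toy data (sc is the usual SparkContext): clearing the model's threshold makes predict() return the class-1 probability instead of a hard 0/1 label.

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

# Hypothetical toy training set.
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
] * 10)

model = LogisticRegressionWithSGD.train(data, iterations=100)

# With the threshold cleared, predict() returns P(class = 1) per instance.
model.clearThreshold()
probs = data.map(lambda p: (p.label, model.predict(p.features)))
print(probs.collect())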
