apache-spark-mllib

How to run multi-threaded jobs in Apache Spark using Scala or Python?

让人想犯罪 __ submitted on 2020-01-14 12:34:57
Question: I am facing a concurrency-related problem in Spark that is stopping me from using it in production, but I know there is a way out of it. I am trying to run Spark ALS for 7 million users over a billion products using order history. First I take the list of distinct users and then loop over these users to get recommendations, which is a pretty slow process and will take days to produce recommendations for all users. I tried taking the Cartesian product of users and products to get recommendations for …
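A minimal sketch of how one might replace the per-user loop with a single distributed call, assuming a pyspark.mllib MatrixFactorizationModel and the usual SparkContext sc (the toy ratings below are hypothetical):

from pyspark.mllib.recommendation import ALS, Rating

# Hypothetical toy ratings; in practice this would be the order-history RDD.
ratings = sc.parallelize([
    Rating(1, 10, 5.0),
    Rating(1, 20, 3.0),
    Rating(2, 10, 4.0),
])

# trainImplicit suits implicit feedback such as order history.
model = ALS.trainImplicit(ratings, rank=10, iterations=5)

# recommendProductsForUsers runs as one distributed job over all users,
# avoiding a driver-side loop (available in Spark 1.4+).
top5 = model.recommendProductsForUsers(5)  # RDD of (userId, [Rating, ...])
print(top5.take(1))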

How to use CrossValidator to choose between different models

做~自己de王妃 submitted on 2020-01-13 19:25:49
Question: I know that I can use a CrossValidator to tune a single model, but what is the suggested approach for evaluating different models against each other? For example, say I wanted to evaluate a LogisticRegression classifier against a LinearSVC classifier using CrossValidator. Answer 1: After familiarizing myself a bit with the API, I solved this problem by implementing a custom Estimator that wraps two or more estimators it can delegate to, where the selected estimator is controlled by a single …
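As a simpler alternative to the custom wrapping Estimator described in the answer, one could also run a separate CrossValidator per model family with the same evaluator and compare avgMetrics; a sketch with hypothetical toy data:

from pyspark.ml.classification import LinearSVC, LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data with the usual 'label' / 'features' columns.
train_df = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.0])), (1.0, Vectors.dense([1.0, 0.0]))] * 10,
    ["label", "features"])

evaluator = BinaryClassificationEvaluator()

def best_cv_metric(estimator, grid):
    # Cross-validate one model family and return its best average metric.
    # A fixed seed keeps the folds comparable across the two runs.
    cv = CrossValidator(estimator=estimator, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3, seed=42)
    return max(cv.fit(train_df).avgMetrics)

lr, svc = LogisticRegression(), LinearSVC()
lr_score = best_cv_metric(lr, ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build())
svc_score = best_cv_metric(svc, ParamGridBuilder().addGrid(svc.regParam, [0.01, 0.1]).build())
print("LogisticRegression:", lr_score, "LinearSVC:", svc_score)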

Spark LDA woes - prediction and OOM questions

无人久伴 submitted on 2020-01-13 13:05:29
Question: I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA. Starting small, following the Java examples, I built a 100K-doc / 600K-feature / 250-topic / 100-iteration model using the distributed model / EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809), which …
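For reference, a minimal sketch of building a distributed (EM-optimized) LDA model with the RDD-based API, assuming the usual SparkContext sc and a toy bag-of-words corpus (the real corpus in the question is of course far larger):

from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

# Toy corpus: (docId, term-count vector) pairs.
corpus = sc.parallelize([
    (0, Vectors.dense([1.0, 2.0, 6.0, 0.0])),
    (1, Vectors.dense([1.0, 3.0, 0.0, 1.0])),
    (2, Vectors.dense([0.0, 4.0, 1.0, 3.0])),
])

# optimizer="em" produces a DistributedLDAModel, as in the question.
model = LDA.train(corpus, k=2, maxIterations=20, optimizer="em")

# topicsMatrix() returns a vocabSize x k matrix of term weights per topic.
print(model.vocabSize())
print(model.topicsMatrix())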

Spark RDD: How to calculate statistics most efficiently?

懵懂的女人 submitted on 2020-01-11 10:38:28
Question: Assuming the existence of an RDD of tuples similar to the following:

(key1, 1) (key3, 9) (key2, 3) (key1, 4) (key1, 5) (key3, 2) (key2, 7) ...

What is the most efficient (and, ideally, distributed) way to compute statistics corresponding to each key? (At the moment, I am looking to calculate standard deviation / variance in particular.) As I understand it, my options amount to: use the colStats function in MLlib. This approach has the advantage of being easily adaptable to use other mllib.stat …
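One distributed way to get per-key variance and standard deviation in a single pass (a sketch, not necessarily the colStats route mentioned above) is to aggregate a StatCounter per key:

from pyspark.statcounter import StatCounter

# sc is the usual SparkContext; the tuples mirror the example above.
rdd = sc.parallelize([("key1", 1), ("key3", 9), ("key2", 3),
                      ("key1", 4), ("key1", 5), ("key3", 2), ("key2", 7)])

# Build one StatCounter per key; merge() adds a value, mergeStats() combines
# partial results from different partitions.
stats_by_key = rdd.aggregateByKey(StatCounter(),
                                  lambda acc, v: acc.merge(v),
                                  lambda a, b: a.mergeStats(b))

for key, stats in stats_by_key.collect():
    print(key, stats.mean(), stats.variance(), stats.stdev())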

Join two Spark mllib pipelines together

♀尐吖头ヾ submitted on 2020-01-11 03:31:12
Question: I have two separate DataFrames, each with several differing processing stages that I handle with mllib transformers in a pipeline. I now want to join these two pipelines together, keeping the features (columns) from each DataFrame. Scikit-learn has the FeatureUnion class for handling this, and I can't seem to find an equivalent for mllib. I can add a custom transformer stage at the end of one pipeline that takes the DataFrame produced by the other pipeline as an attribute and joins …
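A rough sketch of one workaround, assuming both DataFrames share an id column (the pipelines and column names below are hypothetical stand-ins): run each pipeline on its own DataFrame, join the outputs on the id, and combine the two feature columns with a VectorAssembler, which plays roughly the role of scikit-learn's FeatureUnion:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the two DataFrames and their pipelines.
df_a = spark.createDataFrame([(1, 2.0, 3.0), (2, 4.0, 5.0)], ["id", "a1", "a2"])
df_b = spark.createDataFrame([(1, 7.0), (2, 8.0)], ["id", "b1"])
pipeline_a = Pipeline(stages=[VectorAssembler(inputCols=["a1", "a2"], outputCol="features_a")])
pipeline_b = Pipeline(stages=[VectorAssembler(inputCols=["b1"], outputCol="features_b")])

# Run each pipeline separately, then join the transformed outputs on the id.
out_a = pipeline_a.fit(df_a).transform(df_a).select("id", "features_a")
out_b = pipeline_b.fit(df_b).transform(df_b).select("id", "features_b")
joined = out_a.join(out_b, on="id")

# Assemble the per-pipeline feature vectors into a single 'features' column.
union = VectorAssembler(inputCols=["features_a", "features_b"], outputCol="features")
union.transform(joined).show(truncate=False)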

ALS model - predicted full_u * v^t * v ratings are very high

本小妞迷上赌 submitted on 2020-01-10 14:15:31
Question: I'm predicting ratings in between processes that batch-train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v?

! rm -rf ml-1m.zip ml-1m
! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .

from pyspark.mllib.recommendation import Rating

ratingsRDD = sc.textFile('ratings.dat') \
    .map(lambda l: l.split("::")) \
    .map(lambda p: Rating(user = int(p[0]), product = int(p[1]), rating = float(p[2 …
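For context, a sketch of the fold-in computation the title refers to, assuming the ratingsRDD construction above is completed and that sc is the usual SparkContext (the rank, the rated product index, and the rating value below are hypothetical); whether and how the resulting scores should be rescaled is exactly what the question is about:

import numpy as np
from pyspark.mllib.recommendation import ALS

model = ALS.train(ratingsRDD, rank=10, iterations=5)

# Product-factor matrix, one row per product (num_products x rank).
product_features = sorted(model.productFeatures().collect())
V = np.array([np.array(f) for _, f in product_features])

# full_u: the new user's ratings over all products (zeros where unrated).
full_u = np.zeros(V.shape[0])
full_u[0] = 5.0  # hypothetical: the new user rated the first product 5

# full_u.dot(V) folds the user into factor space; multiplying by V.T projects
# back to one predicted score per product (the full_u * v^t * v of the title).
predicted = full_u.dot(V).dot(V.T)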

Spark SQL removing white spaces

守給你的承諾、 submitted on 2020-01-10 06:08:51
Question: I have a simple Spark program which reads a JSON file and emits a CSV file. In the JSON data the values contain leading and trailing white spaces; when I emit the CSV, the leading and trailing white spaces are gone. Is there a way I can retain the spaces? I tried many options like ignoreTrailingWhiteSpace and ignoreLeadingWhiteSpace, but no luck.

input.json
{"key" : "k1", "value1": "Good String", "value2": "Good String"}
{"key" : "k1", "value1": "With Spaces ", "value2": "With Spaces "}
{"key" : …
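A sketch of one common fix, assuming Spark 2.2 or later, where the CSV writer itself trims values unless these data source options are turned off on write (file names here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("input.json")

# By default the CSV writer strips leading/trailing spaces; disabling the
# options on the writer keeps the spaces in the emitted CSV.
(df.write
   .option("ignoreLeadingWhiteSpace", False)
   .option("ignoreTrailingWhiteSpace", False)
   .csv("output_csv"))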

How to get the probability per instance in classifications models in spark.mllib

雨燕双飞 submitted on 2020-01-09 11:56:32
Question: I'm using spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD} and spark.mllib.tree.RandomForest for classification. Using these packages I produce classification models, but these models only predict a specific class per instance. In Weka, we can get the exact probability for each instance to be of each class. How can we do that using these packages? In LogisticRegressionModel we can set the threshold, so I've created a function that checks the results for each point on a …
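For the logistic-regression part, a minimal sketch with toy data (sc is the usual SparkContext): clearing the model's threshold makes predict() return the class-1 probability instead of a hard 0/1 label.

from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

# Hypothetical toy training set.
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
] * 10)

model = LogisticRegressionWithSGD.train(data, iterations=100)

# With the threshold cleared, predict() returns P(class = 1) per instance.
model.clearThreshold()
probs = data.map(lambda p: (p.label, model.predict(p.features)))
print(probs.collect())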
