apache-spark-mllib

Spark not utilizing all the cores while running LinearRegressionWithSGD

别说谁变了你拦得住时间么 submitted on 2019-12-02 02:35:11
I am running Spark on my local machine (16 GB RAM, 8 CPU cores) and trying to train a linear regression model on a dataset of about 300 MB. The CPU statistics and the list of running processes show that only one thread is executing. The documentation says a distributed version of SGD is implemented: http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
from pyspark import SparkContext

def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
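A minimal sketch of one common remedy, assuming the single-thread behavior comes from the master URL or from the input RDD having only one partition (the file path, partition count, and training parameters below are hypothetical):

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# "local[*]" uses every local core; a plain "local" master runs one task at a time.
sc = SparkContext("local[*]", "lr-sgd")

def parse_point(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

# Ask for several partitions so each SGD iteration computes gradients
# in parallel tasks instead of a single one.
data = sc.textFile("data.csv", minPartitions=8).map(parse_point).cache()

model = LinearRegressionWithSGD.train(data, iterations=100, step=0.01)
print(model.weights)
```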

How to evaluate a spark.ml model without DataFrames/SparkContext?

≯℡__Kan透↙ submitted on 2019-12-02 02:25:36
With Spark MLlib I would build a model (such as RandomForest) and could then evaluate it outside of Spark by loading the model and calling predict on it with a vector of features. With Spark ML, predict is now called transform and only operates on a DataFrame. Is there any way to build a DataFrame outside of Spark, given that building one seems to require a SparkContext? Am I missing something?

Re: Is there any way to build a DataFrame outside of Spark? It is not possible. DataFrames live inside an SQLContext, which itself lives in a SparkContext. Perhaps you could work it
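A minimal sketch of the usual workaround, assuming a lightweight local SparkSession can be started wherever the scoring happens (the model path and feature values below are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import RandomForestClassificationModel

# A single-core local session is enough to build the one-row DataFrame
# that transform() requires.
spark = SparkSession.builder.master("local[1]").appName("score").getOrCreate()

model = RandomForestClassificationModel.load("/path/to/saved/model")

# Wrap the feature vector in a one-row DataFrame and score it.
df = spark.createDataFrame([(Vectors.dense([0.1, 2.3, 4.5]),)], ["features"])
prediction = model.transform(df).select("prediction").head()[0]
print(prediction)
```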

Spark RDD: How to calculate statistics most efficiently?

↘锁芯ラ submitted on 2019-12-02 01:16:28
Assuming the existence of an RDD of tuples similar to the following:

(key1, 1) (key3, 9) (key2, 3) (key1, 4) (key1, 5) (key3, 2) (key2, 7) ...

What is the most efficient (and, ideally, distributed) way to compute statistics corresponding to each key? (At the moment I am looking to calculate standard deviation / variance in particular.) As I understand it, my options amount to: use the colStats function in MLlib. This approach has the advantage of being easily adaptable to other mllib.stat functions later, if other statistical computations are deemed necessary. However, it operates on an RDD
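A minimal sketch of a per-key alternative, assuming a single-pass combiner is acceptable: accumulate count, sum, and sum of squares per key with aggregateByKey and derive the variance from those (the keys and values mirror the example above):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "per-key-stats")
rdd = sc.parallelize([("key1", 1), ("key3", 9), ("key2", 3),
                      ("key1", 4), ("key1", 5), ("key3", 2), ("key2", 7)])

# Per-key accumulator: (count, sum, sum of squares).
zero = (0, 0.0, 0.0)
seq_op = lambda acc, x: (acc[0] + 1, acc[1] + x, acc[2] + x * x)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(acc):
    n, s, ss = acc
    mean = s / n
    var = ss / n - mean * mean  # population variance
    return {"count": n, "mean": mean, "variance": var, "stdev": var ** 0.5}

stats = rdd.aggregateByKey(zero, seq_op, comb_op).mapValues(finalize)
print(stats.collectAsMap())
```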

Anomaly detection with PCA in Spark

☆樱花仙子☆ submitted on 2019-12-02 00:21:53
I read the article "Anomaly detection with Principal Component Analysis (PCA)". It states the following:

• The PCA algorithm basically transforms data readings from an existing coordinate system into a new coordinate system.
• The closer data readings are to the center of the new coordinate system, the closer these readings are to an optimum value.
• The anomaly score is calculated using the Mahalanobis distance between a reading and the mean of all readings, which is the center of the transformed coordinate system.

Can anyone describe in more detail about anomaly detection
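A minimal sketch of one way to read that description, assuming a pyspark.ml pipeline is acceptable: project the data with PCA, standardize the principal-component scores, and use the length of the standardized score vector as the anomaly score (in the whitened PCA space this Euclidean norm plays the role of the Mahalanobis distance from the mean). The toy data, column names, and k are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import PCA, StandardScaler
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").appName("pca-anomaly").getOrCreate()

# Toy data: each row is a feature vector; the last row is an obvious outlier.
df = spark.createDataFrame(
    [(Vectors.dense([1.0, 2.0, 3.0]),),
     (Vectors.dense([1.1, 1.9, 3.2]),),
     (Vectors.dense([9.0, 0.1, -4.0]),)],
    ["features"])

# 1. Project the readings into the principal-component coordinate system.
pca = PCA(k=2, inputCol="features", outputCol="pca").fit(df)
scores = pca.transform(df)

# 2. Center and scale the scores so each component has unit variance
#    (the "whitening" that makes Euclidean distance behave like Mahalanobis).
scaler = StandardScaler(inputCol="pca", outputCol="scaled",
                        withMean=True, withStd=True).fit(scores)
scaled = scaler.transform(scores)

# 3. Anomaly score = distance from the center of the transformed space.
norm = udf(lambda v: float(v.norm(2)), DoubleType())
scaled.select(norm("scaled").alias("anomaly_score")).show()
```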

StackOverflowError when applying PySpark ALS's "recommendProductsForUsers" (although a cluster with >300 GB RAM is available)

心不动则不痛 submitted on 2019-12-01 23:39:32
Looking for expertise to guide me on the issue below.

Background: I'm trying to get going with a basic PySpark script inspired by this example. As deployment infrastructure I use a Google Cloud Dataproc cluster. The cornerstone of my code is the function "recommendProductsForUsers" documented here, which gives me back the top X products for all users in the model.

Issue I incur: The ALS.train script runs smoothly and scales well on GCP (easily >1 million customers). However, applying the predictions, i.e. using the functions 'predictAll' or 'recommendProductsForUsers', does not scale at all. My script runs smoothly for a
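A sketch of two mitigations that are commonly tried for a StackOverflowError here, assuming it comes from the very long RDD lineage that iterative ALS builds up: set a checkpoint directory before training so the lineage can be truncated, and write the recommendations out immediately. The paths, rank, and iteration count below are hypothetical:

```python
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext(appName="als-recommend")

# Truncating the lineage with checkpoints is the usual remedy for
# StackOverflowError in long iterative jobs such as ALS.
sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")

ratings = sc.textFile("hdfs:///data/ratings.csv") \
            .map(lambda line: line.split(",")) \
            .map(lambda f: Rating(int(f[0]), int(f[1]), float(f[2])))

model = ALS.train(ratings, rank=10, iterations=10, lambda_=0.01)

# Top 10 products per user; writing the result out forces evaluation
# while the model's factor RDDs are still cached.
recs = model.recommendProductsForUsers(10)
recs.saveAsTextFile("hdfs:///output/recommendations")
```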

Spark: How to get probabilities and AUC for Bernoulli Naive Bayes?

拟墨画扇 submitted on 2019-12-01 22:46:27
I'm running a Bernoulli Naive Bayes using this code:

val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")

My question is how I can get the probability of membership to class 0 (or 1) and compute the AUC. I want to get a result similar to LogisticRegressionWithSGD or SVMWithSGD, where I was using this code:

val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)
model.clearThreshold()
// Compute raw scores on the test set.
val labelAndPreds = test
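One workaround, sketched in PySpark rather than the Scala above and assuming a DataFrame of labeled feature vectors is available, is to use the DataFrame-based spark.ml NaiveBayes, which exposes a probability column, and then compute the AUC with BinaryClassificationEvaluator (the toy data below is purely illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[*]").appName("bernoulli-nb").getOrCreate()

data = spark.createDataFrame(
    [(0.0, Vectors.dense([1.0, 0.0, 1.0])),
     (1.0, Vectors.dense([0.0, 1.0, 0.0])),
     (0.0, Vectors.dense([1.0, 1.0, 0.0])),
     (1.0, Vectors.dense([0.0, 1.0, 1.0]))],
    ["label", "features"])

nb = NaiveBayes(modelType="bernoulli", smoothing=3.0)
model = nb.fit(data)

# transform() adds rawPrediction, probability and prediction columns;
# the probability vector holds P(class 0) and P(class 1) per row.
scored = model.transform(data)
scored.select("label", "probability", "prediction").show(truncate=False)

# Area under the ROC curve from the raw scores (evaluated on the training
# data only to keep the sketch short; use a held-out split in practice).
auc = BinaryClassificationEvaluator(metricName="areaUnderROC").evaluate(scored)
print(auc)
```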

How does Spark keep track of the splits in randomSplit?

纵然是瞬间 submitted on 2019-12-01 22:37:37
This question explains how Spark's random split works (How does Spark's RDD.randomSplit actually split the RDD), but I don't understand how Spark keeps track of which values went to one split so that those same values don't go to the second split. If we look at the implementation of randomSplit:

def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] = {
  // It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its
  // constituent partitions each time a split is materialized which could result in
  // overlapping splits. To prevent this, we explicitly
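A small PySpark sketch of the idea, assuming the short version is: each split is a Bernoulli-style sample driven by the same seed over complementary probability ranges, so re-materializing a split reproduces the same per-element decisions (as long as partition contents and ordering are stable) and the splits stay disjoint:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "random-split-demo")
rdd = sc.parallelize(range(100), numSlices=4)

a, b = rdd.randomSplit([0.7, 0.3], seed=42)

# Collecting a split twice yields the same elements: the random draw for
# each element depends only on the seed and the element's position in its
# partition, not on which split asked for it.
first = set(a.collect())
second = set(a.collect())
assert first == second

# The splits are complementary ranges of the same random draw, so they
# never overlap.
assert first.isdisjoint(set(b.collect()))
```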

FPGrowth computing associations in PySpark vs Scala

北城余情 submitted on 2019-12-01 22:05:04
Question: Using http://spark.apache.org/docs/1.6.1/mllib-frequent-pattern-mining.html

Python code:

from pyspark.mllib.fpm import FPGrowth
model = FPGrowth.train(dataframe, 0.01, 10)

Scala:

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val data = sc.textFile("data/mllib/sample_fpgrowth.txt")
val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' '))
val fpg = new FPGrowth()
  .setMinSupport(0.2)
  .setNumPartitions(10)
val model = fpg.run(transactions)
model
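For reference, a sketch of the PySpark equivalent of the Scala snippet above: the RDD-based FPGrowth.train expects an RDD of transactions (lists of items) rather than a DataFrame, with minSupport and numPartitions as the remaining arguments (the sample file path matches the Scala example):

```python
from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

sc = SparkContext("local[*]", "fpgrowth")

# Each line of the sample file is one transaction, items separated by spaces.
data = sc.textFile("data/mllib/sample_fpgrowth.txt")
transactions = data.map(lambda line: line.strip().split(" "))

model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)

for itemset in model.freqItemsets().collect():
    print(itemset.items, itemset.freq)
```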

How to extract rules from decision tree spark MLlib

旧街凉风 submitted on 2019-12-01 18:13:25
I am using Spark MLlib 1.4.1 to create a decision tree model. Now I want to extract rules from the decision tree. How can I extract the rules?

You can get the full model as a string by calling model.toDebugString(), or save it as JSON by calling model.save(sc, filePath). The documentation is here; it contains an example with a small sample dataset whose output format you can inspect on the command line. Here I have formatted the script so that you can paste and run it directly.

from numpy import array
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

data = [ LabeledPoint
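A runnable sketch along the lines of that answer, assuming a tiny hand-made training set (the data and tree parameters below are illustrative, not the original poster's):

```python
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

sc = SparkContext("local[*]", "dt-rules")

# Tiny toy dataset: a label followed by two features.
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(0.0, [0.0, 2.0]),
    LabeledPoint(1.0, [1.0, 1.0]),
    LabeledPoint(1.0, [1.0, 3.0]),
])

model = DecisionTree.trainClassifier(
    data, numClasses=2, categoricalFeaturesInfo={}, maxDepth=3)

# toDebugString() prints the learned if/else structure, i.e. the rules;
# model.save(sc, path) would persist the same tree for later inspection.
print(model.toDebugString())
```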
