apache-spark-ml

Serve real-time predictions with trained Spark ML model [duplicate]

Posted by 我与影子孤独终老i on 2019-12-03 07:50:11
This question already has answers here: How to serve a Spark MLlib model? (4 answers) We are currently testing a prediction engine based on Spark's implementation of LDA in Python: https://spark.apache.org/docs/2.2.0/ml-clustering.html#latent-dirichlet-allocation-lda https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA (we are using the pyspark.ml package, not pyspark.mllib). We were able to successfully train a model on a Spark cluster (using Google Cloud Dataproc). Now we are trying to use the model to serve real-time predictions as an API (e.g. flask …
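
A minimal sketch of one way to do this, assuming the trained pipeline (e.g. feature stages plus LDA) was saved with PipelineModel.save(); the model path, the "text" input column, and the Flask wiring are illustrative assumptions, not the asker's actual setup. A local SparkSession is still required to call transform(), which is the main source of per-request latency.

```python
# Hypothetical sketch: serve a saved spark.ml pipeline behind a Flask endpoint.
# Assumes the pipeline was saved to /models/lda_pipeline and expects a "text" column.
from flask import Flask, request, jsonify
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

app = Flask(__name__)

# Keep one local SparkSession and the loaded model warm between requests.
spark = SparkSession.builder.master("local[2]").appName("lda-serving").getOrCreate()
model = PipelineModel.load("/models/lda_pipeline")

@app.route("/predict", methods=["POST"])
def predict():
    text = request.json["text"]
    # Build a one-row DataFrame and run the whole pipeline on it.
    df = spark.createDataFrame([(text,)], ["text"])
    row = model.transform(df).select("topicDistribution").first()
    return jsonify({"topicDistribution": row["topicDistribution"].toArray().tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```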

Caching intermediate results in Spark ML pipeline

Posted by  ̄綄美尐妖づ on 2019-12-03 02:29:33
Lately I have been planning to migrate my standalone Python ML code to Spark. The ML pipeline in spark.ml turns out to be quite handy, with a streamlined API for chaining algorithm stages and hyper-parameter grid search. Still, I found the existing documentation unclear about one important feature: caching of intermediate results. This feature matters when the pipeline involves computation-intensive stages. For example, in my case I use a huge sparse matrix to perform multiple moving averages on time-series data in order to form input features. The structure of the matrix is …
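
One common workaround, sketched below under toy data: split the expensive feature-engineering stages out of the tuned pipeline, cache their output once, and fit (or grid-search) only the cheap downstream estimator against the cached DataFrame. The stage choices and column names here are illustrative, not the asker's actual pipeline.

```python
# Sketch: cache the output of expensive stages so tuning does not recompute them.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
raw_df = spark.createDataFrame(
    [("spark ml pipelines are handy", 1.0), ("caching avoids recomputation", 0.0)],
    ["text", "label"])

# Run the (stand-in) expensive feature-engineering stages once and cache their output...
featurizer = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
]).fit(raw_df)
features_df = featurizer.transform(raw_df).cache()

# ...then fit or grid-search only the downstream estimator against the cached DataFrame,
# so the upstream stages are not recomputed for every hyper-parameter combination.
lr_model = LogisticRegression(featuresCol="features", labelCol="label").fit(features_df)
```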

Preserve index-string correspondence spark string indexer

Posted by 寵の児 on 2019-12-02 23:21:43
Spark's StringIndexer is quite useful, but it is common to need to retrieve the correspondence between the generated index values and the original strings, and it seems like there should be a built-in way to accomplish this. I'll illustrate using this simple example from the Spark documentation: from pyspark.ml.feature import StringIndexer df = sqlContext.createDataFrame( [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")], ["id", "category"]) indexer = StringIndexer(inputCol="category", outputCol="categoryIndex") indexed_df = indexer.fit(df).transform(df) This simplified case gives …
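
A short sketch of the built-in way, reusing the example DataFrame from the question: the fitted StringIndexerModel exposes the original strings (ordered by index) through its labels attribute, and IndexToString maps an index column back to strings.

```python
# Recover the index-to-string mapping from a fitted StringIndexerModel.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, IndexToString

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

model = StringIndexer(inputCol="category", outputCol="categoryIndex").fit(df)
print(model.labels)  # ['a', 'c', 'b'] -- position in this list is the assigned index

indexed_df = model.transform(df)
converter = IndexToString(inputCol="categoryIndex", outputCol="originalCategory",
                          labels=model.labels)
converter.transform(indexed_df).show()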

Using CategoricalFeaturesInfo with DecisionTreeClassifier method in Spark

Posted by 倾然丶 夕夏残阳落幕 on 2019-12-02 09:23:16
I have to use this code: val dt = new DecisionTreeClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setImpurity(impurity).setMaxBins(maxBins).setMaxDepth(maxDepth); I need to add categorical feature information so that the decision tree doesn't treat the indexedCategoricalFeatures as numerical. I have this map: val categoricalFeaturesInfo = Map(143 -> 126, 144 -> 5, 145 -> 216, 146 -> 100, 147 -> 14, 148 -> 8, 149 -> 19, 150 -> 7); However, it only works with the DecisionTree.trainClassifier method. I can't use that method because it accepts different arguments than the …
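
For context, the DataFrame-based DecisionTreeClassifier has no categoricalFeaturesInfo parameter; it reads categorical information from metadata attached to the feature column, which VectorIndexer (or StringIndexer on the raw columns) can provide. The pyspark sketch below uses toy data with illustrative column names; for the question's map, maxCategories and the tree's maxBins would need to be at least 216 (the largest category count).

```python
# Sketch: mark categorical features via VectorIndexer metadata instead of categoricalFeaturesInfo.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
# Toy stand-in for the real data: the last feature has few distinct values, so it is categorical.
data = spark.createDataFrame([
    ("yes", Vectors.dense([0.1, 5.0, 2.0])),
    ("no",  Vectors.dense([0.7, 3.0, 1.0])),
    ("yes", Vectors.dense([0.4, 8.0, 2.0])),
    ("no",  Vectors.dense([0.9, 1.0, 0.0])),
], ["label", "features"])

label_indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
# Any feature with <= maxCategories distinct values is marked categorical in the column metadata.
feature_indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4)
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

model = Pipeline(stages=[label_indexer, feature_indexer, dt]).fit(data)
```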

Text classification - how to approach

Posted by 谁说我不能喝 on 2019-12-02 09:20:56
Question: I'll try to describe what I have in mind. There is text content stored in an MS SQL database. Content arrives daily as a stream. Some people go through the content every day and, if it fits certain criteria, mark it as validated. There is only one category: content is either "valid" or not. What I want is to create a model based on already-validated content, save it, and use this model to "pre-validate" or mark new incoming content. Also, once in a while, to update the model based on a newly …
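
One standard way to frame this in spark.ml, sketched below with toy data: train a binary text-classification pipeline on already-validated rows, persist it, and reload it to score new daily content. Reading from MS SQL (e.g. via the JDBC source), the column names, and the choice of LogisticRegression are all assumptions; any binary classifier could stand in.

```python
# Sketch: train on validated content, save the pipeline, reload it to "pre-validate" new rows.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Historical rows with a 0/1 "validated" label (in practice loaded from the database).
train_df = spark.createDataFrame(
    [("content that met the criteria", 1.0), ("irrelevant text that did not", 0.0)],
    ["content", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="content", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf"),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train_df)
model.write().overwrite().save("/models/validator")  # persist; retrain and overwrite periodically

# Later: reload the saved pipeline and score a new daily batch.
scored = PipelineModel.load("/models/validator").transform(
    spark.createDataFrame([("new incoming text",)], ["content"]))
scored.select("content", "prediction", "probability").show()
```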

String similarity with OR condition in MinHash Spark ML

Posted by …衆ロ難τιáo~ on 2019-12-02 06:43:43
I have two datasets; the first one is a large reference dataset, and for each record in the second dataset the best match will be found from the first dataset through the MinHash algorithm.
val dataset1 =
+-------------+----------+------+------+-----------------------+
| x'| y'| a'| b'| dataString(x'+y'+a')|
+-------------+----------+------+------+-----------------------+
| John| Smith| 55649| 28200| John|Smith|55649|
| Emma| Morales| 78439| 34200| Emma|Morales|78439|
| Janet| Alvarado| 89488| 29103| Janet|Alvarado|89488|
| Elizabeth| K| 36935| 38101| Elizabeth|K|36935|
| Cristin| Cruz| 75716| 70015| Cristin|Cruz|75716|
| Jack| …
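
A pyspark sketch of the basic MinHash matching step (without the OR condition the question goes on to ask about): turn each dataString into character bigrams, hash them into sparse vectors, then use MinHashLSH.approxSimilarityJoin between the reference and query sets. The tokenization scheme, threshold, and column names are illustrative assumptions.

```python
# Sketch: approximate string matching with MinHashLSH over character-bigram sets.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH

spark = SparkSession.builder.getOrCreate()
ref = spark.createDataFrame([(0, "John|Smith|55649"), (1, "Emma|Morales|78439")],
                            ["id", "dataString"])
qry = spark.createDataFrame([(100, "Jon|Smith|55649")], ["id", "dataString"])

# Characters -> bigrams -> hashed term-frequency vectors (MinHash treats non-zeros as set members).
featurizer = Pipeline(stages=[
    RegexTokenizer(inputCol="dataString", outputCol="chars", pattern=".", gaps=False),
    NGram(inputCol="chars", outputCol="bigrams", n=2),
    HashingTF(inputCol="bigrams", outputCol="features"),
]).fit(ref)
ref_feat, qry_feat = featurizer.transform(ref), featurizer.transform(qry)

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5).fit(ref_feat)
mh.approxSimilarityJoin(ref_feat, qry_feat, threshold=0.8, distCol="jaccardDist") \
  .select(F.col("datasetA.id").alias("refId"),
          F.col("datasetB.id").alias("qryId"), "jaccardDist").show()
```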

How to eval spark.ml model without DataFrames/SparkContext?

Posted by ℡╲_俬逩灬. on 2019-12-02 06:07:49
Question: With Spark MLlib, I'd build a model (like RandomForest), and then it was possible to evaluate it outside of Spark by loading the model and calling predict on it, passing a vector of features. It seems like with Spark ML, predict is now called transform and only acts on a DataFrame. Is there any way to build a DataFrame outside of Spark, since it seems like one needs a SparkContext to build a DataFrame? Am I missing something? Answer 1: Re: Is there any way to build a DataFrame outside of Spark? It is …
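
For reference, a minimal sketch of the "cheapest" option that stays inside Spark: a local[1] SparkSession is enough to build a one-row DataFrame and call transform(). The model path and the "features" column name are assumptions; truly Spark-free serving (e.g. exporting to PMML or a dedicated serving library) is a different approach not shown here.

```python
# Sketch: score a single row with a saved spark.ml model using a lightweight local session.
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[1]").getOrCreate()
model = PipelineModel.load("/models/random_forest")  # hypothetical saved pipeline

row = spark.createDataFrame([(Vectors.dense([0.1, 2.0, 3.5]),)], ["features"])
print(model.transform(row).select("prediction").first()["prediction"])
```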

Spark: How to get probabilities and AUC for Bernoulli Naive Bayes?

Posted by ε祈祈猫儿з on 2019-12-02 05:14:57
Question: I'm running a Bernoulli Naive Bayes using this code: val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L) val training = splits(0).cache() val test = splits(1) val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli") My question is how I can get the probability of membership to class 0 (or 1) and compute the AUC. I want to get a result similar to LogisticRegressionWithSGD or SVMWithSGD, where I was using this code: val numIterations = 100 val model = SVMWithSGD.train(training, …
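
A pyspark sketch of one way to get both, using the DataFrame-based spark.ml NaiveBayes rather than the RDD-based mllib API from the question (the mllib model does not expose per-class probabilities directly): transform() adds a probability column, and BinaryClassificationEvaluator computes the AUC. The toy data and column names are illustrative; in practice you would randomSplit the data as in the question.

```python
# Sketch: class probabilities and AUC for a Bernoulli Naive Bayes via the spark.ml API.
from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame([
    (0.0, Vectors.dense([1.0, 0.0, 1.0])),
    (1.0, Vectors.dense([0.0, 1.0, 1.0])),
    (0.0, Vectors.dense([1.0, 0.0, 0.0])),
    (1.0, Vectors.dense([0.0, 1.0, 0.0])),
], ["label", "features"])

model = NaiveBayes(smoothing=3.0, modelType="bernoulli").fit(data)
scored = model.transform(data)  # adds rawPrediction, probability, prediction columns
scored.select("label", "probability").show(truncate=False)

auc = BinaryClassificationEvaluator(metricName="areaUnderROC").evaluate(scored)
print(f"AUC = {auc}")
```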