apache-spark-ml

Serve real-time predictions with trained Spark ML model [duplicate]

Posted by 我与影子孤独终老i on 2019-12-03 07:50:11
This question already has answers here: How to serve a Spark MLlib model? (4 answers) We are currently testing a prediction engine based on Spark's implementation of LDA in Python: https://spark.apache.org/docs/2.2.0/ml-clustering.html#latent-dirichlet-allocation-lda https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.clustering.LDA (we are using the pyspark.ml package, not pyspark.mllib). We were able to successfully train a model on a Spark cluster (using Google Cloud Dataproc). Now we are trying to use the model to serve real-time predictions as an API (e.g. flask …
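
A minimal sketch of one way to do this, assuming the trained pipeline (e.g. feature stages plus LDA) was saved with PipelineModel.save(); the model path, the "text" input column, and the Flask wiring are illustrative assumptions, not the asker's actual setup. A local SparkSession is still required to call transform(), which is the main source of per-request latency.

```python
# Hypothetical sketch: serve a saved spark.ml pipeline behind a Flask endpoint.
# Assumes the pipeline was saved to /models/lda_pipeline and expects a "text" column.
from flask import Flask, request, jsonify
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

app = Flask(__name__)

# Keep one local SparkSession and the loaded model warm between requests.
spark = SparkSession.builder.master("local[2]").appName("lda-serving").getOrCreate()
model = PipelineModel.load("/models/lda_pipeline")

@app.route("/predict", methods=["POST"])
def predict():
    text = request.json["text"]
    # Build a one-row DataFrame and run the whole pipeline on it.
    df = spark.createDataFrame([(text,)], ["text"])
    row = model.transform(df).select("topicDistribution").first()
    return jsonify({"topicDistribution": row["topicDistribution"].toArray().tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```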

Caching intermediate results in Spark ML pipeline

Posted by  ̄綄美尐妖づ on 2019-12-03 02:29:33
Lately I have been planning to migrate my standalone Python ML code to Spark. The ML pipeline in spark.ml turns out to be quite handy, with a streamlined API for chaining algorithm stages and hyper-parameter grid search. Still, I found the existing documentation unclear about one important feature: caching of intermediate results. This feature matters when the pipeline involves computation-intensive stages. For example, in my case I use a huge sparse matrix to perform multiple moving averages on time-series data in order to form input features. The structure of the matrix is …
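
One common workaround, sketched below under toy data: split the expensive feature-engineering stages out of the tuned pipeline, cache their output once, and fit (or grid-search) only the cheap downstream estimator against the cached DataFrame. The stage choices and column names here are illustrative, not the asker's actual pipeline.

```python
# Sketch: cache the output of expensive stages so tuning does not recompute them.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
raw_df = spark.createDataFrame(
    [("spark ml pipelines are handy", 1.0), ("caching avoids recomputation", 0.0)],
    ["text", "label"])

# Run the (stand-in) expensive feature-engineering stages once and cache their output...
featurizer = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
]).fit(raw_df)
features_df = featurizer.transform(raw_df).cache()

# ...then fit or grid-search only the downstream estimator against the cached DataFrame,
# so the upstream stages are not recomputed for every hyper-parameter combination.
lr_model = LogisticRegression(featuresCol="features", labelCol="label").fit(features_df)
```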

Preserve index-string correspondence spark string indexer

Posted by 寵の児 on 2019-12-02 23:21:43
Spark's StringIndexer is quite useful, but it is common to need to retrieve the correspondence between the generated index values and the original strings, and it seems like there should be a built-in way to accomplish this. I'll illustrate using this simple example from the Spark documentation: from pyspark.ml.feature import StringIndexer df = sqlContext.createDataFrame( [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")], ["id", "category"]) indexer = StringIndexer(inputCol="category", outputCol="categoryIndex") indexed_df = indexer.fit(df).transform(df) This simplified case gives …
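
A short sketch of the built-in way, reusing the example DataFrame from the question: the fitted StringIndexerModel exposes the original strings (ordered by index) through its labels attribute, and IndexToString maps an index column back to strings.

```python
# Recover the index-to-string mapping from a fitted StringIndexerModel.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, IndexToString

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

model = StringIndexer(inputCol="category", outputCol="categoryIndex").fit(df)
print(model.labels)  # ['a', 'c', 'b'] -- position in this list is the assigned index

indexed_df = model.transform(df)
converter = IndexToString(inputCol="categoryIndex", outputCol="originalCategory",
                          labels=model.labels)
converter.transform(indexed_df).show()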

Using CategoricalFeaturesInfo with DecisionTreeClassifier method in Spark

Posted by 倾然丶 夕夏残阳落幕 on 2019-12-02 09:23:16
I have to use this code: val dt = new DecisionTreeClassifier().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setImpurity(impurity).setMaxBins(maxBins).setMaxDepth(maxDepth); I need to add categorical feature information so that the decision tree doesn't treat the indexedCategoricalFeatures as numerical. I have this map: val categoricalFeaturesInfo = Map(143 -> 126, 144 -> 5, 145 -> 216, 146 -> 100, 147 -> 14, 148 -> 8, 149 -> 19, 150 -> 7); However, it only works with the DecisionTree.trainClassifier method. I can't use that method because it accepts different arguments than the …
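
For context, the DataFrame-based DecisionTreeClassifier has no categoricalFeaturesInfo parameter; it reads categorical information from metadata attached to the feature column, which VectorIndexer (or StringIndexer on the raw columns) can provide. The pyspark sketch below uses toy data with illustrative column names; for the question's map, maxCategories and the tree's maxBins would need to be at least 216 (the largest category count).

```python
# Sketch: mark categorical features via VectorIndexer metadata instead of categoricalFeaturesInfo.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
# Toy stand-in for the real data: the last feature has few distinct values, so it is categorical.
data = spark.createDataFrame([
    ("yes", Vectors.dense([0.1, 5.0, 2.0])),
    ("no",  Vectors.dense([0.7, 3.0, 1.0])),
    ("yes", Vectors.dense([0.4, 8.0, 2.0])),
    ("no",  Vectors.dense([0.9, 1.0, 0.0])),
], ["label", "features"])

label_indexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
# Any feature with <= maxCategories distinct values is marked categorical in the column metadata.
feature_indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4)
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

model = Pipeline(stages=[label_indexer, feature_indexer, dt]).fit(data)
```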

Text classification - how to approach

Posted by 谁说我不能喝 on 2019-12-02 09:20:56
Question: I'll try to describe what I have in mind. There is text content stored in an MS SQL database. Content arrives daily as a stream. Some people go through the content every day and, if it fits certain criteria, mark it as validated. There is only one category: content is either "valid" or not. What I want is to create a model based on already-validated content, save it, and use this model to "pre-validate" or mark new incoming content. Also, once in a while, to update the model based on a newly …
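
One standard way to frame this in spark.ml, sketched below with toy data: train a binary text-classification pipeline on already-validated rows, persist it, and reload it to score new daily content. Reading from MS SQL (e.g. via the JDBC source), the column names, and the choice of LogisticRegression are all assumptions; any binary classifier could stand in.

```python
# Sketch: train on validated content, save the pipeline, reload it to "pre-validate" new rows.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Historical rows with a 0/1 "validated" label (in practice loaded from the database).
train_df = spark.createDataFrame(
    [("content that met the criteria", 1.0), ("irrelevant text that did not", 0.0)],
    ["content", "label"])

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="content", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf"),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(train_df)
model.write().overwrite().save("/models/validator")  # persist; retrain and overwrite periodically

# Later: reload the saved pipeline and score a new daily batch.
scored = PipelineModel.load("/models/validator").transform(
    spark.createDataFrame([("new incoming text",)], ["content"]))
scored.select("content", "prediction", "probability").show()
```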

String similarity with OR condition in MinHash Spark ML

Posted by …衆ロ難τιáo~ on 2019-12-02 06:43:43
I have two datasets; the first one is a large reference dataset, and for each record in the second dataset the best match will be found from the first dataset through the MinHash algorithm.
val dataset1 =
+-------------+----------+------+------+-----------------------+
| x'| y'| a'| b'| dataString(x'+y'+a')|
+-------------+----------+------+------+-----------------------+
| John| Smith| 55649| 28200| John|Smith|55649|
| Emma| Morales| 78439| 34200| Emma|Morales|78439|
| Janet| Alvarado| 89488| 29103| Janet|Alvarado|89488|
| Elizabeth| K| 36935| 38101| Elizabeth|K|36935|
| Cristin| Cruz| 75716| 70015| Cristin|Cruz|75716|
| Jack| …
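
A pyspark sketch of the basic MinHash matching step (without the OR condition the question goes on to ask about): turn each dataString into character bigrams, hash them into sparse vectors, then use MinHashLSH.approxSimilarityJoin between the reference and query sets. The tokenization scheme, threshold, and column names are illustrative assumptions.

```python
# Sketch: approximate string matching with MinHashLSH over character-bigram sets.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, MinHashLSH

spark = SparkSession.builder.getOrCreate()
ref = spark.createDataFrame([(0, "John|Smith|55649"), (1, "Emma|Morales|78439")],
                            ["id", "dataString"])
qry = spark.createDataFrame([(100, "Jon|Smith|55649")], ["id", "dataString"])

# Characters -> bigrams -> hashed term-frequency vectors (MinHash treats non-zeros as set members).
featurizer = Pipeline(stages=[
    RegexTokenizer(inputCol="dataString", outputCol="chars", pattern=".", gaps=False),
    NGram(inputCol="chars", outputCol="bigrams", n=2),
    HashingTF(inputCol="bigrams", outputCol="features"),
]).fit(ref)
ref_feat, qry_feat = featurizer.transform(ref), featurizer.transform(qry)

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5).fit(ref_feat)
mh.approxSimilarityJoin(ref_feat, qry_feat, threshold=0.8, distCol="jaccardDist") \
  .select(F.col("datasetA.id").alias("refId"),
          F.col("datasetB.id").alias("qryId"), "jaccardDist").show()
```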

How to eval spark.ml model without DataFrames/SparkContext?

Posted by ℡╲_俬逩灬. on 2019-12-02 06:07:49
Question: With Spark MLlib, I'd build a model (like RandomForest), and then it was possible to evaluate it outside of Spark by loading the model and calling predict on it, passing a vector of features. It seems like with Spark ML, predict is now called transform and only acts on a DataFrame. Is there any way to build a DataFrame outside of Spark, since it seems like one needs a SparkContext to build a DataFrame? Am I missing something? Answer 1: Re: Is there any way to build a DataFrame outside of Spark? It is …
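
For reference, a minimal sketch of the "cheapest" option that stays inside Spark: a local[1] SparkSession is enough to build a one-row DataFrame and call transform(). The model path and the "features" column name are assumptions; truly Spark-free serving (e.g. exporting to PMML or a dedicated serving library) is a different approach not shown here.

```python
# Sketch: score a single row with a saved spark.ml model using a lightweight local session.
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.master("local[1]").getOrCreate()
model = PipelineModel.load("/models/random_forest")  # hypothetical saved pipeline

row = spark.createDataFrame([(Vectors.dense([0.1, 2.0, 3.5]),)], ["features"])
print(model.transform(row).select("prediction").first()["prediction"])
```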

Spark: How to get probabilities and AUC for Bernoulli Naive Bayes?

Posted by ε祈祈猫儿з on 2019-12-02 05:14:57
Question: I'm running a Bernoulli Naive Bayes using this code: val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L) val training = splits(0).cache() val test = splits(1) val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli") My question is how I can get the probability of membership to class 0 (or 1) and compute the AUC. I want to get a result similar to LogisticRegressionWithSGD or SVMWithSGD, where I was using this code: val numIterations = 100 val model = SVMWithSGD.train(training, …
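
A pyspark sketch of one way to get both, using the DataFrame-based spark.ml NaiveBayes rather than the RDD-based mllib API from the question (the mllib model does not expose per-class probabilities directly): transform() adds a probability column, and BinaryClassificationEvaluator computes the AUC. The toy data and column names are illustrative; in practice you would randomSplit the data as in the question.

```python
# Sketch: class probabilities and AUC for a Bernoulli Naive Bayes via the spark.ml API.
from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame([
    (0.0, Vectors.dense([1.0, 0.0, 1.0])),
    (1.0, Vectors.dense([0.0, 1.0, 1.0])),
    (0.0, Vectors.dense([1.0, 0.0, 0.0])),
    (1.0, Vectors.dense([0.0, 1.0, 0.0])),
], ["label", "features"])

model = NaiveBayes(smoothing=3.0, modelType="bernoulli").fit(data)
scored = model.transform(data)  # adds rawPrediction, probability, prediction columns
scored.select("label", "probability").show(truncate=False)

auc = BinaryClassificationEvaluator(metricName="areaUnderROC").evaluate(scored)
print(f"AUC = {auc}")
```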