apache-spark-mllib

Spark: StringIndexer on sentences

Submitted by 旧巷老猫 on 2019-12-02 10:30:09
Question: I am trying to apply something like StringIndexer to a column of sentences, i.e. transforming a list of words into a list of integers. For example, given the input dataset:

(1, ["I", "like", "Spark"])
(2, ["I", "hate", "Spark"])

I expected the output after StringIndexer to look like:

(1, [0, 2, 1])
(2, [0, 3, 1])

Ideally, I would like to make this transformation part of a Pipeline, so that I can chain a couple of transformers together and serialize the result for online serving. Is this something Spark supports natively? Thank you!
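
StringIndexer itself only indexes a single string column, not an array of strings, so this is not supported out of the box. A minimal sketch of one workaround, assuming a SparkSession named spark is in scope: fit a CountVectorizer to obtain a vocabulary, then map each word to its position in that vocabulary with a UDF. Note that CountVectorizer assigns indices by corpus frequency, so the exact integers may differ from the example above.

import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.functions._

val df = spark.createDataFrame(Seq(
  (1, Seq("I", "like", "Spark")),
  (2, Seq("I", "hate", "Spark"))
)).toDF("id", "words")

// Fit a vocabulary over the word arrays; the model's vocabulary array
// maps each term to its index (its position in the array).
val cvModel = new CountVectorizer().setInputCol("words").setOutputCol("vec").fit(df)
val vocab = cvModel.vocabulary

// Replace each word with its vocabulary index.
val toIndices = udf((words: Seq[String]) => words.map(w => vocab.indexOf(w)))
df.withColumn("indices", toIndices(col("words"))).show(false)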

Using CategoricalFeaturesInfo with DecisionTreeClassifier method in Spark

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-02 09:23:16
I have to use this code:

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setImpurity(impurity)
  .setMaxBins(maxBins)
  .setMaxDepth(maxDepth)

I need to add the categorical features information so that the decision tree doesn't treat the indexedCategoricalFeatures as numerical. I have this map:

val categoricalFeaturesInfo = Map(143 -> 126, 144 -> 5, 145 -> 216, 146 -> 100, 147 -> 14, 148 -> 8, 149 -> 19, 150 -> 7)

However, it only works with the DecisionTree.trainClassifier method. I can't use this method because it accepts different arguments than the
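
In the DataFrame-based API there is no categoricalFeaturesInfo parameter; categorical arity travels as metadata on the features column instead. A minimal sketch, assuming a DataFrame named data with a "features" vector column: run VectorIndexer ahead of the tree so that features with few enough distinct values are tagged as categorical in the metadata that DecisionTreeClassifier reads.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.VectorIndexer

// Features with <= maxCategories distinct values are marked categorical;
// 216 covers the largest arity in the map above. Beware that this heuristic
// will also tag continuous columns that happen to have few distinct values.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(216)

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxBins(216)  // maxBins must be at least the largest category arity

val model = new Pipeline().setStages(Array(featureIndexer, dt)).fit(data)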

Text classification - how to approach

Submitted by 谁说我不能喝 on 2019-12-02 09:20:56
Question: I'll try to describe what I have in mind. There is text content stored in an MS SQL database. Content arrives daily as a stream. Some people go through the content every day and, if it fits certain criteria, mark it as validated. There is only one category: an item is either "valid" or not. What I want is to create a model based on the already validated content, save it, and use this model to "pre-validate" or mark new incoming content. Also, once in a while, to update the model based on a newly
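
This is a standard binary text-classification setup, and Spark ML's Pipeline covers the train/save/score cycle. A rough sketch, assuming a hypothetical table validated_content with a text column and a 0/1 label column (the table name and save path are invented for illustration):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Train on the human-validated rows: raw text in, 0/1 label out.
val training = spark.table("validated_content")   // hypothetical: columns text, label

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val model = new Pipeline().setStages(Array(tokenizer, tf, lr)).fit(training)

// Persist the fitted pipeline; reload it later to pre-validate new content,
// and re-run fit periodically as more validated rows accumulate.
model.write.overwrite().save("/models/prevalidator")   // hypothetical path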

How to convert an mllib Matrix to a Spark DataFrame?

Submitted by 岁酱吖の on 2019-12-02 07:29:34
I want to pretty-print the result of a correlation in a Zeppelin notebook:

val Row(coeff: Matrix) = Correlation.corr(data, "features").head

One way to achieve this is to convert the result into a DataFrame with each value in a separate column and call z.show(). However, looking into the Matrix API I don't see any way to do this. Is there another straightforward way to achieve this?

Edit: The DataFrame has 50 columns. Just converting to a string would not help, as the output gets truncated.

Using the toString method should be the easiest and fastest way if you simply want to print the
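
A minimal sketch of the conversion, assuming an active SparkSession named spark (Zeppelin provides one): iterate the matrix rows via Matrix.rowIter and build a DataFrame with one DoubleType column per matrix column, which z.show() can then render as a table.

import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

def matrixToDF(m: Matrix): DataFrame = {
  // One column per matrix column, named c0..c(N-1).
  val schema = StructType((0 until m.numCols).map(i => StructField(s"c$i", DoubleType)))
  val rows = m.rowIter.map(v => Row.fromSeq(v.toArray.toSeq)).toList
  spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
}

// In the notebook: z.show(matrixToDF(coeff))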

String similarity with OR condition in MinHash Spark ML

Submitted by …衆ロ難τιáo~ on 2019-12-02 06:43:43
I have two datasets; the first is a large reference dataset, and for each record in the second dataset I want to find its best match in the first via the MinHash algorithm.

val dataset1 =
+-------------+----------+------+------+-----------------------+
|           x'|        y'|    a'|    b'|   dataString(x'+y'+a')|
+-------------+----------+------+------+-----------------------+
|         John|     Smith| 55649| 28200|       John|Smith|55649|
|         Emma|   Morales| 78439| 34200|     Emma|Morales|78439|
|        Janet|  Alvarado| 89488| 29103|   Janet|Alvarado|89488|
|    Elizabeth|         K| 36935| 38101|      Elizabeth|K|36935|
|      Cristin|      Cruz| 75716| 70015|     Cristin|Cruz|75716|
|         Jack|
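
A minimal sketch of the matching step with Spark ML's MinHashLSH, assuming both frames carry the dataString column shown above (the threshold, feature size, and hash-table count are illustrative): tokenize the string, hash the tokens into a sparse vector, fit the LSH model on the reference data, and use an approximate similarity join to pull candidate pairs.

import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, RegexTokenizer}

// Split the concatenated dataString on "|" into its component tokens.
val tok = new RegexTokenizer().setInputCol("dataString").setOutputCol("tokens").setPattern("\\|")
val tf  = new HashingTF().setInputCol("tokens").setOutputCol("features").setNumFeatures(1 << 18)

val d1 = tf.transform(tok.transform(dataset1))
val d2 = tf.transform(tok.transform(dataset2))

val lsh = new MinHashLSH().setInputCol("features").setOutputCol("hashes").setNumHashTables(5)
val model = lsh.fit(d1)

// Candidate matches: pairs whose Jaccard distance falls under the threshold.
val matches = model.approxSimilarityJoin(d1, d2, 0.6, "jaccardDistance")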

How to eval spark.ml model without DataFrames/SparkContext?

Submitted by ℡╲_俬逩灬. on 2019-12-02 06:07:49
Question: With Spark MLlib, I'd build a model (like RandomForest), and it was then possible to evaluate it outside of Spark by loading the model and calling predict on it, passing a vector of features. It seems that with Spark ML, predict is now called transform and only acts on a DataFrame. Is there any way to build a DataFrame outside of Spark, since it seems one needs a SparkContext to build a DataFrame? Am I missing something?

Answer 1: Re: Is there any way to build a DataFrame outside of Spark? It is
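
A minimal sketch of one route, assuming Spark 3.x, where predict(Vector) on prediction models is public: load the model once (loading still requires a local SparkSession), then score plain in-memory feature vectors without building a DataFrame per request. The model path is hypothetical.

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.linalg.Vectors

// Loading goes through Spark, so keep a lightweight local session alive.
val model = RandomForestClassificationModel.load("/models/rf")   // hypothetical path

// Scoring afterwards is a plain method call on an in-memory vector.
val prediction: Double = model.predict(Vectors.dense(0.1, 2.0, 0.5))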

Spark not utilizing all the cores while running LinearRegressionWithSGD

Submitted by 对着背影说爱祢 on 2019-12-02 05:30:11
Question: I am running Spark on my local machine (16 GB RAM, 8 CPU cores). I was trying to train a linear regression model on a dataset of about 300 MB. Checking the CPU statistics and the running processes, I see that only one thread executes. The documentation says a distributed version of SGD is implemented: http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
from pyspark import
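
The usual culprits are a single-threaded master URL (plain "local" means one worker thread) and too few input partitions for the gradient tasks to spread across cores. A minimal sketch of both fixes (in Scala rather than the question's PySpark, with a hypothetical input path); the same settings apply through the Python API.

import org.apache.spark.sql.SparkSession

// "local[*]" uses every available core; plain "local" pins Spark to one thread.
val spark = SparkSession.builder()
  .appName("sgd-local")
  .master("local[*]")
  .getOrCreate()

// Ask for enough partitions that each core gets gradient tasks to work on.
val lines = spark.sparkContext.textFile("data/lpsa.data", minPartitions = 8)   // hypothetical path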

Spark: How to get probabilities and AUC for Bernoulli Naive Bayes?

Submitted by ε祈祈猫儿з on 2019-12-02 05:14:57
Question: I'm running a Bernoulli Naive Bayes using this code:

val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")

My question is: how can I get the probability of membership in class 0 (or 1) and compute the AUC? I want a result similar to LogisticRegressionWithSGD or SVMWithSGD, where I was using this code:

val numIterations = 100
val model = SVMWithSGD.train(training,
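
A minimal sketch, assuming the mllib NaiveBayesModel above and an RDD[LabeledPoint] named test: predictProbabilities returns the class posteriors (ordered as in model.labels), so pairing the probability of the positive class with the true label feeds straight into BinaryClassificationMetrics for the AUC.

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Index of label 1.0 in the model's class ordering, which is not
// guaranteed to be (0.0, 1.0).
val posIdx = model.labels.indexOf(1.0)

val scoreAndLabels = test.map { point =>
  (model.predictProbabilities(point.features)(posIdx), point.label)
}

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"AUC = ${metrics.areaUnderROC()}")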

SparkR from RStudio - gives Error in invokeJava(isStatic = TRUE, className, methodName, …) :

Submitted by 限于喜欢 on 2019-12-02 02:54:49
I am using RStudio. After creating the session, if I try to create a DataFrame from R data, it gives an error.

Sys.setenv(SPARK_HOME = "E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(HADOOP_HOME = "E:/winutils")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
Sys.setenv('SPARKR_SUBMIT_ARGS'='"sparkr-shell"')
library(SparkR)
sparkR.session(sparkConfig = list(spark.sql.warehouse.dir="C:/Temp"))

localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
df <- createDataFrame(localDF)

ERROR:
Error in invokeJava(isStatic = TRUE, className,