apache-spark-mllib

Spark: StringIndexer on sentences

Submitted by 旧巷老猫 on 2019-12-02 10:30:09
Question: I am trying to apply something like StringIndexer to a column of sentences, i.e. transforming a list of words into a list of integers. For example, given the input dataset:

(1, ["I", "like", "Spark"])
(2, ["I", "hate", "Spark"])

I expected the output after StringIndexer to look like:

(1, [0, 2, 1])
(2, [0, 3, 1])

Ideally, I would like to make this transformation part of a Pipeline, so that I can chain a couple of transformers together and serialize the result for online serving. Is this something Spark supports natively? Thank you!
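
StringIndexer itself only indexes a single string column, not an array of strings, so this is not supported out of the box. A minimal sketch of one workaround, assuming a SparkSession named spark is in scope: fit a CountVectorizer to obtain a vocabulary, then map each word to its position in that vocabulary with a UDF. Note that CountVectorizer assigns indices by corpus frequency, so the exact integers may differ from the example above.

import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.functions._

val df = spark.createDataFrame(Seq(
  (1, Seq("I", "like", "Spark")),
  (2, Seq("I", "hate", "Spark"))
)).toDF("id", "words")

// Fit a vocabulary over the word arrays; the model's vocabulary array
// maps each term to its index (its position in the array).
val cvModel = new CountVectorizer().setInputCol("words").setOutputCol("vec").fit(df)
val vocab = cvModel.vocabulary

// Replace each word with its vocabulary index.
val toIndices = udf((words: Seq[String]) => words.map(w => vocab.indexOf(w)))
df.withColumn("indices", toIndices(col("words"))).show(false)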

Using CategoricalFeaturesInfo with DecisionTreeClassifier method in Spark

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-02 09:23:16
I have to use this code:

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setImpurity(impurity)
  .setMaxBins(maxBins)
  .setMaxDepth(maxDepth)

I need to add the categorical features information so that the decision tree doesn't treat the indexedCategoricalFeatures as numerical. I have this map:

val categoricalFeaturesInfo = Map(143 -> 126, 144 -> 5, 145 -> 216, 146 -> 100, 147 -> 14, 148 -> 8, 149 -> 19, 150 -> 7)

However, it only works with the DecisionTree.trainClassifier method. I can't use this method because it accepts different arguments than the
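
In the DataFrame-based API there is no categoricalFeaturesInfo parameter; categorical arity travels as metadata on the features column instead. A minimal sketch, assuming a DataFrame named data with a "features" vector column: run VectorIndexer ahead of the tree so that features with few enough distinct values are tagged as categorical in the metadata that DecisionTreeClassifier reads.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.VectorIndexer

// Features with <= maxCategories distinct values are marked categorical;
// 216 covers the largest arity in the map above. Beware that this heuristic
// will also tag continuous columns that happen to have few distinct values.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(216)

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxBins(216)  // maxBins must be at least the largest category arity

val model = new Pipeline().setStages(Array(featureIndexer, dt)).fit(data)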

Text classification - how to approach

Submitted by 谁说我不能喝 on 2019-12-02 09:20:56
Question: I'll try to describe what I have in mind. There is text content stored in an MS SQL database. Content arrives daily as a stream. Some people go through the content every day and, if it fits certain criteria, mark it as validated. There is only one category: an item is either "valid" or not. What I want is to create a model based on the already validated content, save it, and use this model to "pre-validate" or mark new incoming content. Also, once in a while, to update the model based on a newly
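
This is a standard binary text-classification setup, and Spark ML's Pipeline covers the train/save/score cycle. A rough sketch, assuming a hypothetical table validated_content with a text column and a 0/1 label column (the table name and save path are invented for illustration):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Train on the human-validated rows: raw text in, 0/1 label out.
val training = spark.table("validated_content")   // hypothetical: columns text, label

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val model = new Pipeline().setStages(Array(tokenizer, tf, lr)).fit(training)

// Persist the fitted pipeline; reload it later to pre-validate new content,
// and re-run fit periodically as more validated rows accumulate.
model.write.overwrite().save("/models/prevalidator")   // hypothetical path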

How to convert an mllib Matrix to a Spark DataFrame?

Submitted by 岁酱吖の on 2019-12-02 07:29:34
I want to pretty-print the result of a correlation in a Zeppelin notebook:

val Row(coeff: Matrix) = Correlation.corr(data, "features").head

One way to achieve this is to convert the result into a DataFrame with each value in a separate column and call z.show(). However, looking into the Matrix API I don't see any way to do this. Is there another straightforward way to achieve this?

Edit: The DataFrame has 50 columns. Just converting to a string would not help, as the output gets truncated.

Using the toString method should be the easiest and fastest way if you simply want to print the
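
A minimal sketch of the conversion, assuming an active SparkSession named spark (Zeppelin provides one): iterate the matrix rows via Matrix.rowIter and build a DataFrame with one DoubleType column per matrix column, which z.show() can then render as a table.

import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

def matrixToDF(m: Matrix): DataFrame = {
  // One column per matrix column, named c0..c(N-1).
  val schema = StructType((0 until m.numCols).map(i => StructField(s"c$i", DoubleType)))
  val rows = m.rowIter.map(v => Row.fromSeq(v.toArray.toSeq)).toList
  spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
}

// In the notebook: z.show(matrixToDF(coeff))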

String similarity with OR condition in MinHash Spark ML

Submitted by …衆ロ難τιáo~ on 2019-12-02 06:43:43
I have two datasets; the first is a large reference dataset, and for each record in the second dataset I want to find its best match in the first via the MinHash algorithm.

val dataset1 =
+-------------+----------+------+------+-----------------------+
|           x'|        y'|    a'|    b'|   dataString(x'+y'+a')|
+-------------+----------+------+------+-----------------------+
|         John|     Smith| 55649| 28200|       John|Smith|55649|
|         Emma|   Morales| 78439| 34200|     Emma|Morales|78439|
|        Janet|  Alvarado| 89488| 29103|   Janet|Alvarado|89488|
|    Elizabeth|         K| 36935| 38101|      Elizabeth|K|36935|
|      Cristin|      Cruz| 75716| 70015|     Cristin|Cruz|75716|
|         Jack|
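
A minimal sketch of the matching step with Spark ML's MinHashLSH, assuming both frames carry the dataString column shown above (the threshold, feature size, and hash-table count are illustrative): tokenize the string, hash the tokens into a sparse vector, fit the LSH model on the reference data, and use an approximate similarity join to pull candidate pairs.

import org.apache.spark.ml.feature.{HashingTF, MinHashLSH, RegexTokenizer}

// Split the concatenated dataString on "|" into its component tokens.
val tok = new RegexTokenizer().setInputCol("dataString").setOutputCol("tokens").setPattern("\\|")
val tf  = new HashingTF().setInputCol("tokens").setOutputCol("features").setNumFeatures(1 << 18)

val d1 = tf.transform(tok.transform(dataset1))
val d2 = tf.transform(tok.transform(dataset2))

val lsh = new MinHashLSH().setInputCol("features").setOutputCol("hashes").setNumHashTables(5)
val model = lsh.fit(d1)

// Candidate matches: pairs whose Jaccard distance falls under the threshold.
val matches = model.approxSimilarityJoin(d1, d2, 0.6, "jaccardDistance")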

How to eval spark.ml model without DataFrames/SparkContext?

Submitted by ℡╲_俬逩灬. on 2019-12-02 06:07:49
Question: With Spark MLlib, I'd build a model (like RandomForest), and it was then possible to evaluate it outside of Spark by loading the model and calling predict on it, passing a vector of features. It seems that with Spark ML, predict is now called transform and only acts on a DataFrame. Is there any way to build a DataFrame outside of Spark, since it seems one needs a SparkContext to build a DataFrame? Am I missing something?

Answer 1: Re: Is there any way to build a DataFrame outside of Spark? It is
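
A minimal sketch of one route, assuming Spark 3.x, where predict(Vector) on prediction models is public: load the model once (loading still requires a local SparkSession), then score plain in-memory feature vectors without building a DataFrame per request. The model path is hypothetical.

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.linalg.Vectors

// Loading goes through Spark, so keep a lightweight local session alive.
val model = RandomForestClassificationModel.load("/models/rf")   // hypothetical path

// Scoring afterwards is a plain method call on an in-memory vector.
val prediction: Double = model.predict(Vectors.dense(0.1, 2.0, 0.5))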

Spark not utilizing all the cores while running LinearRegressionWithSGD

Submitted by 对着背影说爱祢 on 2019-12-02 05:30:11
Question: I am running Spark on my local machine (16 GB RAM, 8 CPU cores). I was trying to train a linear regression model on a dataset of about 300 MB. Checking the CPU statistics and the running processes, I see that only one thread executes. The documentation says a distributed version of SGD is implemented: http://spark.apache.org/docs/latest/mllib-linear-methods.html#implementation-developer

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
from pyspark import
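
The usual culprits are a single-threaded master URL (plain "local" means one worker thread) and too few input partitions for the gradient tasks to spread across cores. A minimal sketch of both fixes (in Scala rather than the question's PySpark, with a hypothetical input path); the same settings apply through the Python API.

import org.apache.spark.sql.SparkSession

// "local[*]" uses every available core; plain "local" pins Spark to one thread.
val spark = SparkSession.builder()
  .appName("sgd-local")
  .master("local[*]")
  .getOrCreate()

// Ask for enough partitions that each core gets gradient tasks to work on.
val lines = spark.sparkContext.textFile("data/lpsa.data", minPartitions = 8)   // hypothetical path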

Spark: How to get probabilities and AUC for Bernoulli Naive Bayes?

Submitted by ε祈祈猫儿з on 2019-12-02 05:14:57
Question: I'm running a Bernoulli Naive Bayes using this code:

val splits = MyData.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)
val model = NaiveBayes.train(training, lambda = 3.0, modelType = "bernoulli")

My question is: how can I get the probability of membership in class 0 (or 1) and compute the AUC? I want a result similar to LogisticRegressionWithSGD or SVMWithSGD, where I was using this code:

val numIterations = 100
val model = SVMWithSGD.train(training,
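
A minimal sketch, assuming the mllib NaiveBayesModel above and an RDD[LabeledPoint] named test: predictProbabilities returns the class posteriors (ordered as in model.labels), so pairing the probability of the positive class with the true label feeds straight into BinaryClassificationMetrics for the AUC.

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Index of label 1.0 in the model's class ordering, which is not
// guaranteed to be (0.0, 1.0).
val posIdx = model.labels.indexOf(1.0)

val scoreAndLabels = test.map { point =>
  (model.predictProbabilities(point.features)(posIdx), point.label)
}

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"AUC = ${metrics.areaUnderROC()}")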

SparkR from RStudio - gives Error in invokeJava(isStatic = TRUE, className, methodName, …) :

Submitted by 限于喜欢 on 2019-12-02 02:54:49
I am using RStudio. After creating the session, if I try to create a DataFrame from R data, it gives an error.

Sys.setenv(SPARK_HOME = "E:/spark-2.0.0-bin-hadoop2.7/spark-2.0.0-bin-hadoop2.7")
Sys.setenv(HADOOP_HOME = "E:/winutils")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
Sys.setenv('SPARKR_SUBMIT_ARGS'='"sparkr-shell"')
library(SparkR)
sparkR.session(sparkConfig = list(spark.sql.warehouse.dir="C:/Temp"))

localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
df <- createDataFrame(localDF)

ERROR:
Error in invokeJava(isStatic = TRUE, className,