apache-spark-ml

Why doesn't spark.ml implement any of the spark.mllib algorithms?

痞子三分冷 posted on 2019-11-30 17:32:47
Following the Spark MLlib Guide, we can read that Spark has two machine learning libraries: spark.mllib, built on top of RDDs, and spark.ml, built on top of DataFrames. According to this and this question on StackOverflow, DataFrames are better (and newer) than RDDs and should be used whenever possible. The problem is that I want to use common machine learning algorithms (e.g. Frequent Pattern Mining, Naive Bayes, etc.) and spark.ml (for DataFrames) doesn't provide such methods; only spark.mllib (for RDDs) provides these algorithms. If DataFrames are better than RDDs and the referred guide
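A minimal sketch of the usual workaround, under the assumption that the data already lives in a DataFrame df with a numeric label column and a features column holding an array of doubles (all names here are hypothetical): drop down to an RDD of LabeledPoint and call the RDD-based spark.mllib implementation directly.

from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

# convert the DataFrame rows into the LabeledPoint format spark.mllib expects
training_rdd = df.rdd.map(lambda row: LabeledPoint(row.label, row.features))

# train the RDD-based Naive Bayes model; lambda_ is the smoothing parameter
model = NaiveBayes.train(training_rdd, lambda_=1.0)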

PySpark - Get all parameters of models created with ParamGridBuilder

北城余情 posted on 2019-11-30 14:57:08
I'm using PySpark 2.0 for a Kaggle competition. I'd like to know the behavior of a model (RandomForest) depending on different parameters. ParamGridBuilder() allows you to specify different values for a single parameter and then perform (I guess) a Cartesian product of the entire set of parameters. Assuming my DataFrame is already defined:

rdc = RandomForestClassifier()
pipeline = Pipeline(stages=STAGES + [rdc])
paramGrid = ParamGridBuilder().addGrid(rdc.maxDepth, [3, 10, 20]) \
    .addGrid(rdc.minInfoGain, [0.01, 0.001]) \
    .addGrid(rdc.numTrees, [5, 10, 20, 30]) \
    .build()
evaluator =
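A minimal sketch of one way to inspect the grid afterwards, assuming the grid above was run through a CrossValidator whose fitted result is cvModel (cvModel, train_df and evaluator are hypothetical names): each entry of paramGrid lines up with one entry of cvModel.avgMetrics, and the parameters actually used by the best model can be read back with extractParamMap().

# pair every parameter combination with its cross-validated metric
for params, metric in zip(paramGrid, cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, metric)

# parameters used by the winning RandomForest stage of the pipeline
best_rf = cvModel.bestModel.stages[-1]
print(best_rf.extractParamMap())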

ALS model - predicted full_u * v^t * v ratings are very high

半腔热情 posted on 2019-11-30 13:41:46
I'm predicting ratings in between processes that batch train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v?

! rm -rf ml-1m.zip ml-1m
! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .

from pyspark.mllib.recommendation import Rating
ratingsRDD = sc.textFile('ratings.dat') \
    .map(lambda l: l.split("::")) \
    .map(lambda p: Rating(
        user=int(p[0]),
        product=int(p[1]),
        rating=float(p[2]),
    )).cache()

from pyspark.mllib.recommendation import ALS
rank = 50
numIterations = 20
lambdaParam =
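For reference, a minimal numpy sketch of the fold-in computation the title refers to, assuming model is a trained MatrixFactorizationModel and full_u is a 1 x numItems numpy vector of the new user's known ratings (both names are hypothetical). With the item-factor matrix V laid out as numItems x rank, the product is full_u · V · Vᵀ, the same quantity as in the title up to how V is stored; the raw result is an unnormalised dot product per item, not a calibrated rating, which is one reason the values can land far above the 1-5 scale.

import numpy as np

# item-factor matrix V, one row per product, in product-id order
V = np.array([f for _, f in sorted(model.productFeatures().collect())])

user_factors = full_u.dot(V)    # project the new user into the latent space (1 x rank)
scores = user_factors.dot(V.T)  # one unnormalised score per item (1 x numItems)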

pyspark : NameError: name 'spark' is not defined

爱⌒轻易说出口 posted on 2019-11-30 11:53:03
Question: I am copying the pyspark.ml example from the official documentation website: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Transformer

data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)

However, the example above wouldn't run and gave me the following errors: ---------------------------------------
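A minimal sketch of the usual fix: the documentation snippet assumes a SparkSession named spark already exists (it does in the pyspark shell and in notebooks launched through it), so in a standalone script it has to be created first; the appName value here is an arbitrary example.

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

# create the session the example relies on; getOrCreate reuses one if it already exists
spark = SparkSession.builder.appName("kmeans-example").getOrCreate()

data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
model = KMeans(k=2, seed=1).fit(df)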

How to access individual trees in a model created by RandomForestClassifier (spark.ml-version)?

馋奶兔 posted on 2019-11-30 10:06:34
How do I access individual trees in a model generated by Spark ML's RandomForestClassifier? I am using the Scala version of RandomForestClassifier. The model actually has a trees attribute:

import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier, DecisionTreeClassificationModel}

val meta = NominalAttribute
  .defaultAttr
  .withName("label")
  .withValues("0.0", "1.0")
  .toMetadata

val data = sqlContext.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")
  .withColumn("label", $"label".as("label", meta
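The same member is mirrored in PySpark; a minimal sketch of iterating over the individual trees of a fitted RandomForestClassificationModel, assuming train_df is a DataFrame with label and features columns (hypothetical name):

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(numTrees=3, maxDepth=4, labelCol="label", featuresCol="features")
model = rf.fit(train_df)

# model.trees is a list of DecisionTreeClassificationModel, one per tree
for i, tree in enumerate(model.trees):
    print(i, tree.numNodes, tree.depth)
    print(tree.toDebugString)   # text dump of that single tree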

How to convert ArrayType to DenseVector in PySpark DataFrame?

馋奶兔 posted on 2019-11-30 07:01:17
Question: I'm getting the following error while trying to build an ML Pipeline:

pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,true).'

My features column contains an array of floating-point values. It sounds like I need to convert those to some type of vector (it's not sparse, so a DenseVector?). Is there a way to do this directly on the DataFrame or do I need to convert to an
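A minimal sketch of one common fix, assuming df has a features column of type array<double> (the column type comes from the error message, the df and UDF names are arbitrary): wrap each array in an ml DenseVector so the column type becomes VectorUDT.

from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

# UDF that turns array<double> into the vector type spark.ml expects
to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
df_vec = df.withColumn("features", to_vector("features"))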

Can we update an existing model in spark-ml/spark-mllib?

 ̄綄美尐妖づ posted on 2019-11-30 06:03:47
Question: We are using spark-ml to build a model from existing data. New data comes in on a daily basis. Is there a way to read only the new data and update the existing model, without having to read all the data and retrain every time?

Answer 1: It depends on the model you're using, but for some of them Spark does exactly what you want. You can look at StreamingKMeans, StreamingLinearRegressionWithSGD, StreamingLogisticRegressionWithSGD and, more broadly, StreamingLinearAlgorithm.

Answer 2: To complete Florent's
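A minimal sketch of the streaming route from answer 1, assuming ssc is an existing StreamingContext and training_stream is a DStream of pyspark.mllib vectors (all names hypothetical): the model's centroids keep updating as new batches arrive instead of being retrained from scratch.

from pyspark.mllib.clustering import StreamingKMeans

# decayFactor < 1 down-weights old batches as new data flows in
model = StreamingKMeans(k=3, decayFactor=0.8).setRandomCenters(dim=2, weight=1.0, seed=42)
model.trainOn(training_stream)   # incrementally updates the cluster centers

ssc.start()
ssc.awaitTermination()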

Apache Spark throws NullPointerException when encountering missing feature

别来无恙 posted on 2019-11-30 06:00:44
Question: I have a bizarre issue with PySpark when indexing a column of strings in features. Here is my tmp.csv file:

x0,x1,x2,x3
asd2s,1e1e,1.1,0
asd2s,1e1e,0.1,0
,1e3e,1.2,0
bd34t,1e1e,5.1,1
asd2s,1e3e,0.2,0
bd34t,1e2e,4.3,1

where I have one missing value for 'x0'. At first, I'm reading the features from the csv file into a DataFrame using pyspark_csv: https://github.com/seahboonsiew/pyspark-csv and then indexing x0 with StringIndexer:

import pyspark_csv as pycsv
from pyspark.ml.feature import StringIndexer

sc
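A minimal sketch of one common workaround, assuming df is the DataFrame read from tmp.csv with columns x0..x3 (the df name is hypothetical): fill the missing string before indexing, and/or tell StringIndexer to skip rows it cannot index, so the null never reaches the indexer.

from pyspark.ml.feature import StringIndexer

# replace the null in the string column with an explicit placeholder
df_clean = df.na.fill({"x0": "missing"})

indexer = StringIndexer(inputCol="x0", outputCol="x0_idx", handleInvalid="skip")
indexed = indexer.fit(df_clean).transform(df_clean)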

PySpark: How to get classification probabilities from MultilayerPerceptronClassifier?

試著忘記壹切 posted on 2019-11-30 04:20:27
Question: I'm using Spark 2.0.1 in Python; my dataset is in a DataFrame, so I'm using the ML (not MLlib) library for machine learning. I have a multilayer perceptron classifier and I have only two labels. My question is: is it possible to get not only the labels, but also (or only) the probability for each label? Not just 0 or 1 for every input, but something like 0.95 for 0 and 0.05 for 1. If this is not possible with MLP but is possible with another classifier, I can change the classifier. I have
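A minimal sketch of the classifier-swap option mentioned in the question, assuming train_df and test_df are DataFrames with label and features columns (hypothetical names): LogisticRegression, like most spark.ml classifiers apart from the Spark 2.0-era MLP, writes a per-class probability vector to its probability column.

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(train_df)

# "probability" holds a 2-element vector, e.g. [0.95, 0.05] for a binary problem
model.transform(test_df).select("prediction", "probability").show(truncate=False)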

How do I convert an RDD with a SparseVector column to a DataFrame with a column as Vector?

白昼怎懂夜的黑 posted on 2019-11-30 03:40:44
I have an RDD with tuples of values (String, SparseVector) and I want to create a DataFrame from the RDD, to get a (label: string, features: vector) DataFrame, which is the schema required by most of the ml algorithm libraries. I know it can be done, because the HashingTF ml library outputs a vector when given a features column of a DataFrame.

temp_df = sqlContext.createDataFrame(temp_rdd, StructType([
    StructField("label", DoubleType(), False),
    StructField("tokens", ArrayType(StringType()), False)
]))
# assuming there is an RDD of (double, array(strings))

hashingTF = HashingTF(numFeatures
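A minimal sketch of the direct conversion, assuming rdd holds (string, SparseVector) tuples whose vectors were built with pyspark.ml.linalg (if they came from pyspark.mllib.linalg, use that package's VectorUDT instead); the rdd name is hypothetical.

from pyspark.sql.types import StructType, StructField, StringType
from pyspark.ml.linalg import VectorUDT

schema = StructType([
    StructField("label", StringType(), False),
    StructField("features", VectorUDT(), False),
])

# the vector column keeps its UDT type, so spark.ml stages can consume it directly
df = sqlContext.createDataFrame(rdd, schema)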