apache-spark-ml

Why doesn't spark.ml implement any of the spark.mllib algorithms?

痞子三分冷 posted on 2019-11-30 17:32:47
Following the Spark MLlib Guide, we can read that Spark has two machine learning libraries: spark.mllib, built on top of RDDs, and spark.ml, built on top of DataFrames. According to this and this question on StackOverflow, DataFrames are better (and newer) than RDDs and should be used whenever possible. The problem is that I want to use common machine learning algorithms (e.g. Frequent Pattern Mining, Naive Bayes, etc.) and spark.ml (for DataFrames) doesn't provide such methods; only spark.mllib (for RDDs) provides these algorithms. If DataFrames are better than RDDs and the referred guide
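A minimal sketch of the usual workaround, under the assumption that the data already lives in a DataFrame df with a numeric label column and a features column holding an array of doubles (all names here are hypothetical): drop down to an RDD of LabeledPoint and call the RDD-based spark.mllib implementation directly.

from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

# convert the DataFrame rows into the LabeledPoint format spark.mllib expects
training_rdd = df.rdd.map(lambda row: LabeledPoint(row.label, row.features))

# train the RDD-based Naive Bayes model; lambda_ is the smoothing parameter
model = NaiveBayes.train(training_rdd, lambda_=1.0)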

PySpark - Get all parameters of models created with ParamGridBuilder

北城余情 posted on 2019-11-30 14:57:08
I'm using PySpark 2.0 for a Kaggle competition. I'd like to know the behavior of a model (RandomForest) depending on different parameters. ParamGridBuilder() allows you to specify different values for a single parameter and then perform (I guess) a Cartesian product of the entire set of parameters. Assuming my DataFrame is already defined:

rdc = RandomForestClassifier()
pipeline = Pipeline(stages=STAGES + [rdc])
paramGrid = ParamGridBuilder().addGrid(rdc.maxDepth, [3, 10, 20]) \
    .addGrid(rdc.minInfoGain, [0.01, 0.001]) \
    .addGrid(rdc.numTrees, [5, 10, 20, 30]) \
    .build()
evaluator =
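A minimal sketch of one way to inspect the grid afterwards, assuming the grid above was run through a CrossValidator whose fitted result is cvModel (cvModel, train_df and evaluator are hypothetical names): each entry of paramGrid lines up with one entry of cvModel.avgMetrics, and the parameters actually used by the best model can be read back with extractParamMap().

# pair every parameter combination with its cross-validated metric
for params, metric in zip(paramGrid, cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, metric)

# parameters used by the winning RandomForest stage of the pipeline
best_rf = cvModel.bestModel.stages[-1]
print(best_rf.extractParamMap())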

ALS model - predicted full_u * v^t * v ratings are very high

半腔热情 posted on 2019-11-30 13:41:46
I'm predicting ratings in between processes that batch train the model. I'm using the approach outlined here: ALS model - how to generate full_u * v^t * v?

! rm -rf ml-1m.zip ml-1m
! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip
! unzip ml-1m.zip
! mv ml-1m/ratings.dat .

from pyspark.mllib.recommendation import Rating
ratingsRDD = sc.textFile('ratings.dat') \
    .map(lambda l: l.split("::")) \
    .map(lambda p: Rating(
        user=int(p[0]),
        product=int(p[1]),
        rating=float(p[2]),
    )).cache()

from pyspark.mllib.recommendation import ALS
rank = 50
numIterations = 20
lambdaParam =
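For reference, a minimal numpy sketch of the fold-in computation the title refers to, assuming model is a trained MatrixFactorizationModel and full_u is a 1 x numItems numpy vector of the new user's known ratings (both names are hypothetical). With the item-factor matrix V laid out as numItems x rank, the product is full_u · V · Vᵀ, the same quantity as in the title up to how V is stored; the raw result is an unnormalised dot product per item, not a calibrated rating, which is one reason the values can land far above the 1-5 scale.

import numpy as np

# item-factor matrix V, one row per product, in product-id order
V = np.array([f for _, f in sorted(model.productFeatures().collect())])

user_factors = full_u.dot(V)    # project the new user into the latent space (1 x rank)
scores = user_factors.dot(V.T)  # one unnormalised score per item (1 x numItems)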

pyspark : NameError: name 'spark' is not defined

爱⌒轻易说出口 posted on 2019-11-30 11:53:03
Question: I am copying the pyspark.ml example from the official documentation website: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Transformer

data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(df)

However, the example above wouldn't run and gave me the following errors: ---------------------------------------
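A minimal sketch of the usual fix: the documentation snippet assumes a SparkSession named spark already exists (it does in the pyspark shell and in notebooks launched through it), so in a standalone script it has to be created first; the appName value here is an arbitrary example.

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

# create the session the example relies on; getOrCreate reuses one if it already exists
spark = SparkSession.builder.appName("kmeans-example").getOrCreate()

data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
        (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)]
df = spark.createDataFrame(data, ["features"])
model = KMeans(k=2, seed=1).fit(df)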

How to access individual trees in a model created by RandomForestClassifier (spark.ml-version)?

馋奶兔 posted on 2019-11-30 10:06:34
How do I access individual trees in a model generated by Spark ML's RandomForestClassifier? I am using the Scala version of RandomForestClassifier. The model actually has a trees attribute:

import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier, DecisionTreeClassificationModel}

val meta = NominalAttribute
  .defaultAttr
  .withName("label")
  .withValues("0.0", "1.0")
  .toMetadata

val data = sqlContext.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")
  .withColumn("label", $"label".as("label", meta
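The same member is mirrored in PySpark; a minimal sketch of iterating over the individual trees of a fitted RandomForestClassificationModel, assuming train_df is a DataFrame with label and features columns (hypothetical name):

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(numTrees=3, maxDepth=4, labelCol="label", featuresCol="features")
model = rf.fit(train_df)

# model.trees is a list of DecisionTreeClassificationModel, one per tree
for i, tree in enumerate(model.trees):
    print(i, tree.numNodes, tree.depth)
    print(tree.toDebugString)   # text dump of that single tree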

How to convert ArrayType to DenseVector in PySpark DataFrame?

馋奶兔 posted on 2019-11-30 07:01:17
Question: I'm getting the following error while trying to build an ML Pipeline:

pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually ArrayType(DoubleType,true).'

My features column contains an array of floating-point values. It sounds like I need to convert those to some type of vector (it's not sparse, so a DenseVector?). Is there a way to do this directly on the DataFrame or do I need to convert to an
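A minimal sketch of one common fix, assuming df has a features column of type array<double> (the column type comes from the error message, the df and UDF names are arbitrary): wrap each array in an ml DenseVector so the column type becomes VectorUDT.

from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

# UDF that turns array<double> into the vector type spark.ml expects
to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
df_vec = df.withColumn("features", to_vector("features"))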

Can we update an existing model in spark-ml/spark-mllib?

 ̄綄美尐妖づ posted on 2019-11-30 06:03:47
Question: We are using spark-ml to build a model from existing data. New data comes in on a daily basis. Is there a way to read only the new data and update the existing model, without having to read all the data and retrain every time?

Answer 1: It depends on the model you're using, but for some of them Spark does exactly what you want. You can look at StreamingKMeans, StreamingLinearRegressionWithSGD, StreamingLogisticRegressionWithSGD and, more broadly, StreamingLinearAlgorithm.

Answer 2: To complete Florent's
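A minimal sketch of the streaming route from answer 1, assuming ssc is an existing StreamingContext and training_stream is a DStream of pyspark.mllib vectors (all names hypothetical): the model's centroids keep updating as new batches arrive instead of being retrained from scratch.

from pyspark.mllib.clustering import StreamingKMeans

# decayFactor < 1 down-weights old batches as new data flows in
model = StreamingKMeans(k=3, decayFactor=0.8).setRandomCenters(dim=2, weight=1.0, seed=42)
model.trainOn(training_stream)   # incrementally updates the cluster centers

ssc.start()
ssc.awaitTermination()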

Apache Spark throws NullPointerException when encountering missing feature

别来无恙 posted on 2019-11-30 06:00:44
Question: I have a bizarre issue with PySpark when indexing a column of strings in features. Here is my tmp.csv file:

x0,x1,x2,x3
asd2s,1e1e,1.1,0
asd2s,1e1e,0.1,0
,1e3e,1.2,0
bd34t,1e1e,5.1,1
asd2s,1e3e,0.2,0
bd34t,1e2e,4.3,1

where I have one missing value for 'x0'. At first, I'm reading the features from the csv file into a DataFrame using pyspark_csv: https://github.com/seahboonsiew/pyspark-csv and then indexing x0 with StringIndexer:

import pyspark_csv as pycsv
from pyspark.ml.feature import StringIndexer

sc
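A minimal sketch of one common workaround, assuming df is the DataFrame read from tmp.csv with columns x0..x3 (the df name is hypothetical): fill the missing string before indexing, and/or tell StringIndexer to skip rows it cannot index, so the null never reaches the indexer.

from pyspark.ml.feature import StringIndexer

# replace the null in the string column with an explicit placeholder
df_clean = df.na.fill({"x0": "missing"})

indexer = StringIndexer(inputCol="x0", outputCol="x0_idx", handleInvalid="skip")
indexed = indexer.fit(df_clean).transform(df_clean)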

PySpark: How to get classification probabilities from MultilayerPerceptronClassifier?

試著忘記壹切 posted on 2019-11-30 04:20:27
Question: I'm using Spark 2.0.1 in Python; my dataset is in a DataFrame, so I'm using the ML (not MLlib) library for machine learning. I have a multilayer perceptron classifier and I have only two labels. My question is: is it possible to get not only the labels, but also (or only) the probability for each label? Not just 0 or 1 for every input, but something like 0.95 for 0 and 0.05 for 1. If this is not possible with MLP but is possible with another classifier, I can change the classifier. I have
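A minimal sketch of the classifier-swap option mentioned in the question, assuming train_df and test_df are DataFrames with label and features columns (hypothetical names): LogisticRegression, like most spark.ml classifiers apart from the Spark 2.0-era MLP, writes a per-class probability vector to its probability column.

from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(train_df)

# "probability" holds a 2-element vector, e.g. [0.95, 0.05] for a binary problem
model.transform(test_df).select("prediction", "probability").show(truncate=False)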

How do I convert an RDD with a SparseVector column to a DataFrame with a column as Vector?

白昼怎懂夜的黑 posted on 2019-11-30 03:40:44
I have an RDD with tuples of values (String, SparseVector) and I want to create a DataFrame from the RDD, to get a (label: string, features: vector) DataFrame, which is the schema required by most of the ml algorithm libraries. I know it can be done, because the HashingTF ml library outputs a vector when given a features column of a DataFrame.

temp_df = sqlContext.createDataFrame(temp_rdd, StructType([
    StructField("label", DoubleType(), False),
    StructField("tokens", ArrayType(StringType()), False)
]))
# assuming there is an RDD of (double, array(strings))

hashingTF = HashingTF(numFeatures
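A minimal sketch of the direct conversion, assuming rdd holds (string, SparseVector) tuples whose vectors were built with pyspark.ml.linalg (if they came from pyspark.mllib.linalg, use that package's VectorUDT instead); the rdd name is hypothetical.

from pyspark.sql.types import StructType, StructField, StringType
from pyspark.ml.linalg import VectorUDT

schema = StructType([
    StructField("label", StringType(), False),
    StructField("features", VectorUDT(), False),
])

# the vector column keeps its UDT type, so spark.ml stages can consume it directly
df = sqlContext.createDataFrame(rdd, schema)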