apache-spark-ml

How to vectorize DataFrame columns for ML algorithms?

大城市里の小女人 submitted on 2019-11-26 17:07:49
Question: I have a DataFrame with some categorical string values (e.g. uuid|url|browser). I would like to convert them to doubles in order to run an ML algorithm that accepts a double matrix. As the conversion method I used StringIndexer (Spark 1.4), which maps my string values to double values, so I defined a function like this:

def str(arg: String, df: DataFrame): DataFrame = {
  val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg + "_index")
  val newDF = indexer.fit(df).transform(df)
  newDF
}

Now the issue …
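
A minimal sketch of indexing several string columns in one pass with a Pipeline, assuming df is the question's DataFrame and "uuid", "url", "browser" are its categorical columns:

```scala
// Index each categorical column, producing a "<col>_index" double column.
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer

val categoricalCols = Seq("uuid", "url", "browser")
val indexers = categoricalCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "_index")
}.toArray[PipelineStage]

val indexed = new Pipeline().setStages(indexers).fit(df).transform(df)
```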

How to train an ML model in sparklyr and predict new values on another dataframe?

萝らか妹 submitted on 2019-11-26 16:47:45
Question: Consider the following example:

dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))
dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0

Here I have the classic Naive …
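
For reference, the same fit-then-transform flow in Spark's native Scala API, which sparklyr wraps (the stages and the dtest name are illustrative, not the asker's code):

```scala
// Fit a text-classification pipeline on the training data, then score a
// *different* DataFrame by calling transform() on the fitted model.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}

val tokenizer  = new Tokenizer().setInputCol("text").setOutputCol("words")
val vectorizer = new CountVectorizer().setInputCol("words").setOutputCol("features")
val nb         = new NaiveBayes().setLabelCol("class").setFeaturesCol("features")

val model  = new Pipeline().setStages(Array(tokenizer, vectorizer, nb)).fit(dtrain)
val scored = model.transform(dtest)   // dtest: new data with a "text" column
```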

How to map features from the output of a VectorAssembler back to the column names in Spark ML?

删除回忆录丶 submitted on 2019-11-26 15:49:29
Question: I'm trying to run a linear regression in PySpark, and I want to create a table containing summary statistics such as coefficients, p-values, and t-values for each column in my dataset. However, in order to train a linear regression model I had to create a feature vector using Spark's VectorAssembler, and now for each row I have a single feature vector and the target column. When I try to access Spark's built-in regression summary statistics, they give me a very raw list of numbers for each of …
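
A sketch of pairing the assembler's input columns with the fitted coefficients (shown in Scala; the PySpark API mirrors it), assuming assembler and lrModel are the question's fitted VectorAssembler and linear-regression model:

```scala
// The i-th coefficient corresponds to the i-th assembler input column.
val featureNames = assembler.getInputCols        // Array[String]
val coefficients = lrModel.coefficients.toArray  // Array[Double]

featureNames.zip(coefficients).foreach { case (name, w) =>
  println(f"$name%-20s $w%.4f")
}
// lrModel.summary.pValues and .tValues line up the same way, with one
// trailing entry for the intercept when one is fitted.
```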

How to handle categorical features with spark-ml?

删除回忆录丶 submitted on 2019-11-26 14:27:08
How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier, LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame. Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol. However, the VectorAssembler only accepts …
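
The standard approach is to index and one-hot encode categorical columns before assembling. A sketch under the Spark 3.x API (on Spark 2.3–2.4 the encoder stage is named OneHotEncoderEstimator; the column names are placeholders):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// String category -> numeric index -> one-hot vector -> assembled features.
val indexer = new StringIndexer()
  .setInputCol("category").setOutputCol("categoryIdx")
val encoder = new OneHotEncoder()
  .setInputCols(Array("categoryIdx")).setOutputCols(Array("categoryVec"))
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryVec", "someNumericCol"))
  .setOutputCol("features")

val prepared = new Pipeline()
  .setStages(Array(indexer, encoder, assembler))
  .fit(df).transform(df)
```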

How to save models from ML Pipeline to S3 or HDFS?

夙愿已清 submitted on 2019-11-26 14:24:18
Question: I am trying to save thousands of models produced by an ML Pipeline. As indicated in the answer here, the models can be saved as follows:

import java.io._

def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close
}

schools.zip(bySchoolArrayModels).foreach { case (name, model) =>
  saveModel(name, model)
}

I have tried using s3://some/path/$name and /user/hadoop/some/path/$name as I would like the …
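
Note that java.io streams only write to the local filesystem. Since Spark 1.6, ML models carry a native writer that understands Hadoop-compatible URIs; a sketch with placeholder paths:

```scala
import org.apache.spark.ml.PipelineModel

// Persist to HDFS or S3 (the s3a filesystem must be configured).
model.write.overwrite().save("hdfs:///user/hadoop/models/someSchool")
model.write.overwrite().save("s3a://some-bucket/models/someSchool")

// Load it back later.
val restored = PipelineModel.load("hdfs:///user/hadoop/models/someSchool")
```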

Spark ML VectorAssembler returns strange output

て烟熏妆下的殇ゞ submitted on 2019-11-26 11:34:34
Question: I am experiencing very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this. My scenario is pretty straightforward. I parse data from a CSV file where I have some standard Int and Double fields, and I also calculate some extra columns. My parsing function returns this:

val joinedCounts = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))

My main function uses the parsing function like this:

val …
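
For context: VectorAssembler stores each row in whichever representation is smaller, so its output mixes dense vectors like [1.0,0.0,3.0] with sparse ones like (3,[0,2],[1.0,3.0]), which often reads as "strange output". A sketch of normalizing everything to dense, assuming "features" is the assembler's output column (on Spark 1.x, substitute org.apache.spark.mllib.linalg.Vector):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Force every row's vector into the dense representation.
val toDense  = udf { v: Vector => v.toDense }
val uniform = assembled.withColumn("features", toDense(col("features")))
```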

ALS model - how to generate full_u * v^t * v?

半世苍凉 submitted on 2019-11-26 11:29:26
Question: I'm trying to figure out how an ALS model can predict values for new users in between it being updated by a batch process. In my search, I came across this Stack Overflow answer. I've copied the answer below for the reader's convenience:

You can get predictions for new users using the trained model (without updating it): to get predictions for a user in the model, you use its latent representation (vector u of size f, the number of factors), which is multiplied by the product latent factor …
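
One reading of the title's product, sketched with Breeze (the sizes, values, and the names fullU and V are illustrative; in practice the item factors come from the trained model, e.g. model.productFeatures in spark.mllib):

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

val nItems = 100
val f = 10
// Stand-in for the item-factor matrix (nItems x f) of a trained ALS model.
val V = DenseMatrix.rand[Double](nItems, f)

// The new user's ratings over all items, zeros where unrated.
val fullU = DenseVector.zeros[Double](nItems)
fullU(0) = 5.0
fullU(42) = 3.0

// Project the ratings into factor space (V.t * fullU approximates the new
// user's latent vector), then back out to a predicted score per item.
val scores: DenseVector[Double] = V * (V.t * fullU)
```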

Spark Scala: How to convert DataFrame[vector] to DataFrame[f1: Double, …, fn: Double]

江枫思渺然 submitted on 2019-11-26 11:25:57
I just used StandardScaler to normalize my features for an ML application. After selecting the scaled features, I want to convert this back to a DataFrame of Doubles, though the lengths of my vectors are arbitrary. I know how to do it for a specific 3 features by using

myDF.map { case Row(v: Vector) => (v(0), v(1), v(2)) }.toDF("f1", "f2", "f3")

but not for an arbitrary number of features. Is there an easy way to do this? Example:

val testDF = sc.parallelize(List(
  Vectors.dense(5D, 6D, 7D),
  Vectors.dense(8D, 9D, 10D),
  Vectors.dense(11D, 12D, 13D)
)).map(Tuple1(_)).toDF("scaledFeatures")
val …
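
A sketch of splitting a vector column into one Double column per element when the width is only known at runtime (written against the spark.ml Vector type; on Spark 1.x substitute org.apache.spark.mllib.linalg.Vector):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, lit, udf}

// Extract element i from the vector column via a UDF.
val elem = udf { (v: Vector, i: Int) => v(i) }

// Peek at one row to learn the vector width, then add f1..fn columns.
val n = testDF.select("scaledFeatures").head.getAs[Vector](0).size
val flattened = (0 until n).foldLeft(testDF) { (df, i) =>
  df.withColumn(s"f${i + 1}", elem(col("scaledFeatures"), lit(i)))
}
```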

Encode and assemble multiple features in PySpark

北战南征 submitted on 2019-11-26 11:17:10
I have a Python class that I'm using to load and process some data in Spark. Among the various things I need to do, I'm generating a list of dummy variables derived from various columns in a Spark DataFrame. My problem is that I'm not sure how to properly define a user-defined function to accomplish what I need. I do currently have a method that, when mapped over the underlying DataFrame RDD, solves half the problem (remember that this is a method in a larger data_processor class):

def build_feature_arr(self, table):
    # this dict has keys for all the columns for which I need dummy coding
    categories …
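
Rather than hand-rolling dummy coding in a UDF, spark.ml's built-in stages can be generated per column. A sketch (in Scala; the PySpark classes have the same names), where the column list stands in for the question's dict keys:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// One indexer + encoder pair per categorical column.
val catCols = Seq("browser", "country")
val stages: Seq[PipelineStage] = catCols.flatMap { c =>
  Seq(
    new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"),
    new OneHotEncoder().setInputCols(Array(s"${c}_idx")).setOutputCols(Array(s"${c}_vec"))
  )
}
val assembler = new VectorAssembler()
  .setInputCols(catCols.map(_ + "_vec").toArray)
  .setOutputCol("features")

val features = new Pipeline()
  .setStages((stages :+ assembler).toArray)
  .fit(df).transform(df)
```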

How to create a custom Estimator in PySpark

懵懂的女人 submitted on 2019-11-26 09:38:06
Question: I am trying to build a simple custom Estimator in PySpark MLlib. I have read here that it is possible to write a custom Transformer, but I am not sure how to do it for an Estimator. I also don't understand what @keyword_only does and why I need so many setters and getters. Scikit-learn seems to have a proper document for custom models (see here) but PySpark doesn't. Pseudo-code of an example model:

class NormalDeviation():
    def __init__(self, threshold = 3):
    def fit(x, y=None):
        self.model = { …
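
A minimal sketch of the Estimator/Model split (shown in Scala; in PySpark one subclasses pyspark.ml.Estimator and pyspark.ml.Model analogously). It mirrors the NormalDeviation idea, hard-codes the column name "x", and skips the Param plumbing and persistence a production version would need:

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{abs, col, mean, stddev}
import org.apache.spark.sql.types.{BooleanType, StructType}

// The fitted side: holds the learned mean/stddev and flags outliers.
class NormalDeviationModel(override val uid: String,
                           mu: Double, sigma: Double, threshold: Double)
    extends Model[NormalDeviationModel] {
  def this(mu: Double, sigma: Double, t: Double) =
    this(Identifiable.randomUID("normalDevModel"), mu, sigma, t)
  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("is_outlier", abs(col("x") - mu) > sigma * threshold)
  override def transformSchema(schema: StructType): StructType =
    schema.add("is_outlier", BooleanType)
  override def copy(extra: ParamMap): NormalDeviationModel =
    new NormalDeviationModel(uid, mu, sigma, threshold)
}

// The estimator side: fit() learns the statistics and returns the model.
class NormalDeviation(override val uid: String, threshold: Double)
    extends Estimator[NormalDeviationModel] {
  def this(threshold: Double) = this(Identifiable.randomUID("normalDev"), threshold)
  override def fit(ds: Dataset[_]): NormalDeviationModel = {
    val row = ds.select(mean(col("x")), stddev(col("x"))).head
    new NormalDeviationModel(row.getDouble(0), row.getDouble(1), threshold)
      .setParent(this)
  }
  override def transformSchema(schema: StructType): StructType =
    schema.add("is_outlier", BooleanType)
  override def copy(extra: ParamMap): NormalDeviation =
    new NormalDeviation(uid, threshold)
}
```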