apache-spark-ml

How to vectorize DataFrame columns for ML algorithms?

大城市里の小女人 submitted on 2019-11-26 17:07:49
Question: I have a DataFrame with some categorical string values (e.g. uuid|url|browser). I would like to convert them to doubles in order to run an ML algorithm that accepts a double matrix. As the conversion method I used StringIndexer (Spark 1.4), which maps my string values to double values, so I defined a function like this:

def str(arg: String, df: DataFrame): DataFrame = {
  val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg + "_index")
  val newDF = indexer.fit(df).transform(df)
  newDF
}

Now the issue …
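
A minimal sketch of indexing several string columns in one pass with a Pipeline, assuming df is the question's DataFrame and "uuid", "url", "browser" are its categorical columns:

```scala
// Index each categorical column, producing a "<col>_index" double column.
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.StringIndexer

val categoricalCols = Seq("uuid", "url", "browser")
val indexers = categoricalCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "_index")
}.toArray[PipelineStage]

val indexed = new Pipeline().setStages(indexers).fit(df).transform(df)
```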

How to train an ML model in sparklyr and predict new values on another dataframe?

萝らか妹 submitted on 2019-11-26 16:47:45
Question: Consider the following example:

dtrain <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "Chinese Macao",
                              "Tokyo Japan Chinese"),
                     doc_id = 1:4,
                     class = c(1, 1, 1, 0))
dtrain_spark <- copy_to(sc, dtrain, overwrite = TRUE)

> dtrain_spark
# Source:   table<dtrain> [?? x 3]
# Database: spark_connection
  text                     doc_id class
  <chr>                     <int> <dbl>
1 Chinese Beijing Chinese       1     1
2 Chinese Chinese Shanghai      2     1
3 Chinese Macao                 3     1
4 Tokyo Japan Chinese           4     0

Here I have the classic Naive …
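
For reference, the same fit-then-transform flow in Spark's native Scala API, which sparklyr wraps (the stages and the dtest name are illustrative, not the asker's code):

```scala
// Fit a text-classification pipeline on the training data, then score a
// *different* DataFrame by calling transform() on the fitted model.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}

val tokenizer  = new Tokenizer().setInputCol("text").setOutputCol("words")
val vectorizer = new CountVectorizer().setInputCol("words").setOutputCol("features")
val nb         = new NaiveBayes().setLabelCol("class").setFeaturesCol("features")

val model  = new Pipeline().setStages(Array(tokenizer, vectorizer, nb)).fit(dtrain)
val scored = model.transform(dtest)   // dtest: new data with a "text" column
```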

How to map features from the output of a VectorAssembler back to the column names in Spark ML?

删除回忆录丶 submitted on 2019-11-26 15:49:29
Question: I'm trying to run a linear regression in PySpark, and I want to create a table containing summary statistics such as coefficients, p-values, and t-values for each column in my dataset. However, in order to train a linear regression model I had to create a feature vector using Spark's VectorAssembler, and now for each row I have a single feature vector and the target column. When I try to access Spark's built-in regression summary statistics, they give me a very raw list of numbers for each of …
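
A sketch of pairing the assembler's input columns with the fitted coefficients (shown in Scala; the PySpark API mirrors it), assuming assembler and lrModel are the question's fitted VectorAssembler and linear-regression model:

```scala
// The i-th coefficient corresponds to the i-th assembler input column.
val featureNames = assembler.getInputCols        // Array[String]
val coefficients = lrModel.coefficients.toArray  // Array[Double]

featureNames.zip(coefficients).foreach { case (name, w) =>
  println(f"$name%-20s $w%.4f")
}
// lrModel.summary.pValues and .tValues line up the same way, with one
// trailing entry for the intercept when one is fitted.
```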

How to handle categorical features with spark-ml?

删除回忆录丶 submitted on 2019-11-26 14:27:08
How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier, LogisticRegression, have a featuresCol argument, which specifies the name of the column of features in the DataFrame, and a labelCol argument, which specifies the name of the column of labeled classes in the DataFrame. Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features in a single vector under featuresCol. However, the VectorAssembler only accepts …
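
The standard approach is to index and one-hot encode categorical columns before assembling. A sketch under the Spark 3.x API (on Spark 2.3–2.4 the encoder stage is named OneHotEncoderEstimator; the column names are placeholders):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// String category -> numeric index -> one-hot vector -> assembled features.
val indexer = new StringIndexer()
  .setInputCol("category").setOutputCol("categoryIdx")
val encoder = new OneHotEncoder()
  .setInputCols(Array("categoryIdx")).setOutputCols(Array("categoryVec"))
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryVec", "someNumericCol"))
  .setOutputCol("features")

val prepared = new Pipeline()
  .setStages(Array(indexer, encoder, assembler))
  .fit(df).transform(df)
```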

How to save models from ML Pipeline to S3 or HDFS?

夙愿已清 submitted on 2019-11-26 14:24:18
Question: I am trying to save thousands of models produced by an ML Pipeline. As indicated in the answer here, the models can be saved as follows:

import java.io._

def saveModel(name: String, model: PipelineModel) = {
  val oos = new ObjectOutputStream(new FileOutputStream(s"/some/path/$name"))
  oos.writeObject(model)
  oos.close
}

schools.zip(bySchoolArrayModels).foreach { case (name, model) =>
  saveModel(name, model)
}

I have tried using s3://some/path/$name and /user/hadoop/some/path/$name as I would like the …
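
Note that java.io streams only write to the local filesystem. Since Spark 1.6, ML models carry a native writer that understands Hadoop-compatible URIs; a sketch with placeholder paths:

```scala
import org.apache.spark.ml.PipelineModel

// Persist to HDFS or S3 (the s3a filesystem must be configured).
model.write.overwrite().save("hdfs:///user/hadoop/models/someSchool")
model.write.overwrite().save("s3a://some-bucket/models/someSchool")

// Load it back later.
val restored = PipelineModel.load("hdfs:///user/hadoop/models/someSchool")
```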

Spark ML VectorAssembler returns strange output

て烟熏妆下的殇ゞ submitted on 2019-11-26 11:34:34
Question: I am experiencing very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this. My scenario is pretty straightforward. I parse data from a CSV file where I have some standard Int and Double fields, and I also calculate some extra columns. My parsing function returns this:

val joinedCounts = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))

My main function uses the parsing function like this:

val …
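
For context: VectorAssembler stores each row in whichever representation is smaller, so its output mixes dense vectors like [1.0,0.0,3.0] with sparse ones like (3,[0,2],[1.0,3.0]), which often reads as "strange output". A sketch of normalizing everything to dense, assuming "features" is the assembler's output column (on Spark 1.x, substitute org.apache.spark.mllib.linalg.Vector):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Force every row's vector into the dense representation.
val toDense  = udf { v: Vector => v.toDense }
val uniform = assembled.withColumn("features", toDense(col("features")))
```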

ALS model - how to generate full_u * v^t * v?

半世苍凉 submitted on 2019-11-26 11:29:26
Question: I'm trying to figure out how an ALS model can predict values for new users in between it being updated by a batch process. In my search, I came across this Stack Overflow answer. I've copied the answer below for the reader's convenience:

You can get predictions for new users using the trained model (without updating it): to get predictions for a user in the model, you use its latent representation (vector u of size f, the number of factors), which is multiplied by the product latent factor …
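
One reading of the title's product, sketched with Breeze (the sizes, values, and the names fullU and V are illustrative; in practice the item factors come from the trained model, e.g. model.productFeatures in spark.mllib):

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

val nItems = 100
val f = 10
// Stand-in for the item-factor matrix (nItems x f) of a trained ALS model.
val V = DenseMatrix.rand[Double](nItems, f)

// The new user's ratings over all items, zeros where unrated.
val fullU = DenseVector.zeros[Double](nItems)
fullU(0) = 5.0
fullU(42) = 3.0

// Project the ratings into factor space (V.t * fullU approximates the new
// user's latent vector), then back out to a predicted score per item.
val scores: DenseVector[Double] = V * (V.t * fullU)
```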

Spark Scala: How to convert DataFrame[vector] to DataFrame[f1: Double, …, fn: Double]

江枫思渺然 submitted on 2019-11-26 11:25:57
I just used StandardScaler to normalize my features for an ML application. After selecting the scaled features, I want to convert this back to a DataFrame of Doubles, though the lengths of my vectors are arbitrary. I know how to do it for a specific 3 features by using

myDF.map { case Row(v: Vector) => (v(0), v(1), v(2)) }.toDF("f1", "f2", "f3")

but not for an arbitrary number of features. Is there an easy way to do this? Example:

val testDF = sc.parallelize(List(
  Vectors.dense(5D, 6D, 7D),
  Vectors.dense(8D, 9D, 10D),
  Vectors.dense(11D, 12D, 13D)
)).map(Tuple1(_)).toDF("scaledFeatures")
val …
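
A sketch of splitting a vector column into one Double column per element when the width is only known at runtime (written against the spark.ml Vector type; on Spark 1.x substitute org.apache.spark.mllib.linalg.Vector):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, lit, udf}

// Extract element i from the vector column via a UDF.
val elem = udf { (v: Vector, i: Int) => v(i) }

// Peek at one row to learn the vector width, then add f1..fn columns.
val n = testDF.select("scaledFeatures").head.getAs[Vector](0).size
val flattened = (0 until n).foldLeft(testDF) { (df, i) =>
  df.withColumn(s"f${i + 1}", elem(col("scaledFeatures"), lit(i)))
}
```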

Encode and assemble multiple features in PySpark

北战南征 submitted on 2019-11-26 11:17:10
I have a Python class that I'm using to load and process some data in Spark. Among the various things I need to do, I'm generating a list of dummy variables derived from various columns in a Spark DataFrame. My problem is that I'm not sure how to properly define a user-defined function to accomplish what I need. I do currently have a method that, when mapped over the underlying DataFrame RDD, solves half the problem (remember that this is a method in a larger data_processor class):

def build_feature_arr(self, table):
    # this dict has keys for all the columns for which I need dummy coding
    categories …
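
Rather than hand-rolling dummy coding in a UDF, spark.ml's built-in stages can be generated per column. A sketch (in Scala; the PySpark classes have the same names), where the column list stands in for the question's dict keys:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// One indexer + encoder pair per categorical column.
val catCols = Seq("browser", "country")
val stages: Seq[PipelineStage] = catCols.flatMap { c =>
  Seq(
    new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"),
    new OneHotEncoder().setInputCols(Array(s"${c}_idx")).setOutputCols(Array(s"${c}_vec"))
  )
}
val assembler = new VectorAssembler()
  .setInputCols(catCols.map(_ + "_vec").toArray)
  .setOutputCol("features")

val features = new Pipeline()
  .setStages((stages :+ assembler).toArray)
  .fit(df).transform(df)
```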

How to create a custom Estimator in PySpark

懵懂的女人 submitted on 2019-11-26 09:38:06
Question: I am trying to build a simple custom Estimator in PySpark MLlib. I have read here that it is possible to write a custom Transformer, but I am not sure how to do it for an Estimator. I also don't understand what @keyword_only does and why I need so many setters and getters. Scikit-learn seems to have a proper document for custom models (see here) but PySpark doesn't. Pseudo-code of an example model:

class NormalDeviation():
    def __init__(self, threshold = 3):
    def fit(x, y=None):
        self.model = { …
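
A minimal sketch of the Estimator/Model split (shown in Scala; in PySpark one subclasses pyspark.ml.Estimator and pyspark.ml.Model analogously). It mirrors the NormalDeviation idea, hard-codes the column name "x", and skips the Param plumbing and persistence a production version would need:

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{abs, col, mean, stddev}
import org.apache.spark.sql.types.{BooleanType, StructType}

// The fitted side: holds the learned mean/stddev and flags outliers.
class NormalDeviationModel(override val uid: String,
                           mu: Double, sigma: Double, threshold: Double)
    extends Model[NormalDeviationModel] {
  def this(mu: Double, sigma: Double, t: Double) =
    this(Identifiable.randomUID("normalDevModel"), mu, sigma, t)
  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("is_outlier", abs(col("x") - mu) > sigma * threshold)
  override def transformSchema(schema: StructType): StructType =
    schema.add("is_outlier", BooleanType)
  override def copy(extra: ParamMap): NormalDeviationModel =
    new NormalDeviationModel(uid, mu, sigma, threshold)
}

// The estimator side: fit() learns the statistics and returns the model.
class NormalDeviation(override val uid: String, threshold: Double)
    extends Estimator[NormalDeviationModel] {
  def this(threshold: Double) = this(Identifiable.randomUID("normalDev"), threshold)
  override def fit(ds: Dataset[_]): NormalDeviationModel = {
    val row = ds.select(mean(col("x")), stddev(col("x"))).head
    new NormalDeviationModel(row.getDouble(0), row.getDouble(1), threshold)
      .setParent(this)
  }
  override def transformSchema(schema: StructType): StructType =
    schema.add("is_outlier", BooleanType)
  override def copy(extra: ParamMap): NormalDeviation =
    new NormalDeviation(uid, threshold)
}
```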