apache-spark-ml

Spark MLlib example, NoSuchMethodError: org.apache.spark.sql.SQLContext.createDataFrame()

若如初见. Submitted on 2020-01-16 06:50:02
Question: I'm following the documentation example Example: Estimator, Transformer, and Param, and I get this error message: 15/09/23 11:46:51 INFO BlockManagerMaster: Registered BlockManager Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror; at SimpleApp$.main(hw.scala:75). Line 75 is the call to sqlContext.createDataFrame(): import java.util.Random import org.apache.log4j.Logger import org…

Online (incremental) logistic regression in Spark [duplicate]

笑着哭i Submitted on 2020-01-15 08:16:10
Question: This question already has answers here: Whether we can update existing model in spark-ml/spark-mllib? (2 answers). Closed 11 months ago. In Spark MLlib (the RDD-based API) there is StreamingLogisticRegressionWithSGD for incremental training of a logistic regression model. However, this class has been deprecated and offers little functionality (e.g., no access to model coefficients and output probabilities). In Spark ML (the DataFrame-based API) I only find the class LogisticRegression, having only…
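
A minimal sketch of what the RDD-based API still offers, assuming an active StreamingContext, DStreams trainingStream/testStream of LabeledPoint, and a known feature count num_features (all illustrative names). Despite the deprecation, latestModel() does expose the current coefficients:

```python
# Incremental logistic regression with the deprecated RDD-based API.
from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD

model = StreamingLogisticRegressionWithSGD(stepSize=0.1, numIterations=10)
model.setInitialWeights([0.0] * num_features)  # num_features assumed known

model.trainOn(trainingStream)  # the model is updated on every micro-batch
predictions = model.predictOnValues(
    testStream.map(lambda lp: (lp.label, lp.features)))

weights = model.latestModel().weights  # coefficients are accessible here
```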

Preserve index-string correspondence spark string indexer

China☆狼群 Submitted on 2020-01-12 01:44:06
Question: Spark's StringIndexer is quite useful, but it's common to need to retrieve the correspondence between the generated index values and the original strings, and it seems like there should be a built-in way to accomplish this. I'll illustrate using this simple example from the Spark documentation: from pyspark.ml.feature import StringIndexer df = sqlContext.createDataFrame( [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")], ["id", "category"]) indexer = StringIndexer(inputCol=…
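
For reference, a short pyspark sketch continuing the question's example: the fitted StringIndexerModel keeps the mapping in its labels attribute (position i holds the original string for index i), and IndexToString can reverse the encoding as a pipeline stage:

```python
from pyspark.ml.feature import StringIndexer, IndexToString

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = indexer.fit(df)
print(model.labels)  # e.g. ['a', 'c', 'b'] -- ordered by descending frequency

indexed = model.transform(df)
converter = IndexToString(inputCol="categoryIndex",
                          outputCol="originalCategory",
                          labels=model.labels)
converter.transform(indexed).show()
```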

Join two Spark mllib pipelines together

♀尐吖头ヾ Submitted on 2020-01-11 03:31:12
Question: I have two separate DataFrames, each with several differing processing stages that I handle with mllib transformers in a pipeline. I now want to join these two pipelines together, keeping the features (columns) from each DataFrame. Scikit-learn has the FeatureUnion class for handling this, and I can't seem to find an equivalent for mllib. I can add a custom transformer stage at the end of one pipeline that takes the DataFrame produced by the other pipeline as an attribute and join…
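
There is no direct FeatureUnion equivalent in Spark ML, but one common workaround (a sketch only; pipeline_a, pipeline_b, df_a, df_b and the shared "id" key are all assumptions) is to transform each side separately, join the results, and merge the feature columns with VectorAssembler:

```python
from pyspark.ml.feature import VectorAssembler

# Run each pipeline on its own DataFrame, keeping a join key plus features.
left = pipeline_a.fit(df_a).transform(df_a).select("id", "features_a")
right = pipeline_b.fit(df_b).transform(df_b).select("id", "features_b")

# Join on the shared key, then concatenate the two feature vectors.
joined = left.join(right, on="id")
union = VectorAssembler(inputCols=["features_a", "features_b"],
                        outputCol="features").transform(joined)
```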

ALS model - predicted full_u * v^t * v ratings are very high

本小妞迷上赌 Submitted on 2020-01-10 14:15:31
Question: I'm predicting ratings in between the processes that batch-train the model, using the approach outlined here: ALS model - how to generate full_u * v^t * v? ! rm -rf ml-1m.zip ml-1m ! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip ! unzip ml-1m.zip ! mv ml-1m/ratings.dat . from pyspark.mllib.recommendation import Rating ratingsRDD = sc.textFile('ratings.dat') \ .map(lambda l: l.split("::")) \ .map(lambda p: Rating( user = int(p[0]), product = int(p[1]), rating = float(p[2…
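
A NumPy restatement of the projection may clarify why the numbers blow up (a sketch under assumptions: model is the trained MatrixFactorizationModel, product ids run contiguously from 0, and full_u is the new user's 1 x num_items ratings vector). full_u @ V @ V.T only reproduces the rating scale when the item factors are orthonormal; a least-squares projection through the pseudo-inverse is one plausible way to rescale it:

```python
import numpy as np

# Item-factor matrix, shape (num_items, rank); assumes contiguous product ids.
V = np.array(model.productFeatures()
                  .sortByKey()
                  .map(lambda kv: kv[1])
                  .collect())

raw_preds = full_u @ V @ V.T                    # can come out far too large
u_hat = full_u @ V @ np.linalg.pinv(V.T @ V)    # least-squares user factors
preds = u_hat @ V.T                             # back on the rating scale
```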

Exception on using VectorAssembler in apache spark ml

十年热恋 Submitted on 2020-01-07 08:25:26
Question: I'm trying to create a VectorAssembler to build the input for a logistic regression, using the following code: //imports import org.apache.spark.ml.feature.VectorAssembler import org.apache.spark.mllib.linalg.{Vector, Vectors, VectorUDT} val assembler = new VectorAssembler() .setInputCols(flattenedPath.columns.diff(Seq("userid", "Conversion"))) .setOutputCol("features") val output = assembler.transform(flattenedPath) println(output.select("features", "Conversion").first()) I'm…
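
The exception itself is truncated above, but two frequent causes are worth checking: in Spark 2.x the spark.ml VectorAssembler rejects vectors from the old org.apache.spark.mllib.linalg package (the code mixes the two imports), and it only accepts numeric, boolean, or vector input columns. A hedged pyspark sketch of the second fix, casting every feature column to double first (flattenedPath is assumed to be the question's DataFrame):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col

feature_cols = [c for c in flattenedPath.columns
                if c not in ("userid", "Conversion")]

# Cast anything non-numeric to double before assembling.
casted = flattenedPath.select(
    "userid", "Conversion",
    *[col(c).cast("double").alias(c) for c in feature_cols])

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
output = assembler.transform(casted)
print(output.select("features", "Conversion").first())
```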

Using Spark ML Pipelines just for Transformations

谁说我不能喝 Submitted on 2020-01-06 03:34:32
Question: I am working on a project where configurable pipelines and lineage tracking of alterations to Spark DataFrames are both essential. The endpoints of this pipeline are usually just modified DataFrames (think of it as an ETL task). What made the most sense to me was to leverage the existing Spark ML Pipeline API to track these alterations. In particular, the alterations (adding columns based on others, etc.) are implemented as custom Spark ML Transformers. However, we are now having an…
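
A minimal pyspark sketch of that pattern (simplified: a real stage would declare inputCol as a Param so it survives persistence; DoubleColumn and df are illustrative names):

```python
from pyspark.ml import Pipeline, Transformer
from pyspark.sql.functions import col

class DoubleColumn(Transformer):
    """Adds <inputCol>_doubled -- a stand-in for a real ETL alteration."""
    def __init__(self, inputCol):
        super(DoubleColumn, self).__init__()
        self.inputCol = inputCol

    def _transform(self, dataset):
        return dataset.withColumn(self.inputCol + "_doubled",
                                  col(self.inputCol) * 2)

# Transformer-only pipelines are legal; fit() just chains the stages.
pipeline = Pipeline(stages=[DoubleColumn("value")])
transformed = pipeline.fit(df).transform(df)
```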

pyspark 2.2.0 concept behind raw predictions field of logistic regression model

时间秒杀一切 Submitted on 2020-01-05 04:25:31
Question: I was trying to understand the output generated by a logistic regression model in Pyspark. Could anyone explain the concept behind the rawPrediction field calculated by a logistic regression model? Thanks. Answer 1: In older versions of the Spark javadocs (e.g. 1.5.x), there used to be the following explanation: The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger =…
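
For binary logistic regression specifically, this can be made concrete (a worked example, not the general definition): the raw prediction holds the linear margin and its negation, the probability column is the logistic of each entry, and the prediction is the index of the larger one:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

margin = 0.8                          # assumed value of coefficients.x + intercept
rawPrediction = [-margin, margin]     # what the rawPrediction column holds
probability = [sigmoid(z) for z in rawPrediction]     # entries sum to 1.0
prediction = rawPrediction.index(max(rawPrediction))  # here: class 1
```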

XGBoost Spark One Model Per Worker Integration

走远了吗. Submitted on 2020-01-05 04:08:11
Question: I'm trying to work through this notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1526931011080774/3624187670661048/latest.html, using Spark 2.4.3 and XGBoost 0.90. I keep getting the error ValueError: bad input shape () when trying to execute: features = inputTrainingDF.select("features").collect() lables = inputTrainingDF.select("label").collect() X = np.asarray(map(lambda v: v[0].toArray(), features)) Y = np…
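
One likely culprit, offered as a guess consistent with the error: in Python 3, map() returns an iterator, so np.asarray(map(...)) produces a zero-dimensional object array, which matches "bad input shape ()". Materializing the rows first should restore the expected shapes (variable names kept from the question, including its "lables" spelling):

```python
import numpy as np

X = np.asarray([row[0].toArray() for row in features])  # (n_samples, n_features)
Y = np.asarray([row[0] for row in lables])              # (n_samples,)
```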

Spark ML Pipeline api save not working

谁说我不能喝 Submitted on 2020-01-03 05:47:06
Question: In version 1.6, the Pipeline API got a new set of features for saving and loading pipeline stages. I tried to save a stage to disk after training a classifier, so I could load it again later, reuse it, and save the effort of computing the model again. For some reason, when I save the model, the directory only contains the metadata directory. When I try to load it again, I get the following exception: Exception in thread "main" java.lang.UnsupportedOperationException: empty collection at org.apache.spark.rdd…
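
For comparison, the standard persistence pattern (pyspark 2.x syntax; ML persistence matured considerably after 1.6, and this sketch assumes pipeline, training_df, and test_df already exist) saves the fitted PipelineModel, which should write both a metadata/ and a stages/ directory:

```python
from pyspark.ml import PipelineModel

model = pipeline.fit(training_df)
model.write().overwrite().save("/tmp/my_pipeline_model")

reloaded = PipelineModel.load("/tmp/my_pipeline_model")
predictions = reloaded.transform(test_df)
```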