apache-spark-ml

Spark MLlib example, NoSuchMethodError: org.apache.spark.sql.SQLContext.createDataFrame()

若如初见. Submitted on 2020-01-16 06:50:02
Question: I'm following the documentation example Example: Estimator, Transformer, and Param, and I get this error message: 15/09/23 11:46:51 INFO BlockManagerMaster: Registered BlockManager Exception in thread "main" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror; at SimpleApp$.main(hw.scala:75). Line 75 is the call to sqlContext.createDataFrame(): import java.util.Random import org.apache.log4j.Logger import org…

Online (incremental) logistic regression in Spark [duplicate]

笑着哭i Submitted on 2020-01-15 08:16:10
Question: This question already has answers here: Whether we can update existing model in spark-ml/spark-mllib? (2 answers). Closed 11 months ago. In Spark MLlib (the RDD-based API) there is StreamingLogisticRegressionWithSGD for incremental training of a logistic regression model. However, this class has been deprecated and offers little functionality (e.g., no access to model coefficients and output probabilities). In Spark ML (the DataFrame-based API) I only find the class LogisticRegression, having only…
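
A minimal sketch of what the RDD-based API still offers, assuming an active StreamingContext, DStreams trainingStream/testStream of LabeledPoint, and a known feature count num_features (all illustrative names). Despite the deprecation, latestModel() does expose the current coefficients:

```python
# Incremental logistic regression with the deprecated RDD-based API.
from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD

model = StreamingLogisticRegressionWithSGD(stepSize=0.1, numIterations=10)
model.setInitialWeights([0.0] * num_features)  # num_features assumed known

model.trainOn(trainingStream)  # the model is updated on every micro-batch
predictions = model.predictOnValues(
    testStream.map(lambda lp: (lp.label, lp.features)))

weights = model.latestModel().weights  # coefficients are accessible here
```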

Preserve index-string correspondence spark string indexer

China☆狼群 Submitted on 2020-01-12 01:44:06
Question: Spark's StringIndexer is quite useful, but it's common to need to retrieve the correspondence between the generated index values and the original strings, and it seems like there should be a built-in way to accomplish this. I'll illustrate using this simple example from the Spark documentation: from pyspark.ml.feature import StringIndexer df = sqlContext.createDataFrame( [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")], ["id", "category"]) indexer = StringIndexer(inputCol=…
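
For reference, a short pyspark sketch continuing the question's example: the fitted StringIndexerModel keeps the mapping in its labels attribute (position i holds the original string for index i), and IndexToString can reverse the encoding as a pipeline stage:

```python
from pyspark.ml.feature import StringIndexer, IndexToString

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
model = indexer.fit(df)
print(model.labels)  # e.g. ['a', 'c', 'b'] -- ordered by descending frequency

indexed = model.transform(df)
converter = IndexToString(inputCol="categoryIndex",
                          outputCol="originalCategory",
                          labels=model.labels)
converter.transform(indexed).show()
```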

Join two Spark mllib pipelines together

♀尐吖头ヾ Submitted on 2020-01-11 03:31:12
Question: I have two separate DataFrames, each with several differing processing stages that I handle with mllib transformers in a pipeline. I now want to join these two pipelines together, keeping the features (columns) from each DataFrame. Scikit-learn has the FeatureUnion class for handling this, and I can't seem to find an equivalent for mllib. I can add a custom transformer stage at the end of one pipeline that takes the DataFrame produced by the other pipeline as an attribute and join…
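
There is no direct FeatureUnion equivalent in Spark ML, but one common workaround (a sketch only; pipeline_a, pipeline_b, df_a, df_b and the shared "id" key are all assumptions) is to transform each side separately, join the results, and merge the feature columns with VectorAssembler:

```python
from pyspark.ml.feature import VectorAssembler

# Run each pipeline on its own DataFrame, keeping a join key plus features.
left = pipeline_a.fit(df_a).transform(df_a).select("id", "features_a")
right = pipeline_b.fit(df_b).transform(df_b).select("id", "features_b")

# Join on the shared key, then concatenate the two feature vectors.
joined = left.join(right, on="id")
union = VectorAssembler(inputCols=["features_a", "features_b"],
                        outputCol="features").transform(joined)
```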

ALS model - predicted full_u * v^t * v ratings are very high

本小妞迷上赌 Submitted on 2020-01-10 14:15:31
Question: I'm predicting ratings in between the processes that batch-train the model, using the approach outlined here: ALS model - how to generate full_u * v^t * v? ! rm -rf ml-1m.zip ml-1m ! wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip ! unzip ml-1m.zip ! mv ml-1m/ratings.dat . from pyspark.mllib.recommendation import Rating ratingsRDD = sc.textFile('ratings.dat') \ .map(lambda l: l.split("::")) \ .map(lambda p: Rating( user = int(p[0]), product = int(p[1]), rating = float(p[2…
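
A NumPy restatement of the projection may clarify why the numbers blow up (a sketch under assumptions: model is the trained MatrixFactorizationModel, product ids run contiguously from 0, and full_u is the new user's 1 x num_items ratings vector). full_u @ V @ V.T only reproduces the rating scale when the item factors are orthonormal; a least-squares projection through the pseudo-inverse is one plausible way to rescale it:

```python
import numpy as np

# Item-factor matrix, shape (num_items, rank); assumes contiguous product ids.
V = np.array(model.productFeatures()
                  .sortByKey()
                  .map(lambda kv: kv[1])
                  .collect())

raw_preds = full_u @ V @ V.T                    # can come out far too large
u_hat = full_u @ V @ np.linalg.pinv(V.T @ V)    # least-squares user factors
preds = u_hat @ V.T                             # back on the rating scale
```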

Exception on using VectorAssembler in apache spark ml

十年热恋 Submitted on 2020-01-07 08:25:26
Question: I'm trying to create a VectorAssembler to build the input for a logistic regression, using the following code: //imports import org.apache.spark.ml.feature.VectorAssembler import org.apache.spark.mllib.linalg.{Vector, Vectors, VectorUDT} val assembler = new VectorAssembler() .setInputCols(flattenedPath.columns.diff(Seq("userid", "Conversion"))) .setOutputCol("features") val output = assembler.transform(flattenedPath) println(output.select("features", "Conversion").first()) I'm…
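
The exception itself is truncated above, but two frequent causes are worth checking: in Spark 2.x the spark.ml VectorAssembler rejects vectors from the old org.apache.spark.mllib.linalg package (the code mixes the two imports), and it only accepts numeric, boolean, or vector input columns. A hedged pyspark sketch of the second fix, casting every feature column to double first (flattenedPath is assumed to be the question's DataFrame):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col

feature_cols = [c for c in flattenedPath.columns
                if c not in ("userid", "Conversion")]

# Cast anything non-numeric to double before assembling.
casted = flattenedPath.select(
    "userid", "Conversion",
    *[col(c).cast("double").alias(c) for c in feature_cols])

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
output = assembler.transform(casted)
print(output.select("features", "Conversion").first())
```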

Using Spark ML Pipelines just for Transformations

谁说我不能喝 Submitted on 2020-01-06 03:34:32
Question: I am working on a project where configurable pipelines and lineage tracking of alterations to Spark DataFrames are both essential. The endpoints of this pipeline are usually just modified DataFrames (think of it as an ETL task). What made the most sense to me was to leverage the existing Spark ML Pipeline API to track these alterations. In particular, the alterations (adding columns based on others, etc.) are implemented as custom Spark ML Transformers. However, we are now having an…
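
A minimal pyspark sketch of that pattern (simplified: a real stage would declare inputCol as a Param so it survives persistence; DoubleColumn and df are illustrative names):

```python
from pyspark.ml import Pipeline, Transformer
from pyspark.sql.functions import col

class DoubleColumn(Transformer):
    """Adds <inputCol>_doubled -- a stand-in for a real ETL alteration."""
    def __init__(self, inputCol):
        super(DoubleColumn, self).__init__()
        self.inputCol = inputCol

    def _transform(self, dataset):
        return dataset.withColumn(self.inputCol + "_doubled",
                                  col(self.inputCol) * 2)

# Transformer-only pipelines are legal; fit() just chains the stages.
pipeline = Pipeline(stages=[DoubleColumn("value")])
transformed = pipeline.fit(df).transform(df)
```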

pyspark 2.2.0 concept behind raw predictions field of logistic regression model

时间秒杀一切 Submitted on 2020-01-05 04:25:31
Question: I was trying to understand the output generated by a logistic regression model in Pyspark. Could anyone explain the concept behind the rawPrediction field calculated by a logistic regression model? Thanks. Answer 1: In older versions of the Spark javadocs (e.g. 1.5.x), there used to be the following explanation: The meaning of a "raw" prediction may vary between algorithms, but it intuitively gives a measure of confidence in each possible label (where larger =…
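
For binary logistic regression specifically, this can be made concrete (a worked example, not the general definition): the raw prediction holds the linear margin and its negation, the probability column is the logistic of each entry, and the prediction is the index of the larger one:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

margin = 0.8                          # assumed value of coefficients.x + intercept
rawPrediction = [-margin, margin]     # what the rawPrediction column holds
probability = [sigmoid(z) for z in rawPrediction]     # entries sum to 1.0
prediction = rawPrediction.index(max(rawPrediction))  # here: class 1
```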

XGBoost Spark One Model Per Worker Integration

走远了吗. Submitted on 2020-01-05 04:08:11
Question: I'm trying to work through this notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1526931011080774/3624187670661048/latest.html, using Spark 2.4.3 and XGBoost 0.90. I keep getting the error ValueError: bad input shape () when trying to execute: features = inputTrainingDF.select("features").collect() lables = inputTrainingDF.select("label").collect() X = np.asarray(map(lambda v: v[0].toArray(), features)) Y = np…
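
One likely culprit, offered as a guess consistent with the error: in Python 3, map() returns an iterator, so np.asarray(map(...)) produces a zero-dimensional object array, which matches "bad input shape ()". Materializing the rows first should restore the expected shapes (variable names kept from the question, including its "lables" spelling):

```python
import numpy as np

X = np.asarray([row[0].toArray() for row in features])  # (n_samples, n_features)
Y = np.asarray([row[0] for row in lables])              # (n_samples,)
```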

Spark ML Pipeline api save not working

谁说我不能喝 Submitted on 2020-01-03 05:47:06
Question: In version 1.6, the Pipeline API got a new set of features for saving and loading pipeline stages. I tried to save a stage to disk after training a classifier, so I could load it again later, reuse it, and save the effort of computing the model again. For some reason, when I save the model, the directory only contains the metadata directory. When I try to load it again, I get the following exception: Exception in thread "main" java.lang.UnsupportedOperationException: empty collection at org.apache.spark.rdd…
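
For comparison, the standard persistence pattern (pyspark 2.x syntax; ML persistence matured considerably after 1.6, and this sketch assumes pipeline, training_df, and test_df already exist) saves the fitted PipelineModel, which should write both a metadata/ and a stages/ directory:

```python
from pyspark.ml import PipelineModel

model = pipeline.fit(training_df)
model.write().overwrite().save("/tmp/my_pipeline_model")

reloaded = PipelineModel.load("/tmp/my_pipeline_model")
predictions = reloaded.transform(test_df)
```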