apache-spark-mllib

Apache Spark MLlib with DataFrame API gives java.net.URISyntaxException when createDataFrame() or read().csv(…)

守給你的承諾、 posted on 2019-12-08 08:29:28
In a standalone application (running on Java 8 and Windows 10 with spark-xxx_2.11:2.0.0 as jar dependencies) the following code throws the error:

/* this: */
Dataset<Row> logData = spark_session.createDataFrame(Arrays.asList(
    new LabeledPoint(1.0, Vectors.dense(4.9, 3, 1.4, 0.2)),
    new LabeledPoint(1.0, Vectors.dense(4.7, 3.2, 1.3, 0.2))
), LabeledPoint.class);

/* or this: */
/* logFile tried as: "C:\files\project\file.csv", "C:\\files\\project\\file.csv", "C:/files/project/file.csv", "file:/C:/files/project/file.csv", "file:///C:/files/project/file.csv", "/file.csv" */
Dataset<Row> logData = spark_session.read().csv(logFile);
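On Spark 2.0.0 under Windows, this URISyntaxException is frequently caused by the default spark.sql.warehouse.dir resolving to a malformed file URI rather than by the CSV path itself. A minimal PySpark sketch of the commonly suggested workaround, assuming local mode and an existing C:/tmp directory (the same config key can be set from the Java session builder):

from pyspark.sql import SparkSession

# Point the warehouse dir at a well-formed file:/// URI so Spark does not
# construct an invalid "file:C:/..." path on Windows.
spark = (SparkSession.builder
         .appName("windows-uri-workaround")   # illustrative app name
         .master("local[*]")                  # assumption: local mode
         .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
         .getOrCreate())

# With the warehouse dir fixed, a file:/// path like this should parse cleanly.
logData = spark.read.csv("file:///C:/files/project/file.csv")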

How to groupby and aggregate multiple fields using RDD?

我只是一个虾纸丫 posted on 2019-12-08 08:26:25
Question: I am new to Apache Spark as well as Scala, and I am currently learning this framework and programming language for big data. From a sample file, for a given field I am trying to find the total of another field, its count, and the list of values from another field. I tried it on my own, but it seems I am not taking the best approach with Spark RDDs (I am just starting out). Please find the sample data below (Customerid: Int, Orderid: Int, Amount: Float):
44,8602,37.19
35,5368,65.89
2,3391,40.64
47,6694,14.98
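The question's own code is not shown, but the aggregation it describes (per customer: total amount, order count, and the list of order IDs) maps naturally onto reduceByKey. A sketch of that pattern in PySpark, using the sample rows above (the question itself is in Scala, but the RDD API has the same shape):

from pyspark import SparkContext

sc = SparkContext("local[*]", "groupby-aggregate-sketch")  # assumption: local mode

# Sample rows in the (Customerid, Orderid, Amount) layout from the question.
raw = sc.parallelize(["44,8602,37.19", "35,5368,65.89", "2,3391,40.64", "47,6694,14.98"])

# Key by customer and carry (amount, count, [order ids]) as the value.
parsed = raw.map(lambda line: line.split(",")) \
            .map(lambda f: (int(f[0]), (float(f[2]), 1, [int(f[1])])))

# Per customer: total amount, number of orders, and the collected order ids.
aggregated = parsed.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]))

for customer, (total, count, orders) in aggregated.collect():
    print(customer, total, count, orders)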

How to print the probability of prediction in LogisticRegressionWithLBFGS for pyspark

眉间皱痕 posted on 2019-12-08 08:01:57
Question: I am using Spark 1.5.1 and, in PySpark, after I fit the model using:
model = LogisticRegressionWithLBFGS.train(parsedData)
I can print the prediction using:
model.predict(p.features)
Is there a function to also print the probability score along with the prediction?

Answer 1: You have to clear the threshold first, and this works only for binary classification:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import
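The answer above is cut off; a minimal sketch of the pattern it describes, assuming parsedData is an RDD of LabeledPoint with binary labels (after clearThreshold(), predict() returns the class-1 probability instead of a hard 0/1 label):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

# assumption: parsedData is an RDD[LabeledPoint] with labels 0.0 / 1.0
model = LogisticRegressionWithLBFGS.train(parsedData)

# Clearing the threshold makes predict() return probabilities.
model.clearThreshold()
scored = parsedData.map(lambda p: (p.label, model.predict(p.features)))
print(scored.take(5))

# Restore a decision threshold if hard 0/1 predictions are needed again.
model.setThreshold(0.5)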

Spark MLLib's LassoWithSGD doesn't scale?

…衆ロ難τιáo~ posted on 2019-12-08 07:37:28
Question: I have code similar to what follows:
val fileContent = sc.textFile("file:///myfile")
val dataset = fileContent.map(row => {
  val explodedRow = row.split(",").map(s => s.toDouble)
  new LabeledPoint(explodedRow(13), Vectors.dense(
    Array(explodedRow(10), explodedRow(11), explodedRow(12))
  ))
})
val algo = new LassoWithSGD().setIntercept(true)
val lambda = 0.0
algo.optimizer.setRegParam(lambda)
algo.optimizer.setNumIterations(100)
algo.optimizer.setStepSize(1.0)
val model = algo.run(dataset)
I'm
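For reference, the same training setup expressed through PySpark's RDD API; a sketch that assumes, as in the Scala snippet above, a comma-separated file with the label in column 13 and features in columns 10-12:

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint, LassoWithSGD

sc = SparkContext("local[*]", "lasso-sketch")  # assumption: local mode

fileContent = sc.textFile("file:///myfile")
dataset = fileContent.map(lambda row: [float(s) for s in row.split(",")]) \
                     .map(lambda r: LabeledPoint(r[13], Vectors.dense(r[10], r[11], r[12])))

# Mirrors the optimizer settings above: regParam 0.0, 100 iterations, step size 1.0.
model = LassoWithSGD.train(dataset, iterations=100, step=1.0,
                           regParam=0.0, intercept=True)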

Spark - MLlib linear regression intercept and weight NaN [duplicate]

a 夏天 posted on 2019-12-08 07:24:26
Question: This question already has answers here: Spark MlLib linear regression (Linear least squares) giving random results (2 answers). Closed 3 years ago.
I have been trying to build a regression model on Spark using some custom data, and the intercept and weights are always nan. This is my data:
data = [LabeledPoint(0.0, [27022.0]),
        LabeledPoint(1.0, [27077.0]),
        LabeledPoint(2.0, [27327.0]),
        LabeledPoint(3.0, [27127.0])]
Output: (weights=[nan], intercept=nan)
However, if I use this dataset (taken from
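NaN weights from the SGD-based regressors are commonly a symptom of unscaled features: with raw values around 27,000 and the default step size, gradient descent diverges. A hedged sketch of one mitigation, standardizing the feature before training (the exact remedy given in the linked duplicate may differ):

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# assumption: an existing SparkContext named sc
data = sc.parallelize([LabeledPoint(0.0, [27022.0]), LabeledPoint(1.0, [27077.0]),
                       LabeledPoint(2.0, [27327.0]), LabeledPoint(3.0, [27127.0])])

labels = data.map(lambda p: p.label)
features = data.map(lambda p: p.features)

# Standardize the feature so gradient descent does not diverge on values ~27,000.
scaler = StandardScaler(withMean=True, withStd=True).fit(features)
scaled = labels.zip(scaler.transform(features)) \
               .map(lambda lp: LabeledPoint(lp[0], lp[1]))

model = LinearRegressionWithSGD.train(scaled, iterations=100, step=0.1, intercept=True)
print(model.weights, model.intercept)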

How to use QuantileDiscretizer across groups in a DataFrame?

人盡茶涼 posted on 2019-12-08 05:59:46
Question: I have a DataFrame with the following columns.
scala> show_times.printSchema
root
 |-- account: string (nullable = true)
 |-- channel: string (nullable = true)
 |-- show_name: string (nullable = true)
 |-- total_time_watched: integer (nullable = true)
This is data about how much time a customer has spent watching a particular show. I'm supposed to categorize the customers for each show based on total time watched. The dataset has 133 million rows in total, with 192 distinct show_names. For each
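QuantileDiscretizer computes its buckets over an entire column rather than per group, so it does not directly give per-show categories. One commonly suggested alternative (a sketch, not necessarily the approach the original thread settles on) is to bucket within each show_name using a window function such as ntile:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# assumption: show_times is the DataFrame described above
# (account, channel, show_name, total_time_watched)
w = Window.partitionBy("show_name").orderBy("total_time_watched")

# ntile(3) splits each show's viewers into three roughly equal-sized buckets,
# e.g. light / medium / heavy watchers per show.
bucketed = show_times.withColumn("watch_bucket", F.ntile(3).over(w))
bucketed.show(5)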

How to use long user ID in PySpark ALS

血红的双手。 posted on 2019-12-08 03:02:09
Question: I am attempting to use long user/product IDs in the ALS model in PySpark MLlib (1.3.1) and have run into an issue. A simplified version of the code is given here:
from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("","test")

# Load and parse the data
d = ["3661636574,1,1", "3661636574,2,2", "3661636574,3,3"]
data = sc.parallelize(d)
ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(long(l[0]), long(l[1]), float(l[2])))

# Build
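MLlib's ALS keys users and products by 32-bit integers, so an ID such as 3661636574 overflows the allowed range. A sketch of one common workaround, remapping the long IDs to compact integers with zipWithUniqueId and keeping the mapping for later lookups (variable names here are illustrative):

from pyspark.mllib.recommendation import ALS, Rating

# assumption: 'data' is the RDD of "user,product,rating" strings from the question
parsed = data.map(lambda l: l.split(','))

# Give each distinct (long) user ID a compact surrogate ID and keep the mapping.
# collectAsMap() is fine for a modest number of distinct users.
user_ids = parsed.map(lambda l: l[0]).distinct().zipWithUniqueId()   # (orig_id, surrogate)
user_map = sc.broadcast(user_ids.collectAsMap())

ratings = parsed.map(lambda l: Rating(int(user_map.value[l[0]]),
                                      int(l[1]), float(l[2])))

model = ALS.train(ratings, rank=10, iterations=10)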

Spark Streaming - Best way to Split Input Stream based on filter Param

こ雲淡風輕ζ posted on 2019-12-08 02:38:05
Question: I am currently trying to create some kind of monitoring solution: some data is written to Kafka, and I read and process this data with Spark Streaming. To preprocess the data for machine learning and anomaly detection, I would like to split the stream based on some filter parameters. So far I have learned that DStreams themselves cannot be split into several streams. The problem I am mainly facing is that many algorithms (like KMeans) only take continuous data and not discrete data like e.g.
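While a DStream cannot be split in a single call, the usual pattern is to derive several filtered DStreams from the same parent. A minimal PySpark Streaming sketch, using a socket source and a comma-separated "metric,value" record layout purely for illustration (the question's source is Kafka):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "split-stream-sketch")  # >= 2 threads for streaming
ssc = StreamingContext(sc, batchDuration=5)

lines = ssc.socketTextStream("localhost", 9999)
records = lines.map(lambda line: line.split(","))    # e.g. "cpu,0.93" -> ["cpu", "0.93"]

# "Split" the stream by filtering the same parent DStream with different predicates.
cpu_stream = records.filter(lambda r: r[0] == "cpu")
mem_stream = records.filter(lambda r: r[0] == "mem")

cpu_stream.pprint()
mem_stream.pprint()

ssc.start()
ssc.awaitTermination()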

Cross Validation metrics with Pyspark

巧了我就是萌 posted on 2019-12-08 02:34:27
Question: When we do a k-fold cross validation we are testing how well a model behaves when it comes to predicting data it has never seen. If I split my dataset into 90% training and 10% test and analyse the model performance, there is no guarantee that my test set doesn't contain only the 10% "easiest" or "hardest" points to predict. By doing a 10-fold cross validation I can be assured that every point will be used at least once for training. As (in this case) the model will be tested 10 times we can do an
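The question is cut off above, but the averaged per-fold metrics it appears to be after are exposed by pyspark.ml's CrossValidator. A minimal sketch, assuming a DataFrame train_df with 'features' and 'label' columns and a binary label:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")

cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=10)
cv_model = cv.fit(train_df)  # assumption: train_df has 'features' and 'label'

# avgMetrics holds the evaluator metric averaged over the 10 folds,
# one entry per parameter-grid point.
print(cv_model.avgMetrics)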

How to build Spark Mllib submodule individually

南楼画角 posted on 2019-12-08 00:02:00
Question: I modified the mllib module in Spark and want to use the customized mllib jar in other projects. It works when I build Spark using:
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
as learned from Spark's documentation at http://spark.apache.org/docs/latest/building-spark.html#building-submodules-individually. But building the whole Spark package took quite long (about 7 minutes on my desktop), so I would like to build just the mllib module on its own. The instruction for building a