apache-spark-mllib

How to use the linear regression of MLlib in Apache Spark?

天大地大妈咪最大 submitted on 2019-12-06 12:24:52
Question: I'm new to Apache Spark, and in the MLlib documentation I found a Scala example, but I really don't know Scala. Does anyone know of an example in Java? Thanks! The example code is:

    import org.apache.spark.mllib.regression.LinearRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    // Load and parse the data
    val data = sc.textFile("mllib/data/ridge-data/lpsa.data")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, parts(1).split
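Not the Java version the asker is after, but for orientation here is a minimal sketch of the same LinearRegressionWithSGD workflow using the PySpark MLlib API, assuming the lpsa.data file that ships with Spark; the Java API follows the same pattern through JavaSparkContext and LabeledPoint.

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    sc = SparkContext("local", "linear-regression-sketch")

    # Each line of lpsa.data is "label,feature1 feature2 ..."
    def parse_point(line):
        label, features = line.split(',')
        return LabeledPoint(float(label), [float(x) for x in features.split(' ')])

    data = sc.textFile("data/mllib/ridge-data/lpsa.data")  # path is an assumption
    parsed = data.map(parse_point).cache()

    # Train the model and check its error on the training set
    model = LinearRegressionWithSGD.train(parsed, iterations=100, step=0.00000001)
    preds_and_labels = parsed.map(lambda p: (model.predict(p.features), p.label))
    mse = preds_and_labels.map(lambda pl: (pl[0] - pl[1]) ** 2).mean()
    print("training MSE: %s" % mse)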

Running pyspark.mllib on Ubuntu

青春壹個敷衍的年華 submitted on 2019-12-06 12:12:02
I'm trying to link Spark in Python. The code below is test.py, which I put under ~/spark/python:

    from pyspark import SparkContext, SparkConf
    from pyspark.mllib.fpm import FPGrowth

    conf = SparkConf().setAppName(appName).setMaster(master)
    sc = SparkContext(conf=conf)

    data = sc.textFile("data/mllib/sample_fpgrowth.txt")
    transactions = data.map(lambda line: line.strip().split(' '))
    model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
    result = model.freqItemsets().collect()
    for fi in result:
        print(fi)

When I run python test.py I get this error message: Exception in thread "main" java
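The excerpt cuts the stack trace off, so this is only a hedged guess, but Java-side errors from a bare python test.py usually mean the pyspark/py4j packages or the Spark jars are not on the path, and in the snippet above appName and master are also never defined. The simplest fix is usually to run spark-submit test.py; the sketch below shows the path-setup alternative, assuming Spark lives under ~/spark (the py4j zip name varies by release, and the app name/master are placeholders).

    import glob
    import os
    import sys

    # Make pyspark importable from a plain `python test.py` run (paths are assumptions)
    spark_home = os.environ.get("SPARK_HOME", os.path.expanduser("~/spark"))
    sys.path.insert(0, os.path.join(spark_home, "python"))
    sys.path.extend(glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))

    from pyspark import SparkConf, SparkContext
    from pyspark.mllib.fpm import FPGrowth

    conf = SparkConf().setAppName("fpgrowth-test").setMaster("local[2]")
    sc = SparkContext(conf=conf)

    data = sc.textFile(os.path.join(spark_home, "data/mllib/sample_fpgrowth.txt"))
    transactions = data.map(lambda line: line.strip().split(' '))
    model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10)
    for fi in model.freqItemsets().collect():
        print(fi)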

How to use CrossValidator to choose between different models

怎甘沉沦 submitted on 2019-12-06 12:01:05
I know that I can use a CrossValidator to tune a single model. But what is the suggested approach for evaluating different models against each other? For example, say that I wanted to evaluate a LogisticRegression classifier against a LinearSVC classifier using CrossValidator. After familiarizing myself a bit with the API, I solved this problem by implementing a custom Estimator that wraps two or more estimators it can delegate to, where the selected estimator is controlled by a single Param[Int]. Here is the actual code:

    import org.apache.spark.ml.Estimator
    import org.apache.spark.ml.Model
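A simpler (if less elegant) alternative to the delegating-Estimator approach is to run one CrossValidator per candidate model and compare the cross-validated metrics afterwards. A minimal PySpark sketch of that idea; the libsvm sample file, numFolds, and maxIter values are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LinearSVC, LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    spark = SparkSession.builder.appName("model-comparison-sketch").getOrCreate()
    train = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    evaluator = BinaryClassificationEvaluator()  # areaUnderROC by default
    candidates = {
        "logistic_regression": LogisticRegression(maxIter=50),
        "linear_svc": LinearSVC(maxIter=50),
    }

    results = {}
    for name, estimator in candidates.items():
        grid = ParamGridBuilder().build()  # empty grid: k-fold CV on default params
        cv = CrossValidator(estimator=estimator, estimatorParamMaps=grid,
                            evaluator=evaluator, numFolds=3)
        results[name] = max(cv.fit(train).avgMetrics)

    print(results)  # keep the candidate with the highest metric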

pyspark: CrossValidator does not work

北城余情 submitted on 2019-12-06 11:53:45
Question: I'm trying to tune the parameters of an ALS model, but it always chooses the first parameter as the best option.

    from pyspark.sql import SQLContext
    from pyspark import SparkConf, SparkContext
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import RegressionEvaluator
    from math import sqrt
    from operator import add

    conf = (SparkConf()
            .setMaster("local[4]")
            .setAppName("Myapp")
            .set("spark.executor.memory", "2g"))
    sc =
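The excerpt stops before the grid and evaluator, so this is a hedged guess rather than a diagnosis, but a very common reason CrossValidator "always picks the first parameter" with ALS is that users or items unseen in a fold's training split get NaN predictions, every RMSE then becomes NaN, and the first parameter map wins by default. Since Spark 2.2 the usual cure is coldStartStrategy="drop"; the tiny ratings DataFrame below is purely illustrative.

    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import RegressionEvaluator

    spark = SparkSession.builder.master("local[4]").appName("als-cv-sketch").getOrCreate()

    # Hypothetical ratings data: (userId, itemId, rating)
    ratings = spark.createDataFrame(
        [(0, 0, 4.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 1.0), (1, 1, 3.0), (1, 2, 4.0),
         (2, 0, 5.0), (2, 1, 2.0), (2, 2, 1.0), (3, 0, 4.0), (3, 1, 4.0), (3, 2, 2.0)],
        ["userId", "itemId", "rating"])

    # coldStartStrategy="drop" discards NaN predictions for unseen users/items;
    # without it every fold's RMSE is NaN and the first grid entry always "wins".
    als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
              coldStartStrategy="drop")

    grid = (ParamGridBuilder()
            .addGrid(als.rank, [5, 10])
            .addGrid(als.regParam, [0.01, 0.1])
            .build())

    evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                    predictionCol="prediction")
    cv = CrossValidator(estimator=als, estimatorParamMaps=grid,
                        evaluator=evaluator, numFolds=3)
    cvModel = cv.fit(ratings)
    print(list(zip(cvModel.avgMetrics, grid)))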

How to get Precision/Recall using CrossValidator when training a NaiveBayes model in Spark

回眸只為那壹抹淺笑 submitted on 2019-12-06 10:59:32
Question: Suppose I have a Pipeline like this:

    val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
    val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
    val idf = new IDF().setInputCol("features").setOutputCol("idffeatures")
    val nb = new org.apache.spark.ml.classification.NaiveBayes()
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, nb))

    val paramGrid = new ParamGridBuilder().addGrid(hashingTF.numFeatures,
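One workable pattern (a sketch, not necessarily what the asker ended up with): cross-validate the pipeline on a single metric, then score the best model on a held-out split with MulticlassClassificationEvaluator set to weightedPrecision and weightedRecall. Shown in PySpark for brevity; the tweet/label columns and the toy data are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF, IDF
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    spark = SparkSession.builder.appName("nb-cv-metrics-sketch").getOrCreate()
    df = spark.createDataFrame(
        [("spark is great", 1.0), ("mllib is useful", 1.0),
         ("i love spark streaming", 1.0), ("great mllib examples", 1.0),
         ("bad traffic today", 0.0), ("i dislike rain", 0.0),
         ("the weather is awful", 0.0), ("terrible traffic again", 0.0)],
        ["tweet", "label"])
    train, test = df.randomSplit([0.8, 0.2], seed=42)

    tokenizer = Tokenizer(inputCol="tweet", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
    idf = IDF(inputCol="features", outputCol="idffeatures")
    nb = NaiveBayes(featuresCol="idffeatures")
    pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, nb])

    grid = ParamGridBuilder().addGrid(hashingTF.numFeatures, [1000, 5000]).build()
    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                        evaluator=MulticlassClassificationEvaluator(metricName="f1"),
                        numFolds=3)
    best = cv.fit(train).bestModel

    # Precision/recall of the best model on the held-out split
    predictions = best.transform(test)
    for metric in ("weightedPrecision", "weightedRecall"):
        value = MulticlassClassificationEvaluator(metricName=metric).evaluate(predictions)
        print(metric, value)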

Preparing data for LDA in Spark

不想你离开。 submitted on 2019-12-06 10:44:15
I'm working on implementing a Spark LDA model (via the Scala API), and am having trouble with the necessary formatting steps for my data. My raw data (stored in a text file) is in the following format, essentially a list of tokens and the documents they correspond to. A simplified example:

    doc    XXXXX    term    XXXXX
    1      x        'a'     x
    1      x        'a'     x
    1      x        'b'     x
    2      x        'b'     x
    2      x        'd'     x
    ...

where the XXXXX columns are garbage data I don't care about. I realize this is an atypical way of storing corpus data, but it's what I have. As I hope is clear from the example, there's one line per token in the raw data (so if a
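For illustration, here is a PySpark sketch of one way to do the preparation (the Scala API is analogous): group the per-token lines by document id, vectorize each document's token list with CountVectorizer, and hand the counts to LDA. The file path and the column positions (0 = doc id, 2 = term) are assumptions based on the example above.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import CountVectorizer
    from pyspark.ml.clustering import LDA

    spark = SparkSession.builder.appName("lda-prep-sketch").getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("path/to/raw_tokens.txt")  # hypothetical path
    pairs = lines.map(lambda line: line.split()) \
                 .map(lambda cols: (int(cols[0]), cols[2]))  # keep doc id and term only

    # Collect every token of a document into one list (duplicates preserved,
    # so CountVectorizer produces proper term counts)
    docs = pairs.groupByKey().map(lambda kv: (kv[0], list(kv[1])))
    docs_df = spark.createDataFrame(docs, ["doc_id", "tokens"])

    cv = CountVectorizer(inputCol="tokens", outputCol="features")
    vectorized = cv.fit(docs_df).transform(docs_df)

    lda = LDA(k=10, maxIter=20, featuresCol="features")
    model = lda.fit(vectorized)
    model.describeTopics(5).show(truncate=False)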

Tagging columns as Categorical in Spark

て烟熏妆下的殇ゞ submitted on 2019-12-06 09:59:30
I am currently using StringIndexer to convert a lot of columns into unique integers for classification with RandomForestModel. I am also using a pipeline for the ML process. Some questions: How does the RandomForestModel know which columns are categorical? StringIndexer converts non-numerical values to numerical ones, but does it add metadata of some sort to indicate that a column is categorical? In mllib.tree.RF there was a parameter called categoricalFeaturesInfo which indicated which columns are categorical. How does ml.tree.RF know which ones are, since that parameter is not present? Also, StringIndexer maps categories to
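As a sketch of how the metadata flows (column names and toy data are made up): StringIndexer and VectorIndexer write nominal-attribute metadata onto the columns they produce, and the ml tree learners read the categorical information from the features column's metadata rather than from a categoricalFeaturesInfo map.

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.appName("categorical-metadata-sketch").getOrCreate()
    df = spark.createDataFrame(
        [("red", "small", 1.0, 0.0), ("blue", "large", 2.0, 1.0),
         ("red", "large", 3.0, 1.0), ("green", "small", 4.0, 0.0)],
        ["color", "size", "measure", "label"])

    color_idx = StringIndexer(inputCol="color", outputCol="color_idx")
    size_idx = StringIndexer(inputCol="size", outputCol="size_idx")
    assembler = VectorAssembler(inputCols=["color_idx", "size_idx", "measure"],
                                outputCol="raw_features")
    # Any feature with <= maxCategories distinct values is flagged as categorical
    # in the column metadata that the tree learners consult.
    indexer = VectorIndexer(inputCol="raw_features", outputCol="features",
                            maxCategories=4)
    rf = RandomForestClassifier(featuresCol="features", labelCol="label")

    pipeline = Pipeline(stages=[color_idx, size_idx, assembler, indexer, rf])
    model = pipeline.fit(df)
    model.transform(df).select("features", "prediction").show(truncate=False)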

Cross Validation metrics with Pyspark

谁说胖子不能爱 submitted on 2019-12-06 09:17:32
When we do a k-fold cross-validation we are testing how well a model behaves when predicting data it has never seen. If I split my dataset into 90% training and 10% test and analyse the model's performance, there is no guarantee that my test set doesn't contain only the 10% "easiest" or "hardest" points to predict. By doing a 10-fold cross-validation I can be sure that every point is used at least once for training (and exactly once for testing). As (in this case) the model will be tested 10 times, we can analyse those test metrics, which will give us a better understanding of how the model is
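In Spark this runs into a practical limitation: CrossValidatorModel only exposes avgMetrics, the metric averaged over folds for each parameter map, not the per-fold values. A hedged sketch of computing per-fold metrics by hand with randomSplit (the classifier, metric, and data path are placeholder choices):

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("manual-kfold-sketch").getOrCreate()
    df = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    k = 10
    folds = df.randomSplit([1.0 / k] * k, seed=42)
    evaluator = BinaryClassificationEvaluator()  # areaUnderROC by default

    fold_metrics = []
    for i in range(k):
        test = folds[i]
        train = None
        for j, part in enumerate(folds):
            if j != i:
                train = part if train is None else train.union(part)
        model = LogisticRegression(maxIter=50).fit(train)
        fold_metrics.append(evaluator.evaluate(model.transform(test)))

    mean = sum(fold_metrics) / k
    std = (sum((m - mean) ** 2 for m in fold_metrics) / k) ** 0.5
    print("per-fold AUC:", fold_metrics)
    print("mean=%.4f, std=%.4f" % (mean, std))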

How to build the Spark MLlib submodule individually

两盒软妹~` submitted on 2019-12-06 08:07:17
I modified MLlib in Spark and want to use the customized MLlib jar in other projects. It works when I build Spark using:

    build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

as learned from Spark's documentation at http://spark.apache.org/docs/latest/building-spark.html#building-submodules-individually . But building the whole Spark package takes quite a long time (about 7 minutes on my desktop), so I would like to build just the mllib module by itself. The instructions for building a submodule in Spark are at the same link, and I used:

    build/mvn -pl :spark-mllib_2.10 clean install

to

How to use long user ID in PySpark ALS

橙三吉。 submitted on 2019-12-06 07:58:34
I am attempting to use long user/product IDs in the ALS model in PySpark MLlib (1.3.1) and have run into an issue. A simplified version of the code is given here:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext("", "test")

    # Load and parse the data
    d = ["3661636574,1,1", "3661636574,2,2", "3661636574,3,3"]
    data = sc.parallelize(d)
    ratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(long(l[0]), long(l[1]), float(l[2])))

    # Build the recommendation model using Alternating Least Squares
    rank = 10
    numIterations = 20
    model = ALS
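MLlib's ALS keys users and products by 32-bit integers, so an ID like 3661636574 does not fit. One workaround, sketched below, is to re-map the long IDs to small integers with zipWithUniqueId() and keep lookup tables for translating results back; the collected dictionaries are illustrative and assume the distinct ID sets fit in driver memory (otherwise join the RDDs instead).

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext("local", "long-id-als-sketch")

    d = ["3661636574,1,1", "3661636574,2,2", "3661636574,3,3"]
    raw = sc.parallelize(d).map(lambda l: l.split(','))

    # Build (long id -> small int id) lookup tables for users and products
    user_map = dict(raw.map(lambda r: r[0]).distinct().zipWithUniqueId().collect())
    prod_map = dict(raw.map(lambda r: r[1]).distinct().zipWithUniqueId().collect())

    ratings = raw.map(lambda r: Rating(user_map[r[0]], prod_map[r[1]], float(r[2])))
    model = ALS.train(ratings, rank=10, iterations=20)

    # Predict with the remapped ids, then translate products back to the originals
    inv_prod = {v: k for k, v in prod_map.items()}
    for int_prod, orig_prod in inv_prod.items():
        score = model.predict(user_map["3661636574"], int_prod)
        print(orig_prod, score)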