apache-spark-mllib

Get Column Names after columnSimilarities() in Spark Scala

Submitted by 99封情书 on 2019-12-07 22:06:52
Question: I'm trying to build an item-based collaborative filtering model with columnSimilarities() in Spark. After using columnSimilarities() I want to assign the original column names back to the results in Spark Scala. Runnable code to calculate columnSimilarities() on a data frame: Data: // rdd val rowsRdd: RDD[Row] = sc.parallelize( Seq( Row(2.0, 7.0, 1.0), Row(3.5, 2.5, 0.0), Row(7.0, 5.9, 0.0) ) ) // Schema val schema = new StructType() .add(StructField("item_1", DoubleType, true)) .add
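
A minimal Scala sketch of one way to attach the names, assuming an existing SparkContext sc and the three columns item_1, item_2, item_3 from the question's schema: build a RowMatrix, run columnSimilarities(), and map the MatrixEntry indices back through an array of column names.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Column names in the same order as the vector positions (taken from the question's schema)
    val colNames = Array("item_1", "item_2", "item_3")
    val rows = sc.parallelize(Seq(
      Vectors.dense(2.0, 7.0, 1.0),
      Vectors.dense(3.5, 2.5, 0.0),
      Vectors.dense(7.0, 5.9, 0.0)
    ))
    val sims = new RowMatrix(rows).columnSimilarities()   // CoordinateMatrix of upper-triangular entries
    // Each MatrixEntry carries the column indices i and j; look the names up by index
    val named = sims.entries.map(e => (colNames(e.i.toInt), colNames(e.j.toInt), e.value))
    named.collect().foreach(println)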

Convert JavaPairRDD to JavaRDD

Submitted by 核能气质少年 on 2019-12-07 19:59:46
Question: I am fetching data from Elasticsearch using the Elasticsearch-Hadoop library. JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(sc); Now I have a JavaPairRDD. I want to use Random Forest from MLlib on this RDD, so I convert it with JavaPairRDD.toRDD(esRDD), which gives me an RDD, and from that RDD I convert back to a JavaRDD: JavaRDD<LabeledPoint>[] splits = (JavaRDD.fromRDD(JavaPairRDD.toRDD(esRDD), esRDD.classTag())).randomSplit(new double[] { 0.5, 0.5 }); JavaRDD<LabeledPoint>
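
The excerpt is cut off, but a sketch of the same idea in Scala (chosen for consistency with the other examples on this page) may help: pull the label and feature fields out of each Elasticsearch document map, build LabeledPoint objects, then split. The field names "label", "f1" and "f2" are hypothetical, since the excerpt does not show the document layout.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // esRDD is assumed to be an RDD[(String, Map[String, AnyRef])], i.e. (docId, source) pairs
    val points = esRDD.map { case (_, doc) =>
      val label = doc("label").toString.toDouble            // hypothetical field name
      val features = Vectors.dense(
        doc("f1").toString.toDouble,                        // hypothetical feature fields
        doc("f2").toString.toDouble)
      LabeledPoint(label, features)
    }
    val Array(training, test) = points.randomSplit(Array(0.5, 0.5))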

Running pyspark.mllib on Ubuntu

Submitted by 纵饮孤独 on 2019-12-07 18:38:20
Question: I'm trying to link Spark in Python. The code below is test.py, which I put under ~/spark/python: from pyspark import SparkContext, SparkConf from pyspark.mllib.fpm import FPGrowth conf = SparkConf().setAppName(appName).setMaster(master) sc = SparkContext(conf=conf) data = sc.textFile("data/mllib/sample_fpgrowth.txt") transactions = data.map(lambda line: line.strip().split(' ')) model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10) result = model.freqItemsets().collect() for
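
For reference, the equivalent Scala version of that FPGrowth example (mirroring the Spark MLlib documentation, and assuming an existing SparkContext sc); it does not address the linking problem itself, only the API the script exercises.

    import org.apache.spark.mllib.fpm.FPGrowth

    val data = sc.textFile("data/mllib/sample_fpgrowth.txt")
    val transactions = data.map(_.trim.split(' '))
    val model = new FPGrowth()
      .setMinSupport(0.2)
      .setNumPartitions(10)
      .run(transactions)
    model.freqItemsets.collect().foreach { itemset =>
      println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
    }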

How to calculate p-values in Spark's Logistic Regression?

Submitted by 梦想与她 on 2019-12-07 15:03:58
Question: We are using LogisticRegressionWithSGD and would like to figure out which of our variables are predictive and with what significance. Some stats packages (StatsModels) return p-values for each term; a low p-value (< 0.05) indicates a meaningful addition to the model. How can we get or calculate p-values from a LogisticRegressionWithSGD model? Any help with this is appreciated. Answer 1: This is a very old question, but some guidance for people coming to it late might be valuable. LogisticRegressionWithSGD is
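
The quoted answer is truncated. Independently of it, one route that is sometimes suggested (an assumption here, not the answer's content) is spark.ml's GeneralizedLinearRegression with a binomial family, whose training summary exposes standard errors and p-values. A minimal sketch, assuming a DataFrame named training with label and features columns:

    import org.apache.spark.ml.regression.GeneralizedLinearRegression

    val glr = new GeneralizedLinearRegression()
      .setFamily("binomial")
      .setLink("logit")
    val model = glr.fit(training)
    val summary = model.summary
    // One value per coefficient; when an intercept is fit, its statistics come last
    println(summary.coefficientStandardErrors.mkString(", "))
    println(summary.pValues.mkString(", "))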

Computing Pointwise Mutual Information in Spark

Submitted by 我与影子孤独终老i on 2019-12-07 08:22:14
Question: I'm trying to compute pointwise mutual information (PMI). I have two RDDs, as defined here, for p(x, y) and p(x) respectively: pii: RDD[((String, String), Double)] pi: RDD[(String, Double)] Any code I'm writing to compute PMI from the RDDs pii and pi is not pretty. My approach is first to flatten the RDD pii and join with pi twice while massaging the tuple elements. val pmi = pii.map(x => (x._1._1, (x._1._2, x._1, x._2))) .join(pi).values .map(x => (x._1._1, (x._1._2, x._1._3, x._2))) .join(pi)
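
One possible shape for that computation, sketched against the two RDDs defined above, using PMI(x, y) = log( p(x, y) / (p(x) p(y)) ): key by x to pick up p(x), re-key by y to pick up p(y), then take the log of the ratio.

    // pii: RDD[((String, String), Double)]  -- p(x, y)
    // pi:  RDD[(String, Double)]            -- p(x)
    val pmi = pii
      .map { case ((x, y), pxy) => (x, (y, pxy)) }
      .join(pi)                                             // attach p(x)
      .map { case (x, ((y, pxy), px)) => (y, (x, pxy, px)) }
      .join(pi)                                             // attach p(y)
      .map { case (y, ((x, pxy, px), py)) => ((x, y), math.log(pxy / (px * py))) }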

How to add an incremental column ID for a table in Spark SQL

Submitted by 混江龙づ霸主 on 2019-12-07 04:58:39
Question: I'm working on a Spark MLlib algorithm. The dataset I have is in this form: Company":"XXXX","CurrentTitle":"XYZ","Edu_Title":"ABC","Exp_mnth": (there are more values similar to these). I'm trying to encode the string values as numeric values, so I tried using zipWithUniqueId to get a unique value for each of the string values. For some reason I'm not able to save the modified dataset to disk. Can I do this in any way using Spark SQL? Or what would be a better approach for this? Answer 1: Scala val
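
The quoted answer is cut off. Independently of it, one way to encode a string column as numeric indices without leaving the DataFrame API is spark.ml's StringIndexer; the result can then be written to disk directly. A sketch, assuming a DataFrame df containing the CurrentTitle column from the question (the output path is hypothetical):

    import org.apache.spark.ml.feature.StringIndexer

    val indexer = new StringIndexer()
      .setInputCol("CurrentTitle")                 // column name taken from the question
      .setOutputCol("CurrentTitle_idx")
    val indexed = indexer.fit(df).transform(df)
    indexed.write.mode("overwrite").parquet("/tmp/indexed")   // hypothetical output path
    // For a plain incremental row id, org.apache.spark.sql.functions.monotonically_increasing_id()
    // gives unique, monotonically increasing (but not consecutive) values.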

How to convert from org.apache.spark.mllib.linalg.VectorUDT to ml.linalg.VectorUDT

Submitted by 风格不统一 on 2019-12-07 03:13:32
Question: I am using a Spark 2.0 cluster and I would like to convert a vector from org.apache.spark.mllib.linalg.VectorUDT to org.apache.spark.ml.linalg.VectorUDT. # Import LinearRegression class from pyspark.ml.regression import LinearRegression # Define LinearRegression algorithm lr = LinearRegression() modelA = lr.fit(data, {lr.regParam:0.0}) Error: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg
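
One known conversion path is MLUtils.convertVectorColumnsToML, which rewrites old mllib.linalg vector columns into ml.linalg vectors. A minimal Scala sketch (kept in Scala for consistency with the other examples here), assuming the vector lives in a column named features as the error message suggests:

    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.mllib.util.MLUtils

    // Convert only the named column; data is the original DataFrame from the question
    val converted = MLUtils.convertVectorColumnsToML(data, "features")
    val lr = new LinearRegression().setRegParam(0.0)
    val modelA = lr.fit(converted)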

How to keep record information when working in MLlib

Submitted by 爷，独闯天下 on 2019-12-06 13:29:30
Question: I'm working on a classification problem in which I have to use the MLlib library. The classification algorithms in MLlib (say, logistic regression) require an RDD[LabeledPoint]. A LabeledPoint has only two fields, a label and a feature vector. When doing the scoring (applying my trained model to the test set), my test instances have a few other fields that I'd like to keep. For example, a test instance looks like this: <id, field1, field2, label, features>. When I create an RDD of
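
A common pattern (a sketch, not the excerpt's answer) is to keep the extra fields alongside the LabeledPoint in a pair RDD, train on the values alone, and carry the key through scoring. This assumes a trained mllib model exposing predict(Vector) and a test RDD keyed by the extra fields:

    // testWithMeta: RDD[((String, String, String), LabeledPoint)]
    //   key = (id, field1, field2) from the question's <id, field1, field2, label, features>
    val scored = testWithMeta.map { case ((id, f1, f2), lp) =>
      (id, f1, f2, lp.label, model.predict(lp.features))   // model: e.g. a LogisticRegressionModel
    }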

Understanding Spark MLlib LDA input format

Submitted by 会有一股神秘感。 on 2019-12-06 13:21:32
Question: I am trying to implement LDA using Spark MLlib, but I am having difficulty understanding the input format. I was able to run its sample implementation, which takes input from a file containing only numbers, as shown:

    1 2 6 0 2 3 1 1 0 0 3
    1 3 0 1 3 0 0 2 0 0 1
    1 4 1 0 0 4 9 0 1 2 0
    2 1 0 3 0 0 5 0 2 3 9
    3 1 1 9 3 0 2 0 0 1 3
    4 2 0 3 4 5 1 1 1 4 0
    2 1 0 3 0 0 5 0 2 2 9
    1 1 1 9 2 1 2 0 0 1 3
    4 4 0 3 4 2 1 3 0 0 0
    2 8 2 0 3 0 2 0 2 7 2
    1 1 1 9 0 2 2 0 0 3 3
    4 1 0 0 4 5 1 3 0 1 0

I followed http:/
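
Each line of that file is one document and each column is the count of one vocabulary term, so the input is a document-term count matrix. A sketch of how it is typically loaded for mllib's LDA (mirroring the Spark documentation example, and assuming an existing SparkContext sc):

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("data/mllib/sample_lda_data.txt")
    // One dense count vector per document (i.e. per line of the file)
    val parsed = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
    // LDA expects an RDD of (documentId, termCountVector) pairs
    val corpus = parsed.zipWithIndex.map(_.swap).cache()
    val ldaModel = new LDA().setK(3).run(corpus)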

After installing sparknlp, cannot import sparknlp

Submitted by 纵然是瞬间 on 2019-12-06 13:02:26
The following ran successfully on a Cloudera CDSW cluster gateway.

    import pyspark
    from pyspark.sql import SparkSession
    spark = (SparkSession
        .builder
        .config("spark.jars.packages","JohnSnowLabs:spark-nlp:1.2.3")
        .getOrCreate()
    )

Which produces this output:

    Ivy Default Cache set to: /home/cdsw/.ivy2/cache
    The jars for the packages stored in: /home/cdsw/.ivy2/jars
    :: loading settings :: url = jar:file:/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    JohnSnowLabs#spark-nlp added as a dependency
    ::