apache-spark-mllib

Get Column Names after columnSimilarities() in Spark Scala

Submitted by 99封情书 on 2019-12-07 22:06:52
Question: I'm trying to build an item-based collaborative filtering model with columnSimilarities() in Spark. After using columnSimilarities() I want to assign the original column names back to the results in Spark Scala. Runnable code to calculate columnSimilarities() on a data frame: Data: // rdd val rowsRdd: RDD[Row] = sc.parallelize( Seq( Row(2.0, 7.0, 1.0), Row(3.5, 2.5, 0.0), Row(7.0, 5.9, 0.0) ) ) // Schema val schema = new StructType() .add(StructField("item_1", DoubleType, true)) .add
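
A minimal Scala sketch of one way to attach the names, assuming an existing SparkContext sc and the three columns item_1, item_2, item_3 from the question's schema: build a RowMatrix, run columnSimilarities(), and map the MatrixEntry indices back through an array of column names.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Column names in the same order as the vector positions (taken from the question's schema)
    val colNames = Array("item_1", "item_2", "item_3")
    val rows = sc.parallelize(Seq(
      Vectors.dense(2.0, 7.0, 1.0),
      Vectors.dense(3.5, 2.5, 0.0),
      Vectors.dense(7.0, 5.9, 0.0)
    ))
    val sims = new RowMatrix(rows).columnSimilarities()   // CoordinateMatrix of upper-triangular entries
    // Each MatrixEntry carries the column indices i and j; look the names up by index
    val named = sims.entries.map(e => (colNames(e.i.toInt), colNames(e.j.toInt), e.value))
    named.collect().foreach(println)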

Convert JavaPairRDD to JavaRDD

Submitted by 核能气质少年 on 2019-12-07 19:59:46
Question: I am fetching data from Elasticsearch using the Elasticsearch-Hadoop library. JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(sc); Now I have a JavaPairRDD. I want to use Random Forest from MLlib on this RDD, so I convert it with JavaPairRDD.toRDD(esRDD), which gives me an RDD, and from that RDD I convert back to a JavaRDD: JavaRDD<LabeledPoint>[] splits = (JavaRDD.fromRDD(JavaPairRDD.toRDD(esRDD), esRDD.classTag())).randomSplit(new double[] { 0.5, 0.5 }); JavaRDD<LabeledPoint>
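
The excerpt is cut off, but a sketch of the same idea in Scala (chosen for consistency with the other examples on this page) may help: pull the label and feature fields out of each Elasticsearch document map, build LabeledPoint objects, then split. The field names "label", "f1" and "f2" are hypothetical, since the excerpt does not show the document layout.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // esRDD is assumed to be an RDD[(String, Map[String, AnyRef])], i.e. (docId, source) pairs
    val points = esRDD.map { case (_, doc) =>
      val label = doc("label").toString.toDouble            // hypothetical field name
      val features = Vectors.dense(
        doc("f1").toString.toDouble,                        // hypothetical feature fields
        doc("f2").toString.toDouble)
      LabeledPoint(label, features)
    }
    val Array(training, test) = points.randomSplit(Array(0.5, 0.5))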

Running pyspark.mllib on Ubuntu

Submitted by 纵饮孤独 on 2019-12-07 18:38:20
Question: I'm trying to link Spark in Python. The code below is test.py, which I put under ~/spark/python: from pyspark import SparkContext, SparkConf from pyspark.mllib.fpm import FPGrowth conf = SparkConf().setAppName(appName).setMaster(master) sc = SparkContext(conf=conf) data = sc.textFile("data/mllib/sample_fpgrowth.txt") transactions = data.map(lambda line: line.strip().split(' ')) model = FPGrowth.train(transactions, minSupport=0.2, numPartitions=10) result = model.freqItemsets().collect() for
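
For reference, the equivalent Scala version of that FPGrowth example (mirroring the Spark MLlib documentation, and assuming an existing SparkContext sc); it does not address the linking problem itself, only the API the script exercises.

    import org.apache.spark.mllib.fpm.FPGrowth

    val data = sc.textFile("data/mllib/sample_fpgrowth.txt")
    val transactions = data.map(_.trim.split(' '))
    val model = new FPGrowth()
      .setMinSupport(0.2)
      .setNumPartitions(10)
      .run(transactions)
    model.freqItemsets.collect().foreach { itemset =>
      println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
    }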

How to calculate p-values in Spark's Logistic Regression?

Submitted by 梦想与她 on 2019-12-07 15:03:58
Question: We are using LogisticRegressionWithSGD and would like to figure out which of our variables are predictive and with what significance. Some stats packages (StatsModels) return p-values for each term; a low p-value (< 0.05) indicates a meaningful addition to the model. How can we get or calculate p-values from a LogisticRegressionWithSGD model? Any help with this is appreciated. Answer 1: This is a very old question, but some guidance for people coming to it late might be valuable. LogisticRegressionWithSGD is
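
The quoted answer is truncated. Independently of it, one route that is sometimes suggested (an assumption here, not the answer's content) is spark.ml's GeneralizedLinearRegression with a binomial family, whose training summary exposes standard errors and p-values. A minimal sketch, assuming a DataFrame named training with label and features columns:

    import org.apache.spark.ml.regression.GeneralizedLinearRegression

    val glr = new GeneralizedLinearRegression()
      .setFamily("binomial")
      .setLink("logit")
    val model = glr.fit(training)
    val summary = model.summary
    // One value per coefficient; when an intercept is fit, its statistics come last
    println(summary.coefficientStandardErrors.mkString(", "))
    println(summary.pValues.mkString(", "))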

Computing Pointwise Mutual Information in Spark

Submitted by 我与影子孤独终老i on 2019-12-07 08:22:14
Question: I'm trying to compute pointwise mutual information (PMI). I have two RDDs, as defined here, for p(x, y) and p(x) respectively: pii: RDD[((String, String), Double)] pi: RDD[(String, Double)] Any code I'm writing to compute PMI from the RDDs pii and pi is not pretty. My approach is first to flatten the RDD pii and join with pi twice while massaging the tuple elements. val pmi = pii.map(x => (x._1._1, (x._1._2, x._1, x._2))) .join(pi).values .map(x => (x._1._1, (x._1._2, x._1._3, x._2))) .join(pi)
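
One possible shape for that computation, sketched against the two RDDs defined above, using PMI(x, y) = log( p(x, y) / (p(x) p(y)) ): key by x to pick up p(x), re-key by y to pick up p(y), then take the log of the ratio.

    // pii: RDD[((String, String), Double)]  -- p(x, y)
    // pi:  RDD[(String, Double)]            -- p(x)
    val pmi = pii
      .map { case ((x, y), pxy) => (x, (y, pxy)) }
      .join(pi)                                             // attach p(x)
      .map { case (x, ((y, pxy), px)) => (y, (x, pxy, px)) }
      .join(pi)                                             // attach p(y)
      .map { case (y, ((x, pxy, px), py)) => ((x, y), math.log(pxy / (px * py))) }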

How to add an incremental column ID for a table in Spark SQL

Submitted by 混江龙づ霸主 on 2019-12-07 04:58:39
Question: I'm working on a Spark MLlib algorithm. The dataset I have is in this form: Company":"XXXX","CurrentTitle":"XYZ","Edu_Title":"ABC","Exp_mnth": (there are more values similar to these). I'm trying to encode the string values as numeric values, so I tried using zipWithUniqueId to get a unique value for each of the string values. For some reason I'm not able to save the modified dataset to disk. Can I do this in any way using Spark SQL? Or what would be a better approach for this? Answer 1: Scala val
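
The quoted answer is cut off. Independently of it, one way to encode a string column as numeric indices without leaving the DataFrame API is spark.ml's StringIndexer; the result can then be written to disk directly. A sketch, assuming a DataFrame df containing the CurrentTitle column from the question (the output path is hypothetical):

    import org.apache.spark.ml.feature.StringIndexer

    val indexer = new StringIndexer()
      .setInputCol("CurrentTitle")                 // column name taken from the question
      .setOutputCol("CurrentTitle_idx")
    val indexed = indexer.fit(df).transform(df)
    indexed.write.mode("overwrite").parquet("/tmp/indexed")   // hypothetical output path
    // For a plain incremental row id, org.apache.spark.sql.functions.monotonically_increasing_id()
    // gives unique, monotonically increasing (but not consecutive) values.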

How to convert from org.apache.spark.mllib.linalg.VectorUDT to ml.linalg.VectorUDT

Submitted by 风格不统一 on 2019-12-07 03:13:32
Question: I am using a Spark 2.0 cluster and I would like to convert a vector from org.apache.spark.mllib.linalg.VectorUDT to org.apache.spark.ml.linalg.VectorUDT. # Import LinearRegression class from pyspark.ml.regression import LinearRegression # Define LinearRegression algorithm lr = LinearRegression() modelA = lr.fit(data, {lr.regParam:0.0}) Error: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg
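
One known conversion path is MLUtils.convertVectorColumnsToML, which rewrites old mllib.linalg vector columns into ml.linalg vectors. A minimal Scala sketch (kept in Scala for consistency with the other examples here), assuming the vector lives in a column named features as the error message suggests:

    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.mllib.util.MLUtils

    // Convert only the named column; data is the original DataFrame from the question
    val converted = MLUtils.convertVectorColumnsToML(data, "features")
    val lr = new LinearRegression().setRegParam(0.0)
    val modelA = lr.fit(converted)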

How to keep record information when working in MLlib

Submitted by 爷，独闯天下 on 2019-12-06 13:29:30
Question: I'm working on a classification problem in which I have to use the MLlib library. The classification algorithms in MLlib (say, logistic regression) require an RDD[LabeledPoint]. A LabeledPoint has only two fields, a label and a feature vector. When doing the scoring (applying my trained model to the test set), my test instances have a few other fields that I'd like to keep. For example, a test instance looks like this: <id, field1, field2, label, features>. When I create an RDD of
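
A common pattern (a sketch, not the excerpt's answer) is to keep the extra fields alongside the LabeledPoint in a pair RDD, train on the values alone, and carry the key through scoring. This assumes a trained mllib model exposing predict(Vector) and a test RDD keyed by the extra fields:

    // testWithMeta: RDD[((String, String, String), LabeledPoint)]
    //   key = (id, field1, field2) from the question's <id, field1, field2, label, features>
    val scored = testWithMeta.map { case ((id, f1, f2), lp) =>
      (id, f1, f2, lp.label, model.predict(lp.features))   // model: e.g. a LogisticRegressionModel
    }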

Understanding Spark MLlib LDA input format

Submitted by 会有一股神秘感。 on 2019-12-06 13:21:32
Question: I am trying to implement LDA using Spark MLlib, but I am having difficulty understanding the input format. I was able to run its sample implementation, which takes input from a file containing only numbers, as shown:

    1 2 6 0 2 3 1 1 0 0 3
    1 3 0 1 3 0 0 2 0 0 1
    1 4 1 0 0 4 9 0 1 2 0
    2 1 0 3 0 0 5 0 2 3 9
    3 1 1 9 3 0 2 0 0 1 3
    4 2 0 3 4 5 1 1 1 4 0
    2 1 0 3 0 0 5 0 2 2 9
    1 1 1 9 2 1 2 0 0 1 3
    4 4 0 3 4 2 1 3 0 0 0
    2 8 2 0 3 0 2 0 2 7 2
    1 1 1 9 0 2 2 0 0 3 3
    4 1 0 0 4 5 1 3 0 1 0

I followed http:/
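
Each line of that file is one document and each column is the count of one vocabulary term, so the input is a document-term count matrix. A sketch of how it is typically loaded for mllib's LDA (mirroring the Spark documentation example, and assuming an existing SparkContext sc):

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("data/mllib/sample_lda_data.txt")
    // One dense count vector per document (i.e. per line of the file)
    val parsed = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble)))
    // LDA expects an RDD of (documentId, termCountVector) pairs
    val corpus = parsed.zipWithIndex.map(_.swap).cache()
    val ldaModel = new LDA().setK(3).run(corpus)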

After installing sparknlp, cannot import sparknlp

Submitted by 纵然是瞬间 on 2019-12-06 13:02:26
The following ran successfully on a Cloudera CDSW cluster gateway.

    import pyspark
    from pyspark.sql import SparkSession
    spark = (SparkSession
        .builder
        .config("spark.jars.packages","JohnSnowLabs:spark-nlp:1.2.3")
        .getOrCreate()
    )

Which produces this output:

    Ivy Default Cache set to: /home/cdsw/.ivy2/cache
    The jars for the packages stored in: /home/cdsw/.ivy2/jars
    :: loading settings :: url = jar:file:/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/lib/spark2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    JohnSnowLabs#spark-nlp added as a dependency
    ::