pyspark

Probability of predictions using Spark LogisticRegressionWithLBFGS for multiclass classification

落爺英雄遲暮 Submitted on 2019-12-24 00:45:43
Question: I am using LogisticRegressionWithLBFGS() to train a model with multiple classes. The mllib documentation states that clearThreshold() can be used only if the classification is binary. Is there a way to do something similar for multiclass classification, in order to output the probabilities of each class for a given input to the model?

Answer 1: There are two ways to accomplish this. One is to create a method that assumes the responsibility of predictPoint in LogisticRegression.scala
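A minimal sketch of one workaround, not taken from the truncated answer above: the DataFrame-based pyspark.ml API exposes per-class probabilities directly through its probability column, so no predictPoint rewrite is needed. The tiny three-class training set below is invented for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Hypothetical three-class training data: (label, features)
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.1, 1.2)),
     (1.0, Vectors.dense(2.3, 0.4)),
     (2.0, Vectors.dense(4.5, 3.2))],
    ["label", "features"])

lr = LogisticRegression(family="multinomial", maxIter=50)
model = lr.fit(train)

# The "probability" column holds a vector with one entry per class
model.transform(train).select("label", "probability", "prediction").show(truncate=False)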

Pyspark - Ranking columns keeping ties

耗尽温柔 Submitted on 2019-12-24 00:45:19
Question: I'm looking for a way to rank columns of a dataframe while preserving ties. Specifically, for this example I have a pyspark dataframe as follows, where I want to generate ranks for colA and colB (though I want to be able to rank any number of columns):

+------+----------+----+----+
|Entity|        id|colA|colB|
+------+----------+----+----+
|     a|8589934652|  21|  50|
|     b|       112|   9|  23|
|     c|8589934629|   9|  23|
|     d|8589934702|   8|  21|
|     e|        20|   2|  21|
|     f|8589934657|   2|   5|
|     g|8589934601|   1|   5|
|     h
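A minimal sketch of one way to do this, assuming the dataframe is called df and that "preserving ties" means rows with equal values share a rank (dense_rank behaviour); the descending sort order is also an assumption:

from pyspark.sql import Window
from pyspark.sql import functions as F

def add_ranks(df, cols):
    # Add a <col>_rank column for each requested column; tied values share a rank.
    for c in cols:
        w = Window.orderBy(F.desc(c))  # no partitioning: one global ranking per column
        df = df.withColumn(c + "_rank", F.dense_rank().over(w))
    return df

ranked = add_ranks(df, ["colA", "colB"])  # works for any number of columns
ranked.show()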

How to limit FPGrowth itemsets to just 2 or 3

好久不见. Submitted on 2019-12-24 00:25:07
Question: I am running the FPGrowth algorithm using pyspark in Python 3.6 in a Jupyter notebook. When I try to save the association rules, the output of generated rules is huge, so I want to limit the number of consequents. Here is the code I have tried; I also changed the Spark context parameters. Maximum Pattern Length fpGrowth (Apache) PySpark

from pyspark.sql.functions import col, size
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
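A minimal sketch of one workaround, since FPGrowth in pyspark.ml.fpm has no parameter for a maximum itemset size: filter the generated itemsets and rules after fitting. The transactions_df name and the thresholds are placeholders, not from the original post.

from pyspark.sql.functions import col, size
from pyspark.ml.fpm import FPGrowth

fp = FPGrowth(itemsCol="items", minSupport=0.2, minConfidence=0.5)
model = fp.fit(transactions_df)  # transactions_df: DataFrame with an "items" array column

# Keep only frequent itemsets with 2 or 3 items
small_itemsets = model.freqItemsets.filter(size(col("items")).between(2, 3))

# Similarly, keep only rules whose antecedent and consequent together stay short
small_rules = model.associationRules.filter(
    (size(col("antecedent")) + size(col("consequent"))) <= 3)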

datatype for handling big numbers in pyspark

懵懂的女人 Submitted on 2019-12-23 23:56:26
Question: I am using Spark with Python. After uploading a csv file, I needed to parse a column in the csv file which has numbers that are 22 digits long. For parsing that column I used LongType(). I used the map() function for defining the column. The following are my commands in pyspark:

>>> test = sc.textFile("test.csv")
>>> header = test.first()
>>> schemaString = header.replace('"','')
>>> testfields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(',')]
>>> testfields[5].dataType =
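For context, a signed 64-bit LongType only holds about 19 digits, so a 22-digit value overflows it; DecimalType (up to 38 digits of precision) is the usual alternative. A minimal sketch with an invented two-column schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("big_id", DecimalType(precision=22, scale=0), True),  # 22-digit integer
])

df = spark.read.csv("test.csv", header=True, schema=schema)
df.printSchema()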

Put comments in between multi-line statement (with line continuation)

北城以北 Submitted on 2019-12-23 23:25:45
Question: When I write the following pyspark command:

# comment 1
df = df.withColumn('explosion', explode(col('col1'))).filter(col('explosion')['sub_col1'] == 'some_string') \
# comment 2
.withColumn('sub_col2', from_unixtime(col('explosion')['sub_col2'])) \
# comment 3
.withColumn('sub_col3', from_unixtime(col('explosion')['sub_col3']))

I get the following error:

.withColumn('sub_col2', from_unixtime(col('explosion')['sub_col2']))
    ^
IndentationError: unexpected indent

Is there a way to write comments in
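A minimal sketch of the usual workaround, not taken from the truncated question or its answers: wrap the chained expression in parentheses instead of using backslash continuations, and ordinary # comments can then sit between the chained calls.

from pyspark.sql.functions import col, explode, from_unixtime

df = (
    df.withColumn('explosion', explode(col('col1')))
      .filter(col('explosion')['sub_col1'] == 'some_string')
      # comment 2: parse the second sub-column as a timestamp
      .withColumn('sub_col2', from_unixtime(col('explosion')['sub_col2']))
      # comment 3: same for the third sub-column
      .withColumn('sub_col3', from_unixtime(col('explosion')['sub_col3']))
)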

How to call a hive UDF written in Java using Pyspark from Hive Context

你说的曾经没有我的故事 Submitted on 2019-12-23 23:16:12
Question: I use the getLastProcessedVal2 UDF in Hive to get the latest partitions from a table. This UDF is written in Java. I would like to use the same UDF from pyspark using a Hive context.

dfsql_sel_nxt_batch_id_ini = sqlContext.sql('''
    select l4_xxxx_seee.getLastProcessedVal2("/data/l4/work/hive/l4__stge/proctl_stg","APP_AMLMKTE_L1","L1_AMLMKT_MDWE","TRE_EXTION","2.1")''')

Error:

ERROR exec.FunctionRegistry: Unable to load UDF class: java.lang.ClassNotFoundException:

Answer 1: Start your pyspark shell as:
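The answer is cut off above; a minimal sketch of the usual pattern (the jar path and Java class name below are placeholders, not from the original thread) is to ship the UDF jar with the session and register a temporary function before calling it from SQL.

# Typically launched as:  pyspark --jars /path/to/your-hive-udfs.jar
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()
         .config("spark.jars", "/path/to/your-hive-udfs.jar")
         .getOrCreate())

# Register the Java UDF under a temporary name, then call it from SQL
spark.sql("CREATE TEMPORARY FUNCTION getLastProcessedVal2 "
          "AS 'com.example.udf.GetLastProcessedVal2'")
spark.sql("SELECT getLastProcessedVal2('/data/l4/work/hive/l4__stge/proctl_stg', "
          "'APP_AMLMKTE_L1', 'L1_AMLMKT_MDWE', 'TRE_EXTION', '2.1')").show()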

H2O Target Mean Encoder “frames are being sent in the same order” ERROR

好久不见. Submitted on 2019-12-23 22:09:42
Question: I am following the H2O example to run target mean encoding in Sparkling Water (Sparkling Water 2.4.2 and H2O 3.22.04). The following paragraph all runs well:

from h2o.targetencoder import TargetEncoder

# change label to factor
input_df_h2o['label'] = input_df_h2o['label'].asfactor()

# add fold column for Target Encoding
input_df_h2o["cv_fold_te"] = input_df_h2o.kfold_column(n_folds = 5, seed = 54321)

# find all categorical features
cat_features = [k for (k,v) in input_df_h2o.types.items()
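For reference, a heavily hedged sketch of how the encoder is typically wired up after the setup above, following the TargetEncoder interface as it was documented around H2O 3.22; the parameter values are assumptions and should be checked against the installed version.

from h2o.targetencoder import TargetEncoder

# Assumed continuation: encode the categorical features against the factor label,
# using the k-fold column so each row is encoded out-of-fold.
te = TargetEncoder(x=cat_features, y='label', fold_column='cv_fold_te',
                   blended_avg=True, inflection_point=3, smoothing=1)
te.fit(frame=input_df_h2o)

encoded = te.transform(frame=input_df_h2o, holdout_type='kfold', seed=54321)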

How to create a table as select in pyspark.sql

谁说胖子不能爱 Submitted on 2019-12-23 20:46:11
Question: Is it possible to create a table on Spark using a select statement? I do the following:

import findspark
findspark.init()
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()
sqlCtx = SQLContext(sc)
spark_df = sqlCtx.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("./data/documents_topics.csv")
spark_df.registerTempTable("my_table")
sqlCtx.sql("CREATE TABLE my_table_2 AS SELECT * from my_table")

but I get the error /Users/user
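A minimal sketch of one way this usually works (the CSV path is reused from the question; everything else is illustrative): CREATE TABLE ... AS SELECT needs a catalog that can persist tables, so the session is built with Hive support enabled.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ctas-example")
         .enableHiveSupport()
         .getOrCreate())

spark_df = spark.read.csv("./data/documents_topics.csv", header=True, inferSchema=True)
spark_df.createOrReplaceTempView("my_table")

spark.sql("CREATE TABLE my_table_2 AS SELECT * FROM my_table")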

reading google bucket data in spark

回眸只為那壹抹淺笑 Submitted on 2019-12-23 20:18:00
Question: I have followed this blog post to read data stored in a Google bucket: https://cloud.google.com/dataproc/docs/connectors/install-storage-connector It worked fine; the following command

hadoop fs -ls gs://the-bucket-you-want-to-list

gave me the expected results. But when I tried reading data in pyspark using rdd = sc.textFile("gs://crawl_tld_bucket/"), it throws the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o20.partitions.
: java.io.IOException: No FileSystem
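A minimal sketch of the usual fix (the jar path is a placeholder): the "No FileSystem" error generally means the Spark job's Hadoop configuration does not know about the gs:// scheme even though the hadoop CLI does, so the GCS connector is wired in explicitly.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar")
         .config("spark.hadoop.fs.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
         .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
         .getOrCreate())

rdd = spark.sparkContext.textFile("gs://crawl_tld_bucket/")
print(rdd.take(5))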

Remove rows from dataframe based on condition in pyspark

元气小坏坏 Submitted on 2019-12-23 19:54:52
Question: I have one dataframe with two columns:

+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|   1| 2.1|
|   5|52.1|
|   2|62.9|
|  77|33.3|
+----+----+

I would like to create a new dataframe that takes only the rows where "value of col1" > "value of col2". Just as a note, col1 has long type and col2 has double type. The result should look like this:

+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+

Answer 1: Another possible way could be using the where function of the DataFrame. For example this: val
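The truncated answer appears to continue in Scala; a minimal pyspark equivalent, assuming the dataframe is named df, would be:

from pyspark.sql import functions as F

# Keep only rows where col1 (long) is greater than col2 (double);
# Spark promotes the long to double for the comparison.
filtered = df.where(F.col("col1") > F.col("col2"))
filtered.show()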