pyspark

Probability of predictions using Spark LogisticRegressionWithLBFGS for multiclass classification

落爺英雄遲暮 Submitted on 2019-12-24 00:45:43
Question: I am using LogisticRegressionWithLBFGS() to train a model with multiple classes. The mllib documentation states that clearThreshold() can be used only if the classification is binary. Is there a way to do something similar for multiclass classification, in order to output the probabilities of each class for a given input to the model?

Answer 1: There are two ways to accomplish this. One is to create a method that assumes the responsibility of predictPoint in LogisticRegression.scala
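A minimal sketch of one workaround, not taken from the truncated answer above: the DataFrame-based pyspark.ml API exposes per-class probabilities directly through its probability column, so no predictPoint rewrite is needed. The tiny three-class training set below is invented for illustration.

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Hypothetical three-class training data: (label, features)
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.1, 1.2)),
     (1.0, Vectors.dense(2.3, 0.4)),
     (2.0, Vectors.dense(4.5, 3.2))],
    ["label", "features"])

lr = LogisticRegression(family="multinomial", maxIter=50)
model = lr.fit(train)

# The "probability" column holds a vector with one entry per class
model.transform(train).select("label", "probability", "prediction").show(truncate=False)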

Pyspark - Ranking columns keeping ties

耗尽温柔 Submitted on 2019-12-24 00:45:19
Question: I'm looking for a way to rank columns of a dataframe while preserving ties. Specifically, for this example I have a pyspark dataframe as follows, where I want to generate ranks for colA and colB (though I want to be able to rank any number of columns):

+------+----------+----+----+
|Entity|        id|colA|colB|
+------+----------+----+----+
|     a|8589934652|  21|  50|
|     b|       112|   9|  23|
|     c|8589934629|   9|  23|
|     d|8589934702|   8|  21|
|     e|        20|   2|  21|
|     f|8589934657|   2|   5|
|     g|8589934601|   1|   5|
|     h
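A minimal sketch of one way to do this, assuming the dataframe is called df and that "preserving ties" means rows with equal values share a rank (dense_rank behaviour); the descending sort order is also an assumption:

from pyspark.sql import Window
from pyspark.sql import functions as F

def add_ranks(df, cols):
    # Add a <col>_rank column for each requested column; tied values share a rank.
    for c in cols:
        w = Window.orderBy(F.desc(c))  # no partitioning: one global ranking per column
        df = df.withColumn(c + "_rank", F.dense_rank().over(w))
    return df

ranked = add_ranks(df, ["colA", "colB"])  # works for any number of columns
ranked.show()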

How to limit FPGrowth itemsets to just 2 or 3

好久不见. Submitted on 2019-12-24 00:25:07
Question: I am running the FPGrowth algorithm using pyspark in Python 3.6 in a Jupyter notebook. When I try to save the association rules, the output of generated rules is huge, so I want to limit the number of consequents. Here is the code I have tried; I also changed the Spark context parameters. Maximum Pattern Length fpGrowth (Apache) PySpark

from pyspark.sql.functions import col, size
from pyspark.ml.fpm import FPGrowth
from pyspark.sql import Row
from pyspark.context import SparkContext
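A minimal sketch of one workaround, since FPGrowth in pyspark.ml.fpm has no parameter for a maximum itemset size: filter the generated itemsets and rules after fitting. The transactions_df name and the thresholds are placeholders, not from the original post.

from pyspark.sql.functions import col, size
from pyspark.ml.fpm import FPGrowth

fp = FPGrowth(itemsCol="items", minSupport=0.2, minConfidence=0.5)
model = fp.fit(transactions_df)  # transactions_df: DataFrame with an "items" array column

# Keep only frequent itemsets with 2 or 3 items
small_itemsets = model.freqItemsets.filter(size(col("items")).between(2, 3))

# Similarly, keep only rules whose antecedent and consequent together stay short
small_rules = model.associationRules.filter(
    (size(col("antecedent")) + size(col("consequent"))) <= 3)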

datatype for handling big numbers in pyspark

懵懂的女人 Submitted on 2019-12-23 23:56:26
Question: I am using Spark with Python. After uploading a csv file, I needed to parse a column in the csv file which has numbers that are 22 digits long. For parsing that column I used LongType(). I used the map() function for defining the column. The following are my commands in pyspark:

>>> test = sc.textFile("test.csv")
>>> header = test.first()
>>> schemaString = header.replace('"','')
>>> testfields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(',')]
>>> testfields[5].dataType =
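For context, a signed 64-bit LongType only holds about 19 digits, so a 22-digit value overflows it; DecimalType (up to 38 digits of precision) is the usual alternative. A minimal sketch with an invented two-column schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("name", StringType(), True),
    StructField("big_id", DecimalType(precision=22, scale=0), True),  # 22-digit integer
])

df = spark.read.csv("test.csv", header=True, schema=schema)
df.printSchema()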

Put comments in between multi-line statement (with line continuation)

北城以北 Submitted on 2019-12-23 23:25:45
Question: When I write the following pyspark command:

# comment 1
df = df.withColumn('explosion', explode(col('col1'))).filter(col('explosion')['sub_col1'] == 'some_string') \
# comment 2
.withColumn('sub_col2', from_unixtime(col('explosion')['sub_col2'])) \
# comment 3
.withColumn('sub_col3', from_unixtime(col('explosion')['sub_col3']))

I get the following error:

.withColumn('sub_col2', from_unixtime(col('explosion')['sub_col2']))
    ^
IndentationError: unexpected indent

Is there a way to write comments in
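A minimal sketch of the usual workaround, not taken from the truncated question or its answers: wrap the chained expression in parentheses instead of using backslash continuations, and ordinary # comments can then sit between the chained calls.

from pyspark.sql.functions import col, explode, from_unixtime

df = (
    df.withColumn('explosion', explode(col('col1')))
      .filter(col('explosion')['sub_col1'] == 'some_string')
      # comment 2: parse the second sub-column as a timestamp
      .withColumn('sub_col2', from_unixtime(col('explosion')['sub_col2']))
      # comment 3: same for the third sub-column
      .withColumn('sub_col3', from_unixtime(col('explosion')['sub_col3']))
)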

How to call a hive UDF written in Java using Pyspark from Hive Context

你说的曾经没有我的故事 Submitted on 2019-12-23 23:16:12
Question: I use the getLastProcessedVal2 UDF in Hive to get the latest partitions from a table. This UDF is written in Java. I would like to use the same UDF from pyspark using a Hive context.

dfsql_sel_nxt_batch_id_ini = sqlContext.sql('''
    select l4_xxxx_seee.getLastProcessedVal2("/data/l4/work/hive/l4__stge/proctl_stg","APP_AMLMKTE_L1","L1_AMLMKT_MDWE","TRE_EXTION","2.1")''')

Error:

ERROR exec.FunctionRegistry: Unable to load UDF class: java.lang.ClassNotFoundException:

Answer 1: Start your pyspark shell as:
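The answer is cut off above; a minimal sketch of the usual pattern (the jar path and Java class name below are placeholders, not from the original thread) is to ship the UDF jar with the session and register a temporary function before calling it from SQL.

# Typically launched as:  pyspark --jars /path/to/your-hive-udfs.jar
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .enableHiveSupport()
         .config("spark.jars", "/path/to/your-hive-udfs.jar")
         .getOrCreate())

# Register the Java UDF under a temporary name, then call it from SQL
spark.sql("CREATE TEMPORARY FUNCTION getLastProcessedVal2 "
          "AS 'com.example.udf.GetLastProcessedVal2'")
spark.sql("SELECT getLastProcessedVal2('/data/l4/work/hive/l4__stge/proctl_stg', "
          "'APP_AMLMKTE_L1', 'L1_AMLMKT_MDWE', 'TRE_EXTION', '2.1')").show()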

H2O Target Mean Encoder “frames are being sent in the same order” ERROR

好久不见. Submitted on 2019-12-23 22:09:42
Question: I am following the H2O example to run target mean encoding in Sparkling Water (Sparkling Water 2.4.2 and H2O 3.22.04). The following paragraph all runs well:

from h2o.targetencoder import TargetEncoder

# change label to factor
input_df_h2o['label'] = input_df_h2o['label'].asfactor()

# add fold column for Target Encoding
input_df_h2o["cv_fold_te"] = input_df_h2o.kfold_column(n_folds = 5, seed = 54321)

# find all categorical features
cat_features = [k for (k,v) in input_df_h2o.types.items()
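For reference, a heavily hedged sketch of how the encoder is typically wired up after the setup above, following the TargetEncoder interface as it was documented around H2O 3.22; the parameter values are assumptions and should be checked against the installed version.

from h2o.targetencoder import TargetEncoder

# Assumed continuation: encode the categorical features against the factor label,
# using the k-fold column so each row is encoded out-of-fold.
te = TargetEncoder(x=cat_features, y='label', fold_column='cv_fold_te',
                   blended_avg=True, inflection_point=3, smoothing=1)
te.fit(frame=input_df_h2o)

encoded = te.transform(frame=input_df_h2o, holdout_type='kfold', seed=54321)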

How to create a table as select in pyspark.sql

谁说胖子不能爱 Submitted on 2019-12-23 20:46:11
Question: Is it possible to create a table on Spark using a select statement? I do the following:

import findspark
findspark.init()
import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()
sqlCtx = SQLContext(sc)
spark_df = sqlCtx.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("./data/documents_topics.csv")
spark_df.registerTempTable("my_table")
sqlCtx.sql("CREATE TABLE my_table_2 AS SELECT * from my_table")

but I get the error /Users/user
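A minimal sketch of one way this usually works (the CSV path is reused from the question; everything else is illustrative): CREATE TABLE ... AS SELECT needs a catalog that can persist tables, so the session is built with Hive support enabled.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ctas-example")
         .enableHiveSupport()
         .getOrCreate())

spark_df = spark.read.csv("./data/documents_topics.csv", header=True, inferSchema=True)
spark_df.createOrReplaceTempView("my_table")

spark.sql("CREATE TABLE my_table_2 AS SELECT * FROM my_table")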

reading google bucket data in spark

回眸只為那壹抹淺笑 Submitted on 2019-12-23 20:18:00
Question: I have followed this blog post to read data stored in a Google bucket: https://cloud.google.com/dataproc/docs/connectors/install-storage-connector It worked fine; the following command

hadoop fs -ls gs://the-bucket-you-want-to-list

gave me the expected results. But when I tried reading data in pyspark using rdd = sc.textFile("gs://crawl_tld_bucket/"), it throws the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o20.partitions.
: java.io.IOException: No FileSystem
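A minimal sketch of the usual fix (the jar path is a placeholder): the "No FileSystem" error generally means the Spark job's Hadoop configuration does not know about the gs:// scheme even though the hadoop CLI does, so the GCS connector is wired in explicitly.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar")
         .config("spark.hadoop.fs.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
         .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
         .getOrCreate())

rdd = spark.sparkContext.textFile("gs://crawl_tld_bucket/")
print(rdd.take(5))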

Remove rows from dataframe based on condition in pyspark

元气小坏坏 Submitted on 2019-12-23 19:54:52
Question: I have one dataframe with two columns:

+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|   1| 2.1|
|   5|52.1|
|   2|62.9|
|  77|33.3|
+----+----+

I would like to create a new dataframe that takes only the rows where "value of col1" > "value of col2". Just as a note, col1 has long type and col2 has double type. The result should look like this:

+----+----+
|col1|col2|
+----+----+
|  22|12.2|
|  77|33.3|
+----+----+

Answer 1: Another possible way could be using the where function of the DataFrame. For example this: val
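The truncated answer appears to continue in Scala; a minimal pyspark equivalent, assuming the dataframe is named df, would be:

from pyspark.sql import functions as F

# Keep only rows where col1 (long) is greater than col2 (double);
# Spark promotes the long to double for the comparison.
filtered = df.where(F.col("col1") > F.col("col2"))
filtered.show()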