pyspark

how to properly use pyspark to send data to kafka broker?

Submitted on 2020-01-22 06:45:15
Question: I'm trying to write a simple pyspark job which receives data from a Kafka broker topic, applies some transformations to that data, and puts the transformed data onto a different Kafka broker topic. I have the following code, which reads data from a Kafka topic, but running the sendkafka function has no effect:

    from pyspark import SparkConf, SparkContext
    from operator import add
    import sys
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils
    import json
    from
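A common pattern for the writing side (not shown in the truncated snippet above) is to publish from inside foreachRDD/foreachPartition with a plain Kafka client such as kafka-python. The following is a minimal sketch under that assumption; the broker address, Zookeeper quorum, topic names and the transformation itself are placeholders, not values from the original question.

```python
# Minimal sketch: publish each partition's records from inside foreachRDD
# using the kafka-python client. Broker, Zookeeper, group and topic names
# are placeholders.
import json

from kafka import KafkaProducer          # pip install kafka-python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def send_partition(records):
    # One producer per partition avoids trying to serialize the producer.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for record in records:
        producer.send("output-topic", json.dumps(record).encode("utf-8"))
    producer.flush()

sc = SparkContext(appName="kafka-roundtrip")
ssc = StreamingContext(sc, batchDuration=5)

stream = KafkaUtils.createStream(ssc, "localhost:2181", "consumer-group", {"input-topic": 1})
transformed = stream.map(lambda kv: {"value": kv[1].upper()})   # placeholder transformation
transformed.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))

ssc.start()
ssc.awaitTermination()
```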

issue in encoding non-numeric feature to numeric in Spark and IPython

Submitted on 2020-01-22 02:14:10
Question: I am working on something where I have to make predictions for numeric data (monthly employee spending) using non-numeric features. I am using Spark MLlib's Random Forests algorithm. My feature data is in a dataframe which looks like this:

         _1      _2      _3             _4
    0    Level1  Male    New York       New York
    1    Level1  Male    San Fransisco  California
    2    Level2  Male    New York       New York
    3    Level1  Male    Columbus       Ohio
    4    Level3  Male    New York       New York
    5    Level4  Male    Columbus       Ohio
    6    Level5  Female  Stamford       Connecticut
    7    Level1
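One way to make such string features usable by a tree learner is to index them first. Below is a minimal sketch assuming the DataFrame-based pyspark.ml API; the column names _1 to _4 come from the question, while the pipeline, output names and handleInvalid setting are illustrative assumptions.

```python
# Minimal sketch: map each categorical string column to a numeric index
# with StringIndexer, then assemble the indexed columns into a feature
# vector suitable for a tree-based learner.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

categorical_cols = ["_1", "_2", "_3", "_4"]

indexers = [
    StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
    for c in categorical_cols
]
assembler = VectorAssembler(
    inputCols=[c + "_idx" for c in categorical_cols],
    outputCol="features",
)

pipeline = Pipeline(stages=indexers + [assembler])
indexed_df = pipeline.fit(df).transform(df)
indexed_df.select("features").show(truncate=False)
```

StringIndexer attaches nominal metadata to its output columns, so a downstream Random Forest can treat them as categorical rather than ordered values, provided maxBins is at least as large as the largest number of categories.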

Convert StringType to ArrayType in PySpark

Submitted on 2020-01-21 15:27:04
Question: I am trying to run the FPGrowth algorithm in PySpark on my dataset.

    from pyspark.ml.fpm import FPGrowth
    fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5, minConfidence=0.6)
    model = fpGrowth.fit(df)

I am getting the following error:

    An error occurred while calling o2139.fit.
    : java.lang.IllegalArgumentException: requirement failed: The input column must be ArrayType, but got StringType.
    at scala.Predef$.require(Predef.scala:224)

My dataframe df is in the form:

    df.show(2)
    +---+---------+-----
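The error simply says FPGrowth needs an array of items per row, so the string column has to be split before fitting. A minimal sketch, assuming the items in name are separated by commas; the delimiter and the array_distinct de-duplication are assumptions, not details from the question.

```python
# Minimal sketch: convert the StringType "name" column into an ArrayType
# column by splitting on an assumed delimiter, then fit FPGrowth on it.
from pyspark.sql import functions as F
from pyspark.ml.fpm import FPGrowth

df_arr = df.withColumn("items", F.split(F.col("name"), ","))
# FPGrowth rejects transactions with duplicate items (array_distinct needs Spark 2.4+).
df_arr = df_arr.withColumn("items", F.array_distinct("items"))

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df_arr)
model.freqItemsets.show()
```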

spark dataframe filter operation

Submitted on 2020-01-21 14:34:22
Question: I have a Spark dataframe and a filter string to apply. The filter selects only some of the rows, but I would also like to know the reason why the remaining rows were not selected. Example:

    DataFrame columns: customer_id|col_a|col_b|col_c|col_d
    Filter string: col_a > 0 & col_b > 4 & col_c < 0 & col_d=0 etc...

reason_for_exclusion can be any string or letter, as long as it says why a particular row was excluded. I could split the filter string and apply each filter separately, but my filter string is huge and that would be inefficient, so
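One possible approach (a sketch, not taken from the post) is to split the filter string into its sub-conditions, evaluate each one once with expr(), and concatenate the names of the failed conditions into the reason column.

```python
# Minimal sketch, assuming the filter string splits on "&" into independent
# sub-conditions. Rows that pass every condition end up with an empty
# reason_for_exclusion, because concat_ws skips the null values produced by
# when() without an otherwise().
from pyspark.sql import functions as F

filter_string = "col_a > 0 & col_b > 4 & col_c < 0 & col_d = 0"
conditions = [c.strip() for c in filter_string.split("&")]

reason = F.concat_ws(
    "; ",
    *[F.when(~F.expr(c), F.lit(f"failed: {c}")) for c in conditions]
)

annotated = df.withColumn("reason_for_exclusion", reason)
excluded = annotated.filter(F.col("reason_for_exclusion") != "")
excluded.show(truncate=False)
```

Because every sub-condition is evaluated in a single projection, this avoids re-filtering the dataframe once per condition.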

Saving result of DataFrame show() to string in pyspark

Submitted on 2020-01-21 11:51:50
Question: I would like to capture the result of show in pyspark, similar to here and here. I was not able to find a solution with pyspark, only Scala.

    df.show()
    #+----+-------+
    #| age|   name|
    #+----+-------+
    #|null|Michael|
    #|  30|   Andy|
    #|  19| Justin|
    #+----+-------+

The ultimate purpose is to capture this as a string inside my logger.info call. I tried logger.info(df.show()), which only displays on the console.

Answer 1: You can build a helper function using the same approach as shown in the post you linked. Capturing
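The Scala answers referenced above go through the JVM DataFrame's showString method; in PySpark the same thing can be reached through the private _jdf handle. A minimal sketch, with the caveat that _jdf is not a public API and its signature has varied between Spark versions:

```python
# Minimal sketch: call the JVM-side showString() instead of df.show()
# (which prints to stdout and returns None), and pass the string to the
# logger already used in the question.
def show_to_string(df, n=20, truncate=20, vertical=False):
    return df._jdf.showString(n, truncate, vertical)

logger.info("\n" + show_to_string(df))  # `logger` is the question's logging handle
```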

Getting the leaf probabilities of a tree model in spark

Submitted on 2020-01-21 11:08:09
Question: I'm trying to refactor a trained Spark tree-based model (RandomForest or GBT classifiers) in such a way that it can be exported to environments without Spark. The toDebugString method is a good starting point. However, in the case of RandomForestClassifier, the string just shows the predicted class for each tree, without the relative probabilities. So, if you average the predictions over all the trees, you get a wrong result. An example: we have a DecisionTree represented in this way:
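One workaround sketch, assuming a fitted pyspark.ml RandomForestClassificationModel named rf_model and a test_df with a features column: score each member tree separately through the .trees property and average the per-tree probability vectors yourself, instead of averaging the hard class labels printed by toDebugString.

```python
# Minimal sketch: transform the data with every individual tree of the
# forest and average the per-tree class probabilities outside Spark.
import numpy as np

per_tree_probs = []
for tree in rf_model.trees:  # list of DecisionTreeClassificationModel
    probs = (
        tree.transform(test_df)
        .select("probability")
        .rdd.map(lambda row: row["probability"].toArray())
        .collect()
    )
    per_tree_probs.append(np.array(probs))

# Stack of shape (num_trees, num_rows, num_classes); average over trees.
forest_probs = np.mean(per_tree_probs, axis=0)
print(forest_probs[:5])
```

Averaging the per-tree probability vectors is much closer to what the forest reports in its own probability column than averaging the per-tree predicted classes.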

Detected cartesian product for INNER join on literal column in PySpark

Submitted on 2020-01-21 07:53:30
Question: The following code raises a "Detected cartesian product for INNER join" exception:

    first_df = spark.createDataFrame([{"first_id": "1"}, {"first_id": "1"}, {"first_id": "1"}, ])
    second_df = spark.createDataFrame([{"some_value": "????"}, ])
    second_df = second_df.withColumn("second_id", F.lit("1"))
    # If the next line is uncommented, then the JOIN is working fine.
    # second_df.persist()
    result_df = first_df.join(second_df, first_df.first_id == second_df.second_id, 'inner')
    data = result_df.collect()
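For reference, a minimal sketch of the two usual workarounds: the persist() call that the commented-out line in the question already points to, and explicitly allowing the cross join via configuration. Whether either fits a given workload is an assumption.

```python
# Minimal sketch of two workarounds for the "Detected cartesian product"
# exception when joining on a literal column.

# Workaround 1 (hinted at in the question itself): materialize second_df
# before joining; the join then plans as a normal equi-join.
second_df = second_df.persist()

# Workaround 2: tell Spark the potential cross join is intentional.
spark.conf.set("spark.sql.crossJoin.enabled", "true")

result_df = first_df.join(
    second_df, first_df.first_id == second_df.second_id, "inner"
)
print(result_df.collect())
```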

How do I run pyspark with jupyter notebook?

Submitted on 2020-01-21 05:47:06
Question: I am trying to fire up a Jupyter notebook when I run the command pyspark in the console. When I type it now, it only starts an interactive shell in the console. However, this is not convenient for typing long lines of code. Is there a way to connect the Jupyter notebook to the pyspark shell? Thanks.

Answer 1: Assuming you have Spark installed wherever you are going to run Jupyter, I'd recommend you use findspark. Once you pip install findspark, you can just

    import findspark
    findspark.init()
    import
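A minimal sketch completing the findspark approach that Answer 1 starts to describe, to be run in the first cell of a notebook launched with a plain jupyter notebook command; the Spark path is autodetected from SPARK_HOME unless passed explicitly.

```python
# Minimal sketch: make the locally installed Spark importable inside a
# Jupyter notebook, then create a session as usual.
import findspark
findspark.init()           # or findspark.init("/path/to/spark")

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("notebook").getOrCreate()
```

Alternatively, a commonly used option is to export PYSPARK_DRIVER_PYTHON=jupyter and PYSPARK_DRIVER_PYTHON_OPTS=notebook before running pyspark, so that the pyspark command itself opens the notebook.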

Pyspark Save dataframe to S3

Submitted on 2020-01-21 03:22:09
Question: I want to save a dataframe to S3, but when I save the file to S3, it creates an empty file named ${folder_name} in the folder in which I want to save the file. Syntax used to save the dataframe:

    f.write.parquet("s3n://bucket-name/shri/test")

It saves the file in the test folder, but it creates $test under shri. Is there a way I can save it without creating that extra folder?

Answer 1: I was able to do it by using the code below.

    df.write.parquet("s3a://bucket-name/shri/test.parquet", mode="overwrite")

Answer 2: As far as I know,
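For completeness, a minimal sketch around Answer 1: write through the s3a:// connector rather than s3n://, and supply credentials through the standard fs.s3a.* Hadoop settings if they are not already provided by the environment. The bucket name and key values are placeholders.

```python
# Minimal sketch: configure the s3a connector and write the dataframe.
# On EMR or with instance profiles the two credential lines are usually
# unnecessary.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

df.write.parquet("s3a://bucket-name/shri/test.parquet", mode="overwrite")
```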