pyspark

how to properly use pyspark to send data to kafka broker?

Submitted on 2020-01-22 06:45:15
Question: I'm trying to write a simple pyspark job which receives data from a Kafka broker topic, applies some transformations to that data, and puts the transformed data onto a different Kafka broker topic. I have the following code, which reads data from a Kafka topic, but running the sendkafka function has no effect:

    from pyspark import SparkConf, SparkContext
    from operator import add
    import sys
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils
    import json
    from
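A common pattern for the writing side (not shown in the truncated snippet above) is to publish from inside foreachRDD/foreachPartition with a plain Kafka client such as kafka-python. The following is a minimal sketch under that assumption; the broker address, Zookeeper quorum, topic names and the transformation itself are placeholders, not values from the original question.

```python
# Minimal sketch: publish each partition's records from inside foreachRDD
# using the kafka-python client. Broker, Zookeeper, group and topic names
# are placeholders.
import json

from kafka import KafkaProducer          # pip install kafka-python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

def send_partition(records):
    # One producer per partition avoids trying to serialize the producer.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for record in records:
        producer.send("output-topic", json.dumps(record).encode("utf-8"))
    producer.flush()

sc = SparkContext(appName="kafka-roundtrip")
ssc = StreamingContext(sc, batchDuration=5)

stream = KafkaUtils.createStream(ssc, "localhost:2181", "consumer-group", {"input-topic": 1})
transformed = stream.map(lambda kv: {"value": kv[1].upper()})   # placeholder transformation
transformed.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition))

ssc.start()
ssc.awaitTermination()
```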

issue in encoding non-numeric feature to numeric in Spark and IPython

Submitted on 2020-01-22 02:14:10
Question: I am working on something where I have to make predictions for numeric data (monthly employee spending) using non-numeric features. I am using Spark MLlib's Random Forests algorithm. My feature data is in a dataframe which looks like this:

         _1      _2      _3             _4
    0    Level1  Male    New York       New York
    1    Level1  Male    San Fransisco  California
    2    Level2  Male    New York       New York
    3    Level1  Male    Columbus       Ohio
    4    Level3  Male    New York       New York
    5    Level4  Male    Columbus       Ohio
    6    Level5  Female  Stamford       Connecticut
    7    Level1
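One way to make such string features usable by a tree learner is to index them first. Below is a minimal sketch assuming the DataFrame-based pyspark.ml API; the column names _1 to _4 come from the question, while the pipeline, output names and handleInvalid setting are illustrative assumptions.

```python
# Minimal sketch: map each categorical string column to a numeric index
# with StringIndexer, then assemble the indexed columns into a feature
# vector suitable for a tree-based learner.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

categorical_cols = ["_1", "_2", "_3", "_4"]

indexers = [
    StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep")
    for c in categorical_cols
]
assembler = VectorAssembler(
    inputCols=[c + "_idx" for c in categorical_cols],
    outputCol="features",
)

pipeline = Pipeline(stages=indexers + [assembler])
indexed_df = pipeline.fit(df).transform(df)
indexed_df.select("features").show(truncate=False)
```

StringIndexer attaches nominal metadata to its output columns, so a downstream Random Forest can treat them as categorical rather than ordered values, provided maxBins is at least as large as the largest number of categories.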

Convert StringType to ArrayType in PySpark

Submitted on 2020-01-21 15:27:04
Question: I am trying to run the FPGrowth algorithm in PySpark on my dataset.

    from pyspark.ml.fpm import FPGrowth
    fpGrowth = FPGrowth(itemsCol="name", minSupport=0.5, minConfidence=0.6)
    model = fpGrowth.fit(df)

I am getting the following error:

    An error occurred while calling o2139.fit.
    : java.lang.IllegalArgumentException: requirement failed: The input column must be ArrayType, but got StringType.
    at scala.Predef$.require(Predef.scala:224)

My dataframe df is in the form:

    df.show(2)
    +---+---------+-----
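The error simply says FPGrowth needs an array of items per row, so the string column has to be split before fitting. A minimal sketch, assuming the items in name are separated by commas; the delimiter and the array_distinct de-duplication are assumptions, not details from the question.

```python
# Minimal sketch: convert the StringType "name" column into an ArrayType
# column by splitting on an assumed delimiter, then fit FPGrowth on it.
from pyspark.sql import functions as F
from pyspark.ml.fpm import FPGrowth

df_arr = df.withColumn("items", F.split(F.col("name"), ","))
# FPGrowth rejects transactions with duplicate items (array_distinct needs Spark 2.4+).
df_arr = df_arr.withColumn("items", F.array_distinct("items"))

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df_arr)
model.freqItemsets.show()
```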

spark dataframe filter operation

Submitted on 2020-01-21 14:34:22
Question: I have a Spark dataframe and a filter string to apply. The filter selects only some of the rows, but I would also like to know the reason why the remaining rows were not selected. Example:

    DataFrame columns: customer_id|col_a|col_b|col_c|col_d
    Filter string: col_a > 0 & col_b > 4 & col_c < 0 & col_d=0 etc...

reason_for_exclusion can be any string or letter, as long as it says why a particular row was excluded. I could split the filter string and apply each filter separately, but my filter string is huge and that would be inefficient, so
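One possible approach (a sketch, not taken from the post) is to split the filter string into its sub-conditions, evaluate each one once with expr(), and concatenate the names of the failed conditions into the reason column.

```python
# Minimal sketch, assuming the filter string splits on "&" into independent
# sub-conditions. Rows that pass every condition end up with an empty
# reason_for_exclusion, because concat_ws skips the null values produced by
# when() without an otherwise().
from pyspark.sql import functions as F

filter_string = "col_a > 0 & col_b > 4 & col_c < 0 & col_d = 0"
conditions = [c.strip() for c in filter_string.split("&")]

reason = F.concat_ws(
    "; ",
    *[F.when(~F.expr(c), F.lit(f"failed: {c}")) for c in conditions]
)

annotated = df.withColumn("reason_for_exclusion", reason)
excluded = annotated.filter(F.col("reason_for_exclusion") != "")
excluded.show(truncate=False)
```

Because every sub-condition is evaluated in a single projection, this avoids re-filtering the dataframe once per condition.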

Saving result of DataFrame show() to string in pyspark

Submitted on 2020-01-21 11:51:50
Question: I would like to capture the result of show in pyspark, similar to here and here. I was not able to find a solution with pyspark, only Scala.

    df.show()
    #+----+-------+
    #| age|   name|
    #+----+-------+
    #|null|Michael|
    #|  30|   Andy|
    #|  19| Justin|
    #+----+-------+

The ultimate purpose is to capture this as a string inside my logger.info call. I tried logger.info(df.show()), which only displays on the console.

Answer 1: You can build a helper function using the same approach as shown in the post you linked. Capturing
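The Scala answers referenced above go through the JVM DataFrame's showString method; in PySpark the same thing can be reached through the private _jdf handle. A minimal sketch, with the caveat that _jdf is not a public API and its signature has varied between Spark versions:

```python
# Minimal sketch: call the JVM-side showString() instead of df.show()
# (which prints to stdout and returns None), and pass the string to the
# logger already used in the question.
def show_to_string(df, n=20, truncate=20, vertical=False):
    return df._jdf.showString(n, truncate, vertical)

logger.info("\n" + show_to_string(df))  # `logger` is the question's logging handle
```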

Getting the leaf probabilities of a tree model in spark

Submitted on 2020-01-21 11:08:09
Question: I'm trying to refactor a trained Spark tree-based model (RandomForest or GBT classifiers) in such a way that it can be exported to environments without Spark. The toDebugString method is a good starting point. However, in the case of RandomForestClassifier, the string just shows the predicted class for each tree, without the relative probabilities. So, if you average the predictions over all the trees, you get a wrong result. An example: we have a DecisionTree represented in this way:
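One workaround sketch, assuming a fitted pyspark.ml RandomForestClassificationModel named rf_model and a test_df with a features column: score each member tree separately through the .trees property and average the per-tree probability vectors yourself, instead of averaging the hard class labels printed by toDebugString.

```python
# Minimal sketch: transform the data with every individual tree of the
# forest and average the per-tree class probabilities outside Spark.
import numpy as np

per_tree_probs = []
for tree in rf_model.trees:  # list of DecisionTreeClassificationModel
    probs = (
        tree.transform(test_df)
        .select("probability")
        .rdd.map(lambda row: row["probability"].toArray())
        .collect()
    )
    per_tree_probs.append(np.array(probs))

# Stack of shape (num_trees, num_rows, num_classes); average over trees.
forest_probs = np.mean(per_tree_probs, axis=0)
print(forest_probs[:5])
```

Averaging the per-tree probability vectors is much closer to what the forest reports in its own probability column than averaging the per-tree predicted classes.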

Detected cartesian product for INNER join on literal column in PySpark

Submitted on 2020-01-21 07:53:30
Question: The following code raises a "Detected cartesian product for INNER join" exception:

    first_df = spark.createDataFrame([{"first_id": "1"}, {"first_id": "1"}, {"first_id": "1"}, ])
    second_df = spark.createDataFrame([{"some_value": "????"}, ])
    second_df = second_df.withColumn("second_id", F.lit("1"))
    # If the next line is uncommented, then the JOIN is working fine.
    # second_df.persist()
    result_df = first_df.join(second_df, first_df.first_id == second_df.second_id, 'inner')
    data = result_df.collect()
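For reference, a minimal sketch of the two usual workarounds: the persist() call that the commented-out line in the question already points to, and explicitly allowing the cross join via configuration. Whether either fits a given workload is an assumption.

```python
# Minimal sketch of two workarounds for the "Detected cartesian product"
# exception when joining on a literal column.

# Workaround 1 (hinted at in the question itself): materialize second_df
# before joining; the join then plans as a normal equi-join.
second_df = second_df.persist()

# Workaround 2: tell Spark the potential cross join is intentional.
spark.conf.set("spark.sql.crossJoin.enabled", "true")

result_df = first_df.join(
    second_df, first_df.first_id == second_df.second_id, "inner"
)
print(result_df.collect())
```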

How do I run pyspark with jupyter notebook?

Submitted on 2020-01-21 05:47:06
Question: I am trying to fire up a Jupyter notebook when I run the command pyspark in the console. When I type it now, it only starts an interactive shell in the console. However, this is not convenient for typing long lines of code. Is there a way to connect the Jupyter notebook to the pyspark shell? Thanks.

Answer 1: Assuming you have Spark installed wherever you are going to run Jupyter, I'd recommend you use findspark. Once you pip install findspark, you can just

    import findspark
    findspark.init()
    import
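A minimal sketch completing the findspark approach that Answer 1 starts to describe, to be run in the first cell of a notebook launched with a plain jupyter notebook command; the Spark path is autodetected from SPARK_HOME unless passed explicitly.

```python
# Minimal sketch: make the locally installed Spark importable inside a
# Jupyter notebook, then create a session as usual.
import findspark
findspark.init()           # or findspark.init("/path/to/spark")

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("notebook").getOrCreate()
```

Alternatively, a commonly used option is to export PYSPARK_DRIVER_PYTHON=jupyter and PYSPARK_DRIVER_PYTHON_OPTS=notebook before running pyspark, so that the pyspark command itself opens the notebook.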

Pyspark Save dataframe to S3

Submitted on 2020-01-21 03:22:09
Question: I want to save a dataframe to S3, but when I save the file to S3, it creates an empty file named ${folder_name} in the folder in which I want to save the file. Syntax used to save the dataframe:

    f.write.parquet("s3n://bucket-name/shri/test")

It saves the file in the test folder, but it creates $test under shri. Is there a way I can save it without creating that extra folder?

Answer 1: I was able to do it by using the code below.

    df.write.parquet("s3a://bucket-name/shri/test.parquet", mode="overwrite")

Answer 2: As far as I know,
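For completeness, a minimal sketch around Answer 1: write through the s3a:// connector rather than s3n://, and supply credentials through the standard fs.s3a.* Hadoop settings if they are not already provided by the environment. The bucket name and key values are placeholders.

```python
# Minimal sketch: configure the s3a connector and write the dataframe.
# On EMR or with instance profiles the two credential lines are usually
# unnecessary.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

df.write.parquet("s3a://bucket-name/shri/test.parquet", mode="overwrite")
```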