pyspark

Customize large datasets comparison in PySpark

Submitted by 给你一囗甜甜゛ on 2020-06-29 04:23:11
Question: I'm using the code below to compare two dataframes and identify differences. However, I'm noticing that I'm simply overwriting my values (combine_df). My goal is to flag rows whose values differ, but I'm not sure what I'm doing wrong.

    # Find the overlapping columns in order to compare their values
    cols = set(module_df.columns) & (set(expected_df.columns))
    # Create filtered dataframes only with the overlapping columns
    filter_module = expected_df.select(list(cols))
    filter_expected = expected_df
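A minimal sketch of one way to flag differing rows, assuming the two dataframes share a key column to join on (the "id" column and the sample data below are assumptions, not from the question):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    module_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    expected_df = spark.createDataFrame([(1, "a"), (2, "c")], ["id", "val"])

    # Shared columns, excluding the join key
    cols = (set(module_df.columns) & set(expected_df.columns)) - {"id"}

    # Join on the key and compare each shared column; any mismatch sets the flag
    joined = module_df.alias("m").join(expected_df.alias("e"), on="id")
    mismatch = None
    for c in cols:
        cond = F.col("m." + c) != F.col("e." + c)
        mismatch = cond if mismatch is None else (mismatch | cond)

    combine_df = joined.withColumn("flag", F.when(mismatch, "different").otherwise("same"))
    combine_df.show()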

PySpark Kafka Error: Missing application resource

Submitted by 谁说胖子不能爱 on 2020-06-29 03:55:08
Question: The error below is triggered when I add the following dependency to the code:

    '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0,org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.1.1'

Below is the code:

    from pyspark.sql import SparkSession, Row
    from pyspark.context import SparkContext
    from kafka import KafkaConsumer
    import os
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0,org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2
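A sketch of one common fix, assuming the usual cause of "Missing application resource": when packages are passed through PYSPARK_SUBMIT_ARGS, spark-submit still expects an application resource at the end of the argument string, and appending pyspark-shell usually satisfies it. The variable must be set before the first SparkSession (and its JVM) is created:

    import os

    # End the argument string with "pyspark-shell" so spark-submit has an application resource
    os.environ['PYSPARK_SUBMIT_ARGS'] = (
        '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0 '
        'pyspark-shell'
    )

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("kafka-example").getOrCreate()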

Pyspark: Add the average as a new column to DataFrame

Submitted by 混江龙づ霸主 on 2020-06-28 05:33:08
Question: I am computing the mean of a column in a dataframe, but it results in all the values being zero. Can someone help me understand why this is happening? Following are the code and the table before and after the transformation of the column.

Before computing the mean and adding the "mean" column:

    result.select("dis_price_released").show(10)
    +------------------+
    |dis_price_released|
    +------------------+
    |               0.0|
    |               4.0|
    |               4.0|
    |               4.0|
    |               1.0|
    |               4.0|
    |               4.0|
    |               0.0|
    |               4.0|
    |               0.0|
    +------------------+

After computing the mean and adding the mean
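A minimal sketch of adding the column average to every row with a window that spans the whole dataframe (a crossJoin with the aggregated average works too); the column name follows the question, the sample data is made up:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    result = spark.createDataFrame([(0.0,), (4.0,), (1.0,)], ["dis_price_released"])

    # An empty partitionBy() makes the window cover the entire dataframe
    w = Window.partitionBy()
    result = result.withColumn("mean", F.avg("dis_price_released").over(w))
    result.show()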

How to compute difference between timestamps with PySpark Structured Streaming

Submitted by 耗尽温柔 on 2020-06-28 04:44:46
Question: I have the following problem with PySpark Structured Streaming. Every line in my stream data has a user ID and a timestamp. Now, for every line and for every user, I want to add a column with the difference between the timestamps. For example, suppose the first line that I receive says "User A, 08:00:00". If the second line says "User A, 08:00:10", then I want to add a column to the second line called "Interval" saying "10 seconds". Does anyone know how to achieve this? I tried to use the
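A sketch of the lag-based version of this, which works on a static DataFrame; it is not a drop-in answer for Structured Streaming, since streaming queries do not support ordinary (non-time) window functions and generally need stateful processing instead. The sample data and column names are assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("A", "2020-06-28 08:00:00"), ("A", "2020-06-28 08:00:10")],
        ["user_id", "ts"],
    ).withColumn("ts", F.to_timestamp("ts"))

    # Previous timestamp per user, ordered by time; difference in seconds
    w = Window.partitionBy("user_id").orderBy("ts")
    df = df.withColumn(
        "Interval",
        F.col("ts").cast("long") - F.lag("ts").over(w).cast("long"),
    )
    df.show()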

Pyspark got TypeError: can't pickle _abc_data objects

Submitted by 杀马特。学长 韩版系。学妹 on 2020-06-28 04:18:16
Question: I'm trying to generate predictions from a pickled model with PySpark. I get the model with the following command:

    model = deserialize_python_object(filename)

with deserialize_python_object(filename) defined as:

    import pickle

    def deserialize_python_object(filename):
        try:
            with open(filename, 'rb') as f:
                obj = pickle.load(f)
        except:
            obj = None
        return obj

The error log looks like:

    File "/Users/gmg/anaconda3/envs/env/lib/python3.7/site-packages/pyspark/sql/udf.py", line 189, in
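A sketch of one common workaround, under the assumption that the error comes from Spark trying to pickle the loaded model into a UDF closure: load the model inside mapPartitions so it is unpickled on each executor instead of being serialized by Spark. The file path, feature columns, and predict() interface below are hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["f1", "f2"])

    def predict_partition(rows):
        import pickle
        # The model file must be readable on every executor (e.g. shipped with --files);
        # the path here is a placeholder
        with open("/path/to/model.pkl", "rb") as f:
            model = pickle.load(f)
        for row in rows:
            features = [[row["f1"], row["f2"]]]
            yield (row["f1"], row["f2"], float(model.predict(features)[0]))

    predictions = df.rdd.mapPartitions(predict_partition).toDF(["f1", "f2", "prediction"])
    predictions.show()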

Optimize row access and transformation in pyspark

Submitted by 一世执手 on 2020-06-28 03:58:42
Question: I have a large dataset (5 GB) in JSON format in an S3 bucket. I need to transform the schema of the data and write the transformed data back to S3 using an ETL script. So I use a crawler to detect the schema, load the data into a PySpark dataframe, and change the schema. Now I iterate over every row in the dataframe and convert it to a dictionary, remove the null columns, then convert the dictionary to a string and write it back to S3. Following is the code:

    # df is the pyspark dataframe
    columns =
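A sketch of a DataFrame-native alternative to iterating over rows in Python, assuming the goal is to apply the schema change and drop null fields per record: Spark's JSON writer omits null fields by default, so the whole transformation can stay on the executors. The bucket paths and the renamed column are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Read the raw JSON, apply schema changes with DataFrame operations, then write back
    df = spark.read.json("s3://input-bucket/path/")
    transformed = df.withColumnRenamed("old_name", "new_name")   # example schema change

    # The JSON writer drops null fields per record by default
    transformed.write.mode("overwrite").json("s3://output-bucket/path/")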

How to build Spark data frame with filtered records from MongoDB?

Submitted by 自闭症网瘾萝莉.ら on 2020-06-28 03:02:31
Question: My application has been built using MongoDB as its data platform. One collection in the database holds a massive volume of data, and I have opted for Apache Spark to retrieve it and generate analytical data through calculation. I have configured the Spark Connector for MongoDB to communicate with MongoDB. I need to query the MongoDB collection using PySpark and build a dataframe consisting of the result set of the MongoDB query. Please suggest an appropriate solution.

Answer 1: You can load the data directly into a dataframe
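A sketch of how the truncated answer typically continues with the MongoDB Spark connector: read the collection as a dataframe and push the filter down to MongoDB as an aggregation pipeline (a plain DataFrame .filter() is also pushed down by the connector where possible). The URI, database, collection, and the $match condition are placeholders:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("mongo-filter")
        .config("spark.mongodb.input.uri", "mongodb://host:27017/mydb.mycollection")
        .getOrCreate()
    )

    # Push a $match stage down to MongoDB so only matching documents reach Spark;
    # older connector versions may need format("com.mongodb.spark.sql.DefaultSource")
    pipeline = '[{"$match": {"status": "active"}}]'
    df = spark.read.format("mongo").option("pipeline", pipeline).load()
    df.show()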

Add column to pyspark dataframe based on a condition [duplicate]

Submitted by 老子叫甜甜 on 2020-06-28 01:59:05
Question: This question already has answers here: Spark Equivalent of IF Then ELSE (4 answers). Closed last year. My data.csv file has three columns as given below, and I have converted this file to a PySpark dataframe.

      A    B   C
    | 1 | -3 | 4 |
    | 2 |  0 | 5 |
    | 6 |  6 | 6 |

I want to add another column D to the Spark dataframe, with the value Yes or No, based on the condition that if the corresponding value in column B is greater than 0 then Yes, otherwise No.

      A    B   C   D
    | 1 | -3 | 4 | No |
    | 2 |  0 | 5 | No |
    | 6 |  6 |
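A minimal sketch of the usual when/otherwise answer for this kind of conditional column; the dataframe is built inline to match the sample data:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, -3, 4), (2, 0, 5), (6, 6, 6)], ["A", "B", "C"])

    # D is "Yes" when B is greater than 0, otherwise "No"
    df = df.withColumn("D", F.when(F.col("B") > 0, "Yes").otherwise("No"))
    df.show()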

Pyspark transform method that's equivalent to the Scala Dataset#transform method

Submitted by 随声附和 on 2020-06-27 17:49:05
Question: The Spark Scala API has a Dataset#transform method that makes it easy to chain custom DataFrame transformations, like so:

    val weirdDf = df
      .transform(myFirstCustomTransformation)
      .transform(anotherCustomTransformation)

I don't see an equivalent transform method for PySpark in the documentation. Is there a PySpark way to chain custom transformations? If not, how can the pyspark.sql.DataFrame class be monkey patched to add a transform method?

Update: The transform method was added to PySpark as
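A sketch of chaining custom transformations with DataFrame.transform, which is available in PySpark 3.0 and later; the two transformation functions below are illustrative. On older versions, the monkey-patch idea from the question amounts to assigning a function of the same shape onto pyspark.sql.DataFrame.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["x"])

    def with_doubled(df):
        return df.withColumn("doubled", F.col("x") * 2)

    def with_label(df):
        return df.withColumn("label", F.lit("example"))

    # Each function takes a DataFrame and returns a DataFrame, so the calls chain cleanly
    weird_df = df.transform(with_doubled).transform(with_label)
    weird_df.show()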