pyspark

Submit a Python project as a Dataproc job

痴心易碎 submitted on 2020-06-08 19:13:33
Question: I have a Python project whose folder has the structure

    main_directory
      - lib
        - lib.py
      - run
        - script.py

script.py is

    from pyspark.sql import SparkSession
    from lib.lib import add_two

    spark = SparkSession \
        .builder \
        .master('yarn') \
        .appName('script') \
        .getOrCreate()

    print(add_two(1, 2))

and lib.py is

    def add_two(x, y):
        return x + y

I want to launch it as a Dataproc job in GCP. I have checked online, but I have not understood well how to do it. I am trying to launch the script with gcloud dataproc jobs submit pyspark --cluster=
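One way to run this layout on Dataproc is to zip the lib package and ship it alongside the driver script with --py-files. A minimal sketch, assuming the archive is built from the project root; the archive name and the cluster/region placeholders are illustrative:

    import shutil

    # Build lib.zip containing the lib/ package (lib/ should have an __init__.py
    # so it imports cleanly from the zip).
    shutil.make_archive("lib", "zip", root_dir="main_directory", base_dir="lib")

    # Then submit from a terminal (flags hedged; adjust cluster, region and paths):
    #   gcloud dataproc jobs submit pyspark main_directory/run/script.py \
    #       --cluster=<cluster-name> --region=<region> --py-files=lib.zip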

pySpark mapping multiple variables

天涯浪子 submitted on 2020-06-05 11:39:15
Question: The code below maps the values and column names of my reference df onto my actual dataset, finding exact matches; when an exact match is found, it returns the OutputItemNameByValue. However, I'm trying to add a rule that when PrimaryLookupAttributeValue = DEFAULT, the OutputItemNameByValue should also be returned. The solution I'm trying out is to create a new dataframe with null values, since no match was provided by the code below. The next step would then be to target the null values
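A rough sketch of one way to add the DEFAULT fallback: do the exact-match left join first, then coalesce its nulls against the reference row whose PrimaryLookupAttributeValue is DEFAULT. The frame names and the attribute_value join column are assumptions; only PrimaryLookupAttributeValue and OutputItemNameByValue come from the question.

    from pyspark.sql import functions as F

    # df is the actual dataset (its lookup column is called attribute_value here,
    # purely for illustration); ref is the reference df.

    # 1) Exact-match lookup; unmatched rows get a null exact_output.
    matched = (
        df.join(ref, df["attribute_value"] == ref["PrimaryLookupAttributeValue"], "left")
          .select(df["*"], ref["OutputItemNameByValue"].alias("exact_output"))
    )

    # 2) The single DEFAULT row of the reference, used as the fallback value.
    default_output = (
        ref.filter(F.col("PrimaryLookupAttributeValue") == "DEFAULT")
           .select(F.col("OutputItemNameByValue").alias("default_output"))
    )

    # 3) Fill the nulls left by step 1 with the DEFAULT output.
    result = (
        matched.crossJoin(default_output)
               .withColumn("OutputItemNameByValue",
                           F.coalesce(F.col("exact_output"), F.col("default_output")))
               .drop("exact_output", "default_output")
    )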

ETL 1.5 GB Dataframe within pyspark on AWS EMR

孤人 submitted on 2020-06-01 07:39:21
Question: I'm using an EMR cluster with 1 master (m5.2xlarge) and 4 core nodes (c5.2xlarge) and running a PySpark job on it that joins 5 fact tables (150 columns and 100k rows each) with 5 small dimension tables (10 columns each, fewer than 100 records). When I join all these tables, the resulting dataframe has 600 columns and 420k records (approximately 1.5 GB of data). Please suggest something here; I'm from a SQL and DWH background, hence I have used a single SQL query to join all 5 facts
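With dimension tables this small (well under 100 rows each), broadcast joins usually matter more than cluster sizing: they keep the wide fact tables from being shuffled for those joins. A rough sketch, with all table and key names assumed:

    from pyspark.sql import functions as F

    # fact1..fact5 and dim1..dim5 are assumed to be already-loaded DataFrames;
    # the join key names are illustrative.
    facts = (
        fact1.join(fact2, "fact_key")
             .join(fact3, "fact_key")
             .join(fact4, "fact_key")
             .join(fact5, "fact_key")
    )

    result = (
        facts.join(F.broadcast(dim1), "dim1_key")
             .join(F.broadcast(dim2), "dim2_key")
             .join(F.broadcast(dim3), "dim3_key")
             .join(F.broadcast(dim4), "dim4_key")
             .join(F.broadcast(dim5), "dim5_key")
    )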

Can't apply a pandas_udf in pyspark

我怕爱的太早我们不能终老 submitted on 2020-06-01 07:19:04
Question: I'm trying out some PySpark-related experiments in a Jupyter notebook attached to an AWS EMR instance. I have a Spark dataframe that reads data from S3 and then filters some of it out. Printing the schema using df1.printSchema() outputs this:

    root
     |-- idvalue: string (nullable = true)
     |-- locationaccuracyhorizontal: float (nullable = true)
     |-- hour: integer (nullable = true)
     |-- day: integer (nullable = true)
     |-- date: date (nullable = true)
     |-- is_weekend: boolean (nullable = true)
     |--
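For reference, a minimal scalar pandas_udf against the schema above; the rounding transformation is purely illustrative, and pandas UDFs additionally need pyarrow installed on the workers, which is a common stumbling block on EMR notebooks.

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.types import FloatType

    # Scalar pandas_udf: receives a pandas Series per batch and returns a Series
    # of the same length.
    @F.pandas_udf(FloatType())
    def round_accuracy(acc: pd.Series) -> pd.Series:
        return acc.round(1).astype("float32")

    df2 = df1.withColumn(
        "locationaccuracyhorizontal_rounded",
        round_accuracy(F.col("locationaccuracyhorizontal")),
    )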

PySpark - Explode columns into rows based on the type of the column

淺唱寂寞╮ submitted on 2020-06-01 05:36:26
Question: Given a Dataframe:

    +---+-----------+---------+-------+------------+
    | id|      score|tx_amount|isValid|    greeting|
    +---+-----------+---------+-------+------------+
    |  1|        0.2|    23.78|   true| hello_world|
    |  2|        0.6|    12.41|  false|byebye_world|
    +---+-----------+---------+-------+------------+

I want to explode these columns into a Row named "col_value", using the types of the input Dataframe. df.dtypes gives:

    [('id', 'int'), ('model_score', 'double'), ('tx_amount', 'double'), ('isValid', 'boolean'), ('greeting',
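A sketch of one way to do the unpivot, driven by df.dtypes. Values are cast to string so columns of different types can share a single col_value column; keeping the original types would need a different layout, so treat this as one option under that assumption.

    from pyspark.sql import functions as F

    # Every non-id column becomes a (col_name, col_value) pair; the list of
    # columns to unpivot comes from df.dtypes.
    value_cols = [c for c, t in df.dtypes if c != "id"]

    kv = F.explode(
        F.array(*[
            F.struct(
                F.lit(c).alias("col_name"),
                F.col(c).cast("string").alias("col_value"),
            )
            for c in value_cols
        ])
    )

    long_df = df.select("id", kv.alias("kv")).select("id", "kv.col_name", "kv.col_value")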

pySpark iterating repetitive variables

孤者浪人 submitted on 2020-06-01 04:02:13
Question: I have code that currently works, but I'm looking to make it more efficient and avoid hard coding:

1) Avoid hard coding: for NotDefined_filterDomainLookup I would like to reference the default_reference df for the corresponding Code and Name when Id = 4, instead of hard coding the Code and Name values.

2) I repeat the same code and process for Id/Code/Name. Is there a way to loop over all of that instead of coding each scenario? How can I iterate over the current logic?

Question 1 list of
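A sketch of both ideas with assumed names: default_reference, Id, Code and Name come from the question, while df, the Id = 4 fallbacks and the null-fill logic are illustrative stand-ins for the repeated block.

    from pyspark.sql import functions as F

    # 1) Look up the default Code and Name from default_reference where Id = 4,
    #    instead of hard coding the values.
    default_row = (
        default_reference.filter(F.col("Id") == 4)
                         .select("Code", "Name")
                         .first()
    )
    default_code, default_name = default_row["Code"], default_row["Name"]

    # 2) Apply the same logic to Id, Code and Name in a loop rather than
    #    repeating it once per column.
    defaults = {"Id": 4, "Code": default_code, "Name": default_name}

    out = df
    for attr, fallback in defaults.items():
        out = out.withColumn(
            attr,
            F.when(F.col(attr).isNull(), F.lit(fallback)).otherwise(F.col(attr)),
        )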

Loading data from GCS using Spark Local

◇◆丶佛笑我妖孽 submitted on 2020-05-31 04:08:06
Question: I am trying to read data from GCS buckets on my local machine, for testing purposes. I would like to sample some of the data in the cloud. I have downloaded the GCS Hadoop connector JAR and set up the SparkConf as follows:

    conf = SparkConf() \
        .setMaster("local[8]") \
        .setAppName("Test") \
        .set("spark.jars", "path/gcs-connector-hadoop2-latest.jar") \
        .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
        .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "path
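The piece most often missing for local gs:// reads is registering the connector's filesystem classes. A sketch of a configuration that typically works on top of the connector JAR; every path, keyfile and bucket name below is a placeholder:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (
        SparkConf()
        .setMaster("local[8]")
        .setAppName("Test")
        .set("spark.jars", "path/gcs-connector-hadoop2-latest.jar")
        # Register the gs:// filesystem provided by the connector.
        .set("spark.hadoop.fs.gs.impl",
             "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
        .set("spark.hadoop.fs.AbstractFileSystem.gs.impl",
             "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
        .set("spark.hadoop.google.cloud.auth.service.account.enable", "true")
        .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
             "path/to/keyfile.json")
    )

    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    df = spark.read.csv("gs://<bucket>/<prefix>/")  # format and path are illustrative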

Pyspark: Extract date from Datetime value

僤鯓⒐⒋嵵緔 submitted on 2020-05-28 15:20:54
Question: I am trying to figure out how to extract a date from a datetime value using PySpark SQL. The datetime values look like this:

    DateTime
    2018-05-21T00:00:00.000-04:00
    2016-02-22T02:00:02.234-06:00

When I load this into a Spark dataframe and try to extract the date (via Date(), or Timestamp() and then Date()), I always get the error that a date or timestamp value is expected but a DateTime value was provided. Can someone help me with retrieving the date from this value? I think you need to
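A minimal sketch of one way to get just the date out of those ISO-8601 strings, assuming the data is loaded as df and the column is the DateTime string shown above:

    from pyspark.sql import functions as F

    # Take the leading yyyy-MM-dd part of the string and parse it as a date,
    # which sidesteps the timezone-offset suffix entirely.
    df2 = df.withColumn(
        "Date",
        F.to_date(F.col("DateTime").substr(1, 10), "yyyy-MM-dd"),
    )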