pyspark

Submit a Python project as a Dataproc job

痴心易碎 submitted on 2020-06-08 19:13:33
Question: I have a Python project whose folder has the structure

    main_directory
      - lib
        - lib.py
      - run
        - script.py

script.py is

    from pyspark.sql import SparkSession
    from lib.lib import add_two

    spark = SparkSession \
        .builder \
        .master('yarn') \
        .appName('script') \
        .getOrCreate()

    print(add_two(1, 2))

and lib.py is

    def add_two(x, y):
        return x + y

I want to launch it as a Dataproc job in GCP. I have checked online, but I have not understood well how to do it. I am trying to launch the script with gcloud dataproc jobs submit pyspark --cluster=
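One way to run this layout on Dataproc is to zip the lib package and ship it alongside the driver script with --py-files. A minimal sketch, assuming the archive is built from the project root; the archive name and the cluster/region placeholders are illustrative:

    import shutil

    # Build lib.zip containing the lib/ package (lib/ should have an __init__.py
    # so it imports cleanly from the zip).
    shutil.make_archive("lib", "zip", root_dir="main_directory", base_dir="lib")

    # Then submit from a terminal (flags hedged; adjust cluster, region and paths):
    #   gcloud dataproc jobs submit pyspark main_directory/run/script.py \
    #       --cluster=<cluster-name> --region=<region> --py-files=lib.zip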

pySpark mapping multiple variables

天涯浪子 submitted on 2020-06-05 11:39:15
Question: The code below maps the values and column names of my reference df onto my actual dataset, finding exact matches; when an exact match is found, it returns the OutputItemNameByValue. However, I'm trying to add a rule that when PrimaryLookupAttributeValue = DEFAULT, the OutputItemNameByValue should also be returned. The solution I'm trying out is to create a new dataframe with null values, since no match was provided by the code below. The next step would then be to target the null values
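A rough sketch of one way to add the DEFAULT fallback: do the exact-match left join first, then coalesce its nulls against the reference row whose PrimaryLookupAttributeValue is DEFAULT. The frame names and the attribute_value join column are assumptions; only PrimaryLookupAttributeValue and OutputItemNameByValue come from the question.

    from pyspark.sql import functions as F

    # df is the actual dataset (its lookup column is called attribute_value here,
    # purely for illustration); ref is the reference df.

    # 1) Exact-match lookup; unmatched rows get a null exact_output.
    matched = (
        df.join(ref, df["attribute_value"] == ref["PrimaryLookupAttributeValue"], "left")
          .select(df["*"], ref["OutputItemNameByValue"].alias("exact_output"))
    )

    # 2) The single DEFAULT row of the reference, used as the fallback value.
    default_output = (
        ref.filter(F.col("PrimaryLookupAttributeValue") == "DEFAULT")
           .select(F.col("OutputItemNameByValue").alias("default_output"))
    )

    # 3) Fill the nulls left by step 1 with the DEFAULT output.
    result = (
        matched.crossJoin(default_output)
               .withColumn("OutputItemNameByValue",
                           F.coalesce(F.col("exact_output"), F.col("default_output")))
               .drop("exact_output", "default_output")
    )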

ETL 1.5 GB Dataframe within pyspark on AWS EMR

孤人 submitted on 2020-06-01 07:39:21
Question: I'm using an EMR cluster with 1 master (m5.2xlarge) and 4 core nodes (c5.2xlarge) and running a PySpark job on it that joins 5 fact tables (150 columns and 100k rows each) with 5 small dimension tables (10 columns each, fewer than 100 records). When I join all these tables, the resulting dataframe has 600 columns and 420k records (approximately 1.5 GB of data). Please suggest something here; I'm from a SQL and DWH background, hence I have used a single SQL query to join all 5 facts
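With dimension tables this small (well under 100 rows each), broadcast joins usually matter more than cluster sizing: they keep the wide fact tables from being shuffled for those joins. A rough sketch, with all table and key names assumed:

    from pyspark.sql import functions as F

    # fact1..fact5 and dim1..dim5 are assumed to be already-loaded DataFrames;
    # the join key names are illustrative.
    facts = (
        fact1.join(fact2, "fact_key")
             .join(fact3, "fact_key")
             .join(fact4, "fact_key")
             .join(fact5, "fact_key")
    )

    result = (
        facts.join(F.broadcast(dim1), "dim1_key")
             .join(F.broadcast(dim2), "dim2_key")
             .join(F.broadcast(dim3), "dim3_key")
             .join(F.broadcast(dim4), "dim4_key")
             .join(F.broadcast(dim5), "dim5_key")
    )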

Can't apply a pandas_udf in pyspark

我怕爱的太早我们不能终老 submitted on 2020-06-01 07:19:04
Question: I'm trying out some PySpark-related experiments in a Jupyter notebook attached to an AWS EMR instance. I have a Spark dataframe that reads data from S3 and then filters some of it out. Printing the schema using df1.printSchema() outputs this:

    root
     |-- idvalue: string (nullable = true)
     |-- locationaccuracyhorizontal: float (nullable = true)
     |-- hour: integer (nullable = true)
     |-- day: integer (nullable = true)
     |-- date: date (nullable = true)
     |-- is_weekend: boolean (nullable = true)
     |--
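For reference, a minimal scalar pandas_udf against the schema above; the rounding transformation is purely illustrative, and pandas UDFs additionally need pyarrow installed on the workers, which is a common stumbling block on EMR notebooks.

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.types import FloatType

    # Scalar pandas_udf: receives a pandas Series per batch and returns a Series
    # of the same length.
    @F.pandas_udf(FloatType())
    def round_accuracy(acc: pd.Series) -> pd.Series:
        return acc.round(1).astype("float32")

    df2 = df1.withColumn(
        "locationaccuracyhorizontal_rounded",
        round_accuracy(F.col("locationaccuracyhorizontal")),
    )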

PySpark - Explode columns into rows based on the type of the column

淺唱寂寞╮ submitted on 2020-06-01 05:36:26
Question: Given a Dataframe:

    +---+-----------+---------+-------+------------+
    | id|      score|tx_amount|isValid|    greeting|
    +---+-----------+---------+-------+------------+
    |  1|        0.2|    23.78|   true| hello_world|
    |  2|        0.6|    12.41|  false|byebye_world|
    +---+-----------+---------+-------+------------+

I want to explode these columns into a Row named "col_value", using the types of the input Dataframe. df.dtypes gives:

    [('id', 'int'), ('model_score', 'double'), ('tx_amount', 'double'), ('isValid', 'boolean'), ('greeting',
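A sketch of one way to do the unpivot, driven by df.dtypes. Values are cast to string so columns of different types can share a single col_value column; keeping the original types would need a different layout, so treat this as one option under that assumption.

    from pyspark.sql import functions as F

    # Every non-id column becomes a (col_name, col_value) pair; the list of
    # columns to unpivot comes from df.dtypes.
    value_cols = [c for c, t in df.dtypes if c != "id"]

    kv = F.explode(
        F.array(*[
            F.struct(
                F.lit(c).alias("col_name"),
                F.col(c).cast("string").alias("col_value"),
            )
            for c in value_cols
        ])
    )

    long_df = df.select("id", kv.alias("kv")).select("id", "kv.col_name", "kv.col_value")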

pySpark iterating repetitive variables

孤者浪人 submitted on 2020-06-01 04:02:13
Question: I have code that currently works, but I'm looking to make it more efficient and avoid hard coding:

1) Avoid hard coding: for NotDefined_filterDomainLookup I would like to reference the default_reference df for the corresponding Code and Name when Id = 4, instead of hard coding the Code and Name values.

2) I repeat the same code and process for Id/Code/Name. Is there a way to loop over all of that instead of coding each scenario? How can I iterate over the current logic?

Question 1 list of
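A sketch of both ideas with assumed names: default_reference, Id, Code and Name come from the question, while df, the Id = 4 fallbacks and the null-fill logic are illustrative stand-ins for the repeated block.

    from pyspark.sql import functions as F

    # 1) Look up the default Code and Name from default_reference where Id = 4,
    #    instead of hard coding the values.
    default_row = (
        default_reference.filter(F.col("Id") == 4)
                         .select("Code", "Name")
                         .first()
    )
    default_code, default_name = default_row["Code"], default_row["Name"]

    # 2) Apply the same logic to Id, Code and Name in a loop rather than
    #    repeating it once per column.
    defaults = {"Id": 4, "Code": default_code, "Name": default_name}

    out = df
    for attr, fallback in defaults.items():
        out = out.withColumn(
            attr,
            F.when(F.col(attr).isNull(), F.lit(fallback)).otherwise(F.col(attr)),
        )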

Loading data from GCS using Spark Local

◇◆丶佛笑我妖孽 submitted on 2020-05-31 04:08:06
Question: I am trying to read data from GCS buckets on my local machine, for testing purposes. I would like to sample some of the data in the cloud. I have downloaded the GCS Hadoop connector JAR and set up the SparkConf as follows:

    conf = SparkConf() \
        .setMaster("local[8]") \
        .setAppName("Test") \
        .set("spark.jars", "path/gcs-connector-hadoop2-latest.jar") \
        .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
        .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "path
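The piece most often missing for local gs:// reads is registering the connector's filesystem classes. A sketch of a configuration that typically works on top of the connector JAR; every path, keyfile and bucket name below is a placeholder:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (
        SparkConf()
        .setMaster("local[8]")
        .setAppName("Test")
        .set("spark.jars", "path/gcs-connector-hadoop2-latest.jar")
        # Register the gs:// filesystem provided by the connector.
        .set("spark.hadoop.fs.gs.impl",
             "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
        .set("spark.hadoop.fs.AbstractFileSystem.gs.impl",
             "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
        .set("spark.hadoop.google.cloud.auth.service.account.enable", "true")
        .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
             "path/to/keyfile.json")
    )

    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    df = spark.read.csv("gs://<bucket>/<prefix>/")  # format and path are illustrative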

Pyspark: Extract date from Datetime value

僤鯓⒐⒋嵵緔 submitted on 2020-05-28 15:20:54
Question: I am trying to figure out how to extract a date from a datetime value using PySpark SQL. The datetime values look like this:

    DateTime
    2018-05-21T00:00:00.000-04:00
    2016-02-22T02:00:02.234-06:00

When I load this into a Spark dataframe and try to extract the date (via Date(), or Timestamp() and then Date()), I always get the error that a date or timestamp value is expected but a DateTime value was provided. Can someone help me with retrieving the date from this value? I think you need to
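A minimal sketch of one way to get just the date out of those ISO-8601 strings, assuming the data is loaded as df and the column is the DateTime string shown above:

    from pyspark.sql import functions as F

    # Take the leading yyyy-MM-dd part of the string and parse it as a date,
    # which sidesteps the timezone-offset suffix entirely.
    df2 = df.withColumn(
        "Date",
        F.to_date(F.col("DateTime").substr(1, 10), "yyyy-MM-dd"),
    )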