pyspark-sql

Functions from custom module not working in PySpark, but they work when inputted in interactive mode

Question: I have a module that I've written containing functions that act on PySpark DataFrames. They do a transformation on columns in the DataFrame and then return a new DataFrame. Here is an example of the code, shortened to include only one of the functions: from pyspark.sql import functions as F from pyspark.sql import types as t import pandas as pd import numpy as np metadta=pd.DataFrame(pd.read_csv("metadata.csv")) # this contains metadata on my dataset def str2num(text): if type(text)==None or
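The excerpt cuts off before the cause, but a frequent reason functions defined in a separate module work interactively yet fail on a cluster is that the module file never reaches the executors. A minimal sketch of that fix follows; the module name my_transforms and its path are hypothetical, not taken from the question.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as t

spark = SparkSession.builder.appName("moduleUdfExample").getOrCreate()

# Ship the module file to every executor so the UDF can be unpickled there;
# the same effect can be had with spark-submit --py-files my_transforms.py.
spark.sparkContext.addPyFile("my_transforms.py")  # hypothetical path

import my_transforms  # hypothetical module containing str2num

df = spark.createDataFrame([("abc",), ("1.5",)], ["text"])

# Wrap the module-level function as a UDF; the return type here is an assumption.
str2num_udf = F.udf(my_transforms.str2num, t.DoubleType())
df.withColumn("num", str2num_udf(F.col("text"))).show()
```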

weekofyear() returning seemingly incorrect results for January 1

Question: I'm not quite sure why my code gives 52 as the answer for: weekofyear("01/JAN/2017"). Does anyone have a possible explanation for this? Is there a better way to do this? from pyspark.sql import SparkSession, functions spark = SparkSession.builder.appName('weekOfYear').getOrCreate() from pyspark.sql.functions import to_date df = spark.createDataFrame( [(1, "01/JAN/2017"), (2, "15/FEB/2017")], ("id", "date")) df.show() +---+-----------+ | id| date| +---+-----------+ | 1|01/JAN/2017| | 2|15/FEB
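The snippet is truncated, but the behaviour itself is expected: weekofyear uses the ISO-8601 week calendar, in which 2017-01-01 still belongs to week 52 of the previous ISO year. A small sketch under that assumption, with the question's sample rows:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("weekOfYear").getOrCreate()
# On Spark 3+ the new datetime parser is case-sensitive, so the uppercase "JAN"
# may need the legacy parser policy; left commented out here.
# spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

df = spark.createDataFrame([(1, "01/JAN/2017"), (2, "15/FEB/2017")], ("id", "date"))

# Parse the string into a proper date first, then ask for the ISO week number.
df = df.withColumn("parsed", F.to_date(F.col("date"), "dd/MMM/yyyy"))
df.withColumn("week", F.weekofyear(F.col("parsed"))).show()
# id 1 -> 52 (last ISO week of 2016), id 2 -> 7
```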

PySpark: Best practice to add more columns to a DataFrame

Question: Spark DataFrames have a method withColumn to add one new column at a time. To add multiple columns, a chain of withColumn calls is required. Is this the best practice? I feel that using mapPartitions has more advantages. Let's say I have a chain of three withColumn calls and then one filter to remove Rows based on certain conditions. These are four different operations (I am not sure if any of these are wide transformations, though). But I can do it all in one go if I use mapPartitions.
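For comparison, a common alternative to a long withColumn chain is a single select that builds every derived column at once; the sketch below uses made-up column names and expressions, not the asker's data.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("addColumns").getOrCreate()
df = spark.createDataFrame([(1, 2.0), (2, 3.0)], ["id", "value"])

# All new columns are declared in one projection, followed by the filter.
df2 = df.select(
    "*",
    (F.col("value") * 2).alias("doubled"),
    (F.col("value") + F.col("id")).alias("shifted"),
    F.when(F.col("value") > 2, "big").otherwise("small").alias("bucket"),
).filter(F.col("value") > 0)

df2.show()
```

Either way, Catalyst typically collapses consecutive narrow projections, so the practical difference is often readability rather than performance.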

pyspark dataframe add a column if it doesn't exist

Question: I have JSON data in various JSON files, and the keys can differ between lines, e.g. {"a":1 , "b":"abc", "c":"abc2", "d":"abc3"} {"a":1 , "b":"abc2", "d":"abc"} {"a":1 ,"b":"abc", "c":"abc2", "d":"abc3"} I want to aggregate data on columns 'b', 'c', 'd' and 'f', where 'f' is not present in the given JSON file but could be present in other files. So as column 'f' is not present, we can take an empty string for that column. I am reading the input file and aggregating the data like this import
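A rough sketch of the approach the question seems to be heading toward: after reading the JSON, add any expected key that the inferred schema lacks as an empty-string literal before aggregating. The file path and the final aggregation are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("addMissingColumns").getOrCreate()
df = spark.read.json("input.json")  # hypothetical path

# Any expected column absent from every record in this file gets an empty string.
for col_name in ["b", "c", "d", "f"]:
    if col_name not in df.columns:
        df = df.withColumn(col_name, F.lit(""))

df.groupBy("b", "c", "d", "f").count().show()
```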

Trim string column in PySpark dataframe

Question: I'm a beginner with Python and Spark. After creating a DataFrame from a CSV file, I would like to know how I can trim a column. I've tried: df = df.withColumn("Product", df.Product.strip()) df is my data frame, Product is a column in my table. But I always see the error: Column object is not callable Do you have any suggestions? Answer 1: Starting from version 1.5, Spark SQL provides two specific functions for trimming white space, ltrim and rtrim (search for "trim" in the DataFrame documentation); you'll
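A minimal sketch of the fix the answer describes: strip() is a plain Python string method and cannot be called on a Column, whereas pyspark.sql.functions provides trim (plus ltrim/rtrim for one side only). The sample data here is invented.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("trimExample").getOrCreate()
df = spark.createDataFrame([("  widget  ",), (" gadget",)], ["Product"])

# trim removes whitespace on both sides; ltrim/rtrim handle one side each.
df = df.withColumn("Product", F.trim(F.col("Product")))
df.show()
```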

How to get name of dataframe column in pyspark?

Question: In pandas, this can be done by column.name. But how can I do the same when it's a column of a Spark DataFrame? e.g. The calling program has a Spark DataFrame: spark_df >>> spark_df.columns ['admit', 'gre', 'gpa', 'rank'] This program calls my function: my_function(spark_df['rank']) In my_function, I need the name of the column, i.e. 'rank'. If it were a pandas DataFrame, inside my_function we could use >>> pandas_df['rank'].name 'rank' Answer 1: You can get the names from the schema by doing spark_df.schema
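A sketch of two options, based on the (truncated) answer: the DataFrame's column names come from its schema, and for a bare Column object the name can be pulled through the private _jc attribute, an internal API that may change between Spark versions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("columnName").getOrCreate()
spark_df = spark.createDataFrame([(1, 520, 3.1, 2)], ["admit", "gre", "gpa", "rank"])

# Column names of the whole DataFrame come from the schema.
print(spark_df.schema.names)          # ['admit', 'gre', 'gpa', 'rank']

def my_function(col):
    # Private API: for a plain column reference this returns the name, e.g. "rank".
    return col._jc.toString()

print(my_function(spark_df["rank"]))  # 'rank'
```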

PySpark When item in list

Question: Following is the action I'm trying to achieve: types = ["200","300"] def Count(ID): cnd = F.when((**F.col("type") in types**), 1).otherwise(F.lit(0)) return F.sum(cnd).alias("CountTypes") The syntax in bold is not correct; any suggestions on how to get the right syntax here for PySpark? Answer 1: I'm not sure what you are trying to achieve, but here is the correct syntax: types = ["200","300"] from pyspark.sql import functions as F cnd = F.when(F.col("type").isin(types),F.lit(1)).otherwise(F
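A self-contained version of the corrected syntax from the answer, with invented sample data so it runs end to end: Column.isin replaces the Python "in" operator inside F.when.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("isinExample").getOrCreate()
df = spark.createDataFrame([(1, "200"), (2, "300"), (3, "400")], ["ID", "type"])

types = ["200", "300"]
# 1 when the type is in the list, 0 otherwise; summing gives the count.
cnd = F.when(F.col("type").isin(types), F.lit(1)).otherwise(F.lit(0))

df.agg(F.sum(cnd).alias("CountTypes")).show()
```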

Casting a new derived column in a DataFrame from boolean to integer

Question: Suppose I have a DataFrame x with this schema: xSchema = StructType([ \ StructField("a", DoubleType(), True), \ StructField("b", DoubleType(), True), \ StructField("c", DoubleType(), True)]) I then have the DataFrame: DataFrame[a: double, b: double, c: double] I would like to have an integer derived column. I am able to create a boolean column: x = x.withColumn('y', (x.a-x.b)/x.c > 1) My new schema is: DataFrame[a: double, b: double, c: double, y: boolean] However, I would like column y to
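The excerpt stops before the solution, but the usual fix is to cast the boolean expression to an integer type when the column is created; a short sketch with invented values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("castExample").getOrCreate()
x = spark.createDataFrame([(4.0, 1.0, 2.0), (1.0, 1.0, 2.0)], ["a", "b", "c"])

# The comparison yields a boolean Column; .cast("integer") turns True/False into 1/0.
x = x.withColumn("y", ((x.a - x.b) / x.c > 1).cast("integer"))
x.printSchema()
x.show()
```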

Max and Min of Spark [duplicate]

Question: This question already has answers here: Find maximum row per group in Spark DataFrame (2 answers). Closed 3 years ago. I am new to Spark and I have some questions about the aggregation functions MAX and MIN in Spark SQL. In Spark SQL, when I use the MAX / MIN function, only MAX(value) / MIN(value) is returned. But what if I also want the other corresponding columns? For example, given a dataframe with columns time, value and label, how can I get the time with the MIN(value) grouped by label?
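A sketch of one standard pattern for this kind of problem (the sample data is invented): rank rows within each label by value with a window function and keep the top row, so the matching time column comes along with the minimum.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("minPerGroup").getOrCreate()
df = spark.createDataFrame(
    [("2017-01-01", 5.0, "a"), ("2017-01-02", 3.0, "a"), ("2017-01-01", 7.0, "b")],
    ["time", "value", "label"],
)

# Order each label's rows by value and keep only the first (smallest) one.
w = Window.partitionBy("label").orderBy(F.col("value").asc())
result = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))
result.show()  # one row per label, carrying the time of the minimum value
```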

Pyspark Unsupported literal type class java.util.ArrayList [duplicate]

Question: This question already has answers here: Passing a data frame column and external list to udf under withColumn (3 answers). Closed last year. I am using Python 3 on Spark (2.2.0). I want to apply my UDF to a specified list of strings. df = ['Apps A','Chrome', 'BBM', 'Apps B', 'Skype'] def calc_app(app, app_list): browser_list = ['Chrome', 'Firefox', 'Opera'] chat_list = ['WhatsApp', 'BBM', 'Skype'] sum = 0 for data in app: name = data['name'] if name in app_list: sum += 1 return sum calc_appUDF
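A sketch of the usual workaround: a plain Python list cannot be passed to a UDF as a column argument (hence "Unsupported literal type class java.util.ArrayList"), but it can be captured in a closure when the UDF is built. The sample data is simplified to an array of app names rather than the question's nested records.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as t

spark = SparkSession.builder.appName("udfListArg").getOrCreate()
df = spark.createDataFrame([(["Chrome", "BBM", "Apps A"],)], ["apps"])

def make_counter(app_list):
    # app_list is baked into the returned UDF instead of being passed as an
    # (unsupported) literal column at call time.
    def calc_app(apps):
        return sum(1 for name in apps if name in app_list)
    return F.udf(calc_app, t.IntegerType())

browser_counter = make_counter(["Chrome", "Firefox", "Opera"])
df.withColumn("browser_count", browser_counter(F.col("apps"))).show()
```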