pyspark-sql

pyspark approxQuantile function

Submitted by ε祈祈猫儿з on 2019-12-03 07:55:50
I have a dataframe with the columns id, price, and timestamp. I would like to find the median value grouped by id. I am using this code to find it, but it's giving me an error:

    from pyspark.sql import DataFrameStatFunctions as statFunc

    windowSpec = Window.partitionBy("id")
    median = statFunc.approxQuantile("price", [0.5], 0) \
        .over(windowSpec)
    return df.withColumn("Median", median)

Is it not possible to use DataFrameStatFunctions to fill values in a new column?

    TypeError: unbound method approxQuantile() must be called with
    DataFrameStatFunctions instance as first argument (got str instance instead)
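The error arises because approxQuantile must be called on a DataFrame instance (df.approxQuantile or df.stat.approxQuantile), runs as an action, and returns a plain Python list rather than a Column, so it cannot be combined with .over(windowSpec). A minimal sketch of an alternative, assuming the dataframe is named df with columns id and price: aggregate an approximate median per id with the percentile_approx SQL function and join it back as a new column.

    from pyspark.sql import functions as F

    # approxQuantile is an action on the DataFrame itself and returns a list,
    # e.g. [42.0]; it is not usable as a window/column expression
    overall_median = df.approxQuantile("price", [0.5], 0.01)

    # per-id medians, joined back to every row as a "Median" column
    medians = df.groupBy("id").agg(
        F.expr("percentile_approx(price, 0.5)").alias("Median"))
    df_with_median = df.join(medians, "id")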

Median / quantiles within PySpark groupBy

Submitted by 别等时光非礼了梦想. on 2019-12-03 03:39:39
Question: I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or an exact result would be fine. I prefer a solution that I can use within the context of groupBy / agg, so that I can mix it with other PySpark aggregate functions. If this is not possible for some reason, a different approach would be fine as well. This question is related but does not indicate how to use approxQuantile as an aggregate function. I also have access to the percentile_approx
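One sketch of the groupBy / agg route: the percentile_approx SQL function mentioned in the question can be wrapped in F.expr and mixed with ordinary aggregates. The dataframe and column names (df, id, price) are assumptions.

    from pyspark.sql import functions as F

    quantiles = df.groupBy("id").agg(
        F.expr("percentile_approx(price, 0.5)").alias("median_price"),
        F.expr("percentile_approx(price, array(0.25, 0.75))").alias("quartiles"),
        F.count("price").alias("n"))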

pyspark show dataframe as table with horizontal scroll in ipython notebook

Submitted by 拥有回忆 on 2019-12-03 02:16:26
A pyspark.sql.DataFrame displays messily with DataFrame.show() - lines wrap instead of scrolling - but it displays fine with pandas.DataFrame.head. I tried these options:

    import IPython
    IPython.auto_scroll_threshold = 9999

    from IPython.core.interactiveshell import InteractiveShell
    InteractiveShell.ast_node_interactivity = "all"

    from IPython.display import display

but no luck, although the scroll works when used within the Atom editor with the jupyter plugin. This is a workaround:

    spark_df.limit(5).toPandas().head()

although I do not know the computational burden of this query. I am thinking limit() is not expensive
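A sketch of the same workaround with the pandas display options widened, so the converted frame renders as a wide, horizontally scrollable HTML table in the notebook; limit(5) keeps the amount of data collected to the driver small. The option names below are standard pandas settings, not Spark ones.

    import pandas as pd

    # let pandas render every column instead of truncating the table
    pd.set_option("display.max_columns", None)
    pd.set_option("display.width", None)

    # only 5 rows are collected to the driver before conversion
    spark_df.limit(5).toPandas()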

Pyspark: filter dataframe by regex with string formatting?

Submitted by 故事扮演 on 2019-12-02 23:22:30
I've read several posts on using the "like" operator to filter a Spark dataframe by the condition of containing a string/expression, but was wondering whether the following is a "best practice" for using %s in the desired condition:

    input_path = <s3_location_str>
    my_expr = "Arizona.*hot"  # a regex expression
    dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx

    # is the following correct?
    substr = "'%%%s%%'" % my_keyword  # escape % via %% to get "%"
    dk = dx.filter("keyword like %s" % substr)
    # dk should contain rows with keyword values such as "Arizona is hot."

Note I'm
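One point worth noting: SQL LIKE only understands the % and _ wildcards, so a regex such as "Arizona.*hot" will not match through it. Column.rlike applies a regular expression directly and avoids the string formatting entirely; a sketch using the names from the question:

    from pyspark.sql import functions as F

    my_expr = "Arizona.*hot"  # a regex, not a LIKE pattern

    # rlike matches a regular expression against the column value
    dk = dx.filter(F.col("keyword").rlike(my_expr))

    # a plain substring test could instead use contains:
    # dk = dx.filter(F.col("keyword").contains("Arizona"))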

Median / quantiles within PySpark groupBy

Submitted by 走远了吗. on 2019-12-02 17:08:05
I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or an exact result would be fine. I prefer a solution that I can use within the context of groupBy / agg, so that I can mix it with other PySpark aggregate functions. If this is not possible for some reason, a different approach would be fine as well. This question is related but does not indicate how to use approxQuantile as an aggregate function. I also have access to the percentile_approx Hive UDF but I don't know how to use it as an aggregate function. For the sake of specificity, suppose I
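If an exact group median is wanted and a separate step outside agg is acceptable (the question allows a different approach), a grouped pandas UDF is one option on Spark 2.3+. The schema string and the column names id and price below are assumptions, and each group is materialized in pandas on the executors, so very large groups can be a problem.

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # one output row per group, holding the exact median computed in pandas
    @pandas_udf("id long, median_price double", PandasUDFType.GROUPED_MAP)
    def exact_median(pdf):
        return pd.DataFrame({"id": [pdf["id"].iloc[0]],
                             "median_price": [pdf["price"].median()]})

    medians = df.groupBy("id").apply(exact_median)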

Pyspark - Split a column and take n elements

Submitted by Deadly on 2019-12-02 13:11:37
Question: I want to take a column and split a string using a character. As usual, I understood that the method split would return a list, but when coding I found that the returned object had only the methods getItem and getField, with the following descriptions from the API:

    @since(1.3)
    def getItem(self, key):
        """
        An expression that gets an item at position ``ordinal`` out of a list,
        or gets an item by key out of a dict.

    @since(1.3)
    def getField(self, name):
        """
        An expression that gets a field by
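For context, split produces a Column of array type, so single elements come out with getItem, and a leading slice of the array can be taken with the SQL slice function (Spark 2.4+). The column name raw and the comma delimiter below are assumptions.

    from pyspark.sql import functions as F

    parts = F.split(F.col("raw"), ",")  # array<string> column

    df2 = (df
           .withColumn("first", parts.getItem(0))  # one element by position
           # first 3 elements; slice(array, start, length) is 1-based
           .withColumn("first_three", F.expr("slice(split(raw, ','), 1, 3)")))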

How do I truncate a PySpark dataframe of timestamp type to the day?

Submitted by 和自甴很熟 on 2019-12-02 12:00:22
Question: I have a PySpark dataframe that includes timestamps in a column (call the column 'dt'), like this:

    2018-04-07 16:46:00
    2018-03-06 22:18:00

When I execute:

    SELECT trunc(dt, 'day') as day

...I expected:

    2018-04-07 00:00:00
    2018-03-06 00:00:00

But I got:

    null
    null

How do I truncate to the day instead of the hour?

Answer 1: You are using the wrong function. trunc supports only a few formats:

    Returns date truncated to the unit specified by the format.
    :param format: 'year', 'yyyy', 'yy' or 'month', 'mon', 'mm'
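To illustrate the limitation described in the answer: trunc only understands year- and month-style formats, so asking it for 'day' falls through to null while 'month' works. A small sketch, with the column name dt taken from the question:

    from pyspark.sql import functions as F

    df.select(
        F.trunc("dt", "month").alias("month_start"),  # e.g. 2018-04-01
        F.trunc("dt", "day").alias("always_null"))    # 'day' is unsupported -> null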

How to increase the default precision and scale while loading data from oracle using spark-sql

Submitted by 懵懂的女人 on 2019-12-02 11:16:06
Question: I am trying to load data from an Oracle table where a few columns hold floating point values; sometimes a column holds up to DecimalType(40,20), i.e. 20 digits after the point. Currently, when I load these columns using:

    var local_ora_df: DataFrameReader = ora_df;
    local_ora_df.option("partitionColumn", "FISCAL_YEAR")
    local_ora_df
      .option("schema", schema)
      .option("dbtable", query)
      .load()

it holds 10 digits after the point, i.e. decimal(38,10) (nullable = true). If I want to increase digits after point while
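One knob worth knowing about: the JDBC reader accepts a customSchema option (Spark 2.3+) that overrides the inferred SQL type for the listed columns, which keeps the extra decimal digits from being rounded away at read time (a cast after loading only changes the declared type and cannot bring back digits already lost). The sketch below is PySpark rather than the question's Scala, and the column name AMOUNT and the variable jdbc_url are placeholders.

    # override the inferred decimal type at read time (Spark 2.3+ JDBC option);
    # "AMOUNT" and jdbc_url are placeholders
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", query)
          .option("customSchema", "AMOUNT DECIMAL(38,20)")
          .load())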

PySpark When item in list

Submitted by 梦想与她 on 2019-12-02 07:26:28
Following is the action I'm trying to achieve:

    types = ["200","300"]

    def Count(ID):
        cnd = F.when((**F.col("type") in types**), 1).otherwise(F.lit(0))
        return F.sum(cnd).alias("CountTypes")

The syntax in bold is not correct; any suggestions on how to get the right syntax here for PySpark?

I'm not sure about what you are trying to achieve, but here is the correct syntax:

    types = ["200","300"]
    from pyspark.sql import functions as F

    cnd = F.when(F.col("type").isin(types), F.lit(1)).otherwise(F.lit(0))
    sum_on_cnd = F.sum(cnd).alias("count_types")
    # Column<b'sum(CASE WHEN (type IN (200, 300)) THEN 1 ELSE
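A sketch of how the corrected expression plugs into an aggregation, assuming a dataframe df with columns ID and type (both assumptions beyond what the question shows):

    from pyspark.sql import functions as F

    types = ["200", "300"]
    cnd = F.when(F.col("type").isin(types), F.lit(1)).otherwise(F.lit(0))

    # number of rows per ID whose type is in the list
    result = df.groupBy("ID").agg(F.sum(cnd).alias("CountTypes"))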

How do I truncate a PySpark dataframe of timestamp type to the day?

Submitted by 烈酒焚心 on 2019-12-02 07:09:08
I have a PySpark dataframe that includes timestamps in a column (call the column 'dt'), like this:

    2018-04-07 16:46:00
    2018-03-06 22:18:00

When I execute:

    SELECT trunc(dt, 'day') as day

...I expected:

    2018-04-07 00:00:00
    2018-03-06 00:00:00

But I got:

    null
    null

How do I truncate to the day instead of the hour?

You are using the wrong function. trunc supports only a few formats:

    Returns date truncated to the unit specified by the format.
    :param format: 'year', 'yyyy', 'yy' or 'month', 'mon', 'mm'

Use date_trunc instead:

    Returns timestamp truncated to the unit specified by the format.
    :param format:
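A minimal sketch of the suggested fix, with the column name dt taken from the question; date_trunc is available from Spark 2.3 and accepts 'day' as a unit:

    from pyspark.sql import functions as F

    # note the argument order: the format comes first, then the timestamp column
    df.withColumn("day", F.date_trunc("day", F.col("dt")))

    # or directly in SQL, mirroring the question's query:
    # SELECT date_trunc('day', dt) AS day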