pyspark-sql

pyspark approxQuantile function

Submitted by ε祈祈猫儿з on 2019-12-03 07:55:50
I have a dataframe with the columns id, price, and timestamp. I would like to find the median value grouped by id. I am using this code to find it, but it's giving me an error:

    from pyspark.sql import DataFrameStatFunctions as statFunc

    windowSpec = Window.partitionBy("id")
    median = statFunc.approxQuantile("price", [0.5], 0) \
        .over(windowSpec)
    return df.withColumn("Median", median)

Is it not possible to use DataFrameStatFunctions to fill values in a new column?

    TypeError: unbound method approxQuantile() must be called with
    DataFrameStatFunctions instance as first argument (got str instance instead)
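The error arises because approxQuantile must be called on a DataFrame instance (df.approxQuantile or df.stat.approxQuantile), runs as an action, and returns a plain Python list rather than a Column, so it cannot be combined with .over(windowSpec). A minimal sketch of an alternative, assuming the dataframe is named df with columns id and price: aggregate an approximate median per id with the percentile_approx SQL function and join it back as a new column.

    from pyspark.sql import functions as F

    # approxQuantile is an action on the DataFrame itself and returns a list,
    # e.g. [42.0]; it is not usable as a window/column expression
    overall_median = df.approxQuantile("price", [0.5], 0.01)

    # per-id medians, joined back to every row as a "Median" column
    medians = df.groupBy("id").agg(
        F.expr("percentile_approx(price, 0.5)").alias("Median"))
    df_with_median = df.join(medians, "id")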

Median / quantiles within PySpark groupBy

Submitted by 别等时光非礼了梦想. on 2019-12-03 03:39:39
Question: I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or an exact result would be fine. I prefer a solution that I can use within the context of groupBy / agg, so that I can mix it with other PySpark aggregate functions. If this is not possible for some reason, a different approach would be fine as well. This question is related but does not indicate how to use approxQuantile as an aggregate function. I also have access to the percentile_approx
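One sketch of the groupBy / agg route: the percentile_approx SQL function mentioned in the question can be wrapped in F.expr and mixed with ordinary aggregates. The dataframe and column names (df, id, price) are assumptions.

    from pyspark.sql import functions as F

    quantiles = df.groupBy("id").agg(
        F.expr("percentile_approx(price, 0.5)").alias("median_price"),
        F.expr("percentile_approx(price, array(0.25, 0.75))").alias("quartiles"),
        F.count("price").alias("n"))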

pyspark show dataframe as table with horizontal scroll in ipython notebook

Submitted by 拥有回忆 on 2019-12-03 02:16:26
A pyspark.sql.DataFrame displays messily with DataFrame.show() - lines wrap instead of scrolling - but it displays fine with pandas.DataFrame.head. I tried these options:

    import IPython
    IPython.auto_scroll_threshold = 9999

    from IPython.core.interactiveshell import InteractiveShell
    InteractiveShell.ast_node_interactivity = "all"

    from IPython.display import display

but no luck, although the scroll works when used within the Atom editor with the jupyter plugin. This is a workaround:

    spark_df.limit(5).toPandas().head()

although I do not know the computational burden of this query. I am thinking limit() is not expensive
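A sketch of the same workaround with the pandas display options widened, so the converted frame renders as a wide, horizontally scrollable HTML table in the notebook; limit(5) keeps the amount of data collected to the driver small. The option names below are standard pandas settings, not Spark ones.

    import pandas as pd

    # let pandas render every column instead of truncating the table
    pd.set_option("display.max_columns", None)
    pd.set_option("display.width", None)

    # only 5 rows are collected to the driver before conversion
    spark_df.limit(5).toPandas()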

Pyspark: filter dataframe by regex with string formatting?

Submitted by 故事扮演 on 2019-12-02 23:22:30
I've read several posts on using the "like" operator to filter a Spark dataframe by the condition of containing a string/expression, but was wondering whether the following is a "best practice" for using %s in the desired condition:

    input_path = <s3_location_str>
    my_expr = "Arizona.*hot"  # a regex expression
    dx = sqlContext.read.parquet(input_path)  # "keyword" is a field in dx

    # is the following correct?
    substr = "'%%%s%%'" % my_keyword  # escape % via %% to get "%"
    dk = dx.filter("keyword like %s" % substr)
    # dk should contain rows with keyword values such as "Arizona is hot."

Note I'm
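One point worth noting: SQL LIKE only understands the % and _ wildcards, so a regex such as "Arizona.*hot" will not match through it. Column.rlike applies a regular expression directly and avoids the string formatting entirely; a sketch using the names from the question:

    from pyspark.sql import functions as F

    my_expr = "Arizona.*hot"  # a regex, not a LIKE pattern

    # rlike matches a regular expression against the column value
    dk = dx.filter(F.col("keyword").rlike(my_expr))

    # a plain substring test could instead use contains:
    # dk = dx.filter(F.col("keyword").contains("Arizona"))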

Median / quantiles within PySpark groupBy

Submitted by 走远了吗. on 2019-12-02 17:08:05
I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or an exact result would be fine. I prefer a solution that I can use within the context of groupBy / agg, so that I can mix it with other PySpark aggregate functions. If this is not possible for some reason, a different approach would be fine as well. This question is related but does not indicate how to use approxQuantile as an aggregate function. I also have access to the percentile_approx Hive UDF but I don't know how to use it as an aggregate function. For the sake of specificity, suppose I
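If an exact group median is wanted and a separate step outside agg is acceptable (the question allows a different approach), a grouped pandas UDF is one option on Spark 2.3+. The schema string and the column names id and price below are assumptions, and each group is materialized in pandas on the executors, so very large groups can be a problem.

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # one output row per group, holding the exact median computed in pandas
    @pandas_udf("id long, median_price double", PandasUDFType.GROUPED_MAP)
    def exact_median(pdf):
        return pd.DataFrame({"id": [pdf["id"].iloc[0]],
                             "median_price": [pdf["price"].median()]})

    medians = df.groupBy("id").apply(exact_median)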

Pyspark - Split a column and take n elements

Submitted by Deadly on 2019-12-02 13:11:37
Question: I want to take a column and split a string using a character. As usual, I understood that the method split would return a list, but when coding I found that the returned object had only the methods getItem and getField, with the following descriptions from the API:

    @since(1.3)
    def getItem(self, key):
        """
        An expression that gets an item at position ``ordinal`` out of a list,
        or gets an item by key out of a dict.

    @since(1.3)
    def getField(self, name):
        """
        An expression that gets a field by
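For context, split produces a Column of array type, so single elements come out with getItem, and a leading slice of the array can be taken with the SQL slice function (Spark 2.4+). The column name raw and the comma delimiter below are assumptions.

    from pyspark.sql import functions as F

    parts = F.split(F.col("raw"), ",")  # array<string> column

    df2 = (df
           .withColumn("first", parts.getItem(0))  # one element by position
           # first 3 elements; slice(array, start, length) is 1-based
           .withColumn("first_three", F.expr("slice(split(raw, ','), 1, 3)")))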

How do I truncate a PySpark dataframe of timestamp type to the day?

Submitted by 和自甴很熟 on 2019-12-02 12:00:22
Question: I have a PySpark dataframe that includes timestamps in a column (call the column 'dt'), like this:

    2018-04-07 16:46:00
    2018-03-06 22:18:00

When I execute:

    SELECT trunc(dt, 'day') as day

...I expected:

    2018-04-07 00:00:00
    2018-03-06 00:00:00

But I got:

    null
    null

How do I truncate to the day instead of the hour?

Answer 1: You are using the wrong function. trunc supports only a few formats:

    Returns date truncated to the unit specified by the format.
    :param format: 'year', 'yyyy', 'yy' or 'month', 'mon', 'mm'
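To illustrate the limitation described in the answer: trunc only understands year- and month-style formats, so asking it for 'day' falls through to null while 'month' works. A small sketch, with the column name dt taken from the question:

    from pyspark.sql import functions as F

    df.select(
        F.trunc("dt", "month").alias("month_start"),  # e.g. 2018-04-01
        F.trunc("dt", "day").alias("always_null"))    # 'day' is unsupported -> null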

How to increase the default precision and scale while loading data from oracle using spark-sql

Submitted by 懵懂的女人 on 2019-12-02 11:16:06
Question: I am trying to load data from an Oracle table where a few columns hold floating point values; sometimes a column holds up to DecimalType(40,20), i.e. 20 digits after the point. Currently, when I load these columns using:

    var local_ora_df: DataFrameReader = ora_df;
    local_ora_df.option("partitionColumn", "FISCAL_YEAR")
    local_ora_df
      .option("schema", schema)
      .option("dbtable", query)
      .load()

it holds 10 digits after the point, i.e. decimal(38,10) (nullable = true). If I want to increase digits after point while
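One knob worth knowing about: the JDBC reader accepts a customSchema option (Spark 2.3+) that overrides the inferred SQL type for the listed columns, which keeps the extra decimal digits from being rounded away at read time (a cast after loading only changes the declared type and cannot bring back digits already lost). The sketch below is PySpark rather than the question's Scala, and the column name AMOUNT and the variable jdbc_url are placeholders.

    # override the inferred decimal type at read time (Spark 2.3+ JDBC option);
    # "AMOUNT" and jdbc_url are placeholders
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", query)
          .option("customSchema", "AMOUNT DECIMAL(38,20)")
          .load())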

PySpark When item in list

Submitted by 梦想与她 on 2019-12-02 07:26:28
Following is the action I'm trying to achieve:

    types = ["200","300"]

    def Count(ID):
        cnd = F.when((**F.col("type") in types**), 1).otherwise(F.lit(0))
        return F.sum(cnd).alias("CountTypes")

The syntax in bold is not correct; any suggestions on how to get the right syntax here for PySpark?

I'm not sure about what you are trying to achieve, but here is the correct syntax:

    types = ["200","300"]
    from pyspark.sql import functions as F

    cnd = F.when(F.col("type").isin(types), F.lit(1)).otherwise(F.lit(0))
    sum_on_cnd = F.sum(cnd).alias("count_types")
    # Column<b'sum(CASE WHEN (type IN (200, 300)) THEN 1 ELSE
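A sketch of how the corrected expression plugs into an aggregation, assuming a dataframe df with columns ID and type (both assumptions beyond what the question shows):

    from pyspark.sql import functions as F

    types = ["200", "300"]
    cnd = F.when(F.col("type").isin(types), F.lit(1)).otherwise(F.lit(0))

    # number of rows per ID whose type is in the list
    result = df.groupBy("ID").agg(F.sum(cnd).alias("CountTypes"))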

How do I truncate a PySpark dataframe of timestamp type to the day?

Submitted by 烈酒焚心 on 2019-12-02 07:09:08
I have a PySpark dataframe that includes timestamps in a column (call the column 'dt'), like this:

    2018-04-07 16:46:00
    2018-03-06 22:18:00

When I execute:

    SELECT trunc(dt, 'day') as day

...I expected:

    2018-04-07 00:00:00
    2018-03-06 00:00:00

But I got:

    null
    null

How do I truncate to the day instead of the hour?

You are using the wrong function. trunc supports only a few formats:

    Returns date truncated to the unit specified by the format.
    :param format: 'year', 'yyyy', 'yy' or 'month', 'mon', 'mm'

Use date_trunc instead:

    Returns timestamp truncated to the unit specified by the format.
    :param format:
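A minimal sketch of the suggested fix, with the column name dt taken from the question; date_trunc is available from Spark 2.3 and accepts 'day' as a unit:

    from pyspark.sql import functions as F

    # note the argument order: the format comes first, then the timestamp column
    df.withColumn("day", F.date_trunc("day", F.col("dt")))

    # or directly in SQL, mirroring the question's query:
    # SELECT date_trunc('day', dt) AS day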