pyspark-sql

What does “Correlated scalar subqueries must be Aggregated” mean?

旧城冷巷雨未停 submitted on 2019-11-28 07:36:08
Question: I use Spark 2.0. I'd like to execute the following SQL query:

    val sqlText = """
      select f.ID as TID,
             f.BldgID as TBldgID,
             f.LeaseID as TLeaseID,
             f.Period as TPeriod,
             coalesce(
               (select f.ChargeAmt
                from Fact_CMCharges f
                where f.BldgID = Fact_CMCharges.BldgID
                limit 1), 0) as TChargeAmt1,
             f.ChargeAmt as TChargeAmt2,
             l.EFFDATE as TBreakDate
      from Fact_CMCharges f
      join CMRECC l on l.BLDGID = f.BldgID
                   and l.LEASID = f.LeaseID
                   and l.INCCAT = f.IncomeCat
                   and date_format(l.EFFDATE,'D') <> 1
                   and f.Period …
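A minimal sketch of the usual fix, assuming the intent is "pick a single ChargeAmt per building": Spark only accepts a correlated scalar subquery when its select list is a single aggregate, so wrapping the column in an aggregate such as max() (and dropping limit 1) satisfies the analyzer. The simplified column list and the session name spark are assumptions, not from the question.

    fixed_sql = """
    SELECT f.ID AS TID,
           COALESCE(
             (SELECT MAX(x.ChargeAmt)
                FROM Fact_CMCharges x
               WHERE x.BldgID = f.BldgID), 0) AS TChargeAmt1,
           f.ChargeAmt AS TChargeAmt2
      FROM Fact_CMCharges f
    """
    # The correlated subquery is now aggregated, so the analyzer no longer rejects it.
    spark.sql(fixed_sql).show()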

Applying a Window function to calculate differences in pySpark

自作多情 submitted on 2019-11-28 06:28:54
I am using pySpark and have set up my dataframe with two columns representing a daily asset price, as follows:

    ind = sc.parallelize(range(1, 5))
    prices = sc.parallelize([33.3, 31.1, 51.2, 21.3])
    data = ind.zip(prices)
    df = sqlCtx.createDataFrame(data, ["day", "price"])

Upon applying df.show() I get:

    +---+-----+
    |day|price|
    +---+-----+
    |  1| 33.3|
    |  2| 31.1|
    |  3| 51.2|
    |  4| 21.3|
    +---+-----+

Which is fine and all. I would like to have another column that contains the day-to-day returns of the price column, i.e. something like (price(day2) - price(day1)) / price(day1). After much research, I am told …
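A minimal sketch of the usual window-function approach, assuming the df defined above: lag() pulls the previous day's price, after which the return is an ordinary column expression.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Order by day; no partition column, which is fine for a tiny example but
    # pushes everything onto one partition for large data.
    w = Window.orderBy("day")

    returns = (df
               .withColumn("prev_price", F.lag("price").over(w))
               .withColumn("return",
                           (F.col("price") - F.col("prev_price")) / F.col("prev_price")))
    returns.show()   # the first day has no previous price, so its return is null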

Pyspark Replicate Row based on column value

时间秒杀一切 submitted on 2019-11-28 05:35:07
Question: I would like to replicate all rows in my DataFrame based on the value of a given column in each row, and then index each new row. Suppose I have:

    Column A | Column B
    T1       | 3
    T2       | 2

I want the result to be:

    Column A | Column B | Index
    T1       | 3        | 1
    T1       | 3        | 2
    T1       | 3        | 3
    T2       | 2        | 1
    T2       | 2        | 2

I was able to do something similar with fixed values, but not by using the information found in the column. My current working code for fixed values is:

    idx = [lit(i) for i in range(1, 10)]
    df = df.withColumn('Index', explode(array( idx )…
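A sketch of one way to drive the replication from the column itself, assuming Spark 2.4+ where sequence() is available; the short column names A and B stand in for "Column A" and "Column B".

    from pyspark.sql import functions as F

    df = spark.createDataFrame([("T1", 3), ("T2", 2)], ["A", "B"])

    # sequence(1, B) builds the array [1 .. B] for each row, and explode()
    # emits one copy of the row per element, which doubles as the index.
    result = df.withColumn("Index", F.explode(F.sequence(F.lit(1), F.col("B"))))
    result.show()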

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

柔情痞子 submitted on 2019-11-28 04:43:38
    import numpy as np
    df = spark.createDataFrame(
        [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
         (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
        ('session', "timestamp1", "id2"))

Expected output: a dataframe with the count of nan/null for each column.

Note: the previous questions I found on Stack Overflow only check for null and not nan; that's why I have created a new question. I know I can use the isnull() function in Spark to find the number of Null values in a Spark column, but how do I find Nan values in a Spark dataframe?

You can use the method shown here and replace isNull with isnan:
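A sketch of that pattern, assuming the df defined above; count() only counts rows where the when() condition holds, so a single select covers every column.

    from pyspark.sql import functions as F

    # For each column, count the rows that are NaN or null.
    counts = df.select([
        F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c)
        for c in df.columns
    ])
    counts.show()   # session: 0, timestamp1: 0, id2: 5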

Spark - Window with recursion? - Conditionally propagating values across rows

*爱你&永不变心* submitted on 2019-11-28 01:46:20
I have the following dataframe showing the revenue of purchases.

    +-------+--------+-------+
    |user_id|visit_id|revenue|
    +-------+--------+-------+
    |      1|       1|      0|
    |      1|       2|      0|
    |      1|       3|      0|
    |      1|       4|    100|
    |      1|       5|      0|
    |      1|       6|      0|
    |      1|       7|    200|
    |      1|       8|      0|
    |      1|       9|     10|
    +-------+--------+-------+

Ultimately I want the new column purch_revenue to show the revenue generated by the purchase in every row. As a workaround, I have also tried to introduce a purchase identifier purch_id which is incremented each time a purchase was made. So this is listed just as a reference.

    +-------+--------+-------+-------------+------…
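Without the truncated expected output it is not entirely clear in which direction the revenue should propagate; one common reading is that each purchase's revenue should be copied onto the zero-revenue visits leading up to it. A non-recursive sketch with two windows, under that assumption:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Running count of purchases at or after the current visit (note the
    # descending order) gives a group id that ties each purchase to the
    # zero-revenue visits preceding it.
    w_desc = Window.partitionBy("user_id").orderBy(F.col("visit_id").desc())
    df2 = df.withColumn(
        "purch_id",
        F.sum(F.when(F.col("revenue") > 0, 1).otherwise(0)).over(w_desc))

    # Within each group the only non-zero revenue is the purchase itself,
    # so max() propagates it to every row of the group.
    grp = Window.partitionBy("user_id", "purch_id")
    df3 = df2.withColumn("purch_revenue", F.max("revenue").over(grp))
    df3.orderBy("user_id", "visit_id").show()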

pyspark mysql jdbc load An error occurred while calling o23.load No suitable driver

时光总嘲笑我的痴心妄想 submitted on 2019-11-28 01:00:34
I use the docker image sequenceiq/spark on my Mac to study these spark examples. During the study process, I upgraded the spark inside that image to 1.6.1 according to this answer, and the error occurred when I start the Simple Data Operations example. Here is what happened: when I run

    df = sqlContext.read.format("jdbc").option("url", url).option("dbtable", "people").load()

it raises an error, and the full stack from the pyspark console is as follows:

    Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
    Type "help", "copyright", "credits" or "license" …
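"No suitable driver" usually means the MySQL JDBC driver jar is not on the driver/executor classpath. A sketch of the usual fix; the jar path, driver version, host and credentials below are placeholders, not values from the question.

    # Launch pyspark (or spark-submit) with the Connector/J jar on the classpath:
    #   pyspark --jars /path/to/mysql-connector-java-5.1.39-bin.jar \
    #           --driver-class-path /path/to/mysql-connector-java-5.1.39-bin.jar

    df = (sqlContext.read.format("jdbc")
          .option("url", "jdbc:mysql://dbhost:3306/mydb")
          .option("driver", "com.mysql.jdbc.Driver")   # name the driver class explicitly
          .option("dbtable", "people")
          .option("user", "dbuser")
          .option("password", "dbpass")
          .load())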

How to get max(date) from a given set of data grouped by some fields using pyspark?

时光怂恿深爱的人放手 submitted on 2019-11-28 00:43:55
I have the data in the dataframe as below:

    datetime             | userId | memberId | value
    2016-04-06 16:36:... | 1234   | 111      | 1
    2016-04-06 17:35:... | 1234   | 222      | 5
    2016-04-06 17:50:... | 1234   | 111      | 8
    2016-04-06 18:36:... | 1234   | 222      | 9
    2016-04-05 16:36:... | 4567   | 111      | 1
    2016-04-06 17:35:... | 4567   | 222      | 5
    2016-04-06 18:50:... | 4567   | 111      | 8
    2016-04-06 19:36:... | 4567   | 222      | 9

I need to find the max(datetime) grouped by userId, memberId. When I tried:

    df2 = df.groupBy('userId', 'memberId').max('datetime')

I get an error:

    org.apache.spark.sql.AnalysisException: "datetime" is not …
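A sketch of the standard fix, assuming the df above: GroupedData.max() only accepts numeric columns, whereas agg() with functions.max also works on timestamps and strings.

    from pyspark.sql import functions as F

    # max() inside agg() handles non-numeric (e.g. timestamp/string) columns.
    df2 = (df.groupBy("userId", "memberId")
             .agg(F.max("datetime").alias("max_datetime")))
    df2.show()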

PySpark - get row number for each row in a group

大憨熊 submitted on 2019-11-27 23:16:16
Using pyspark, I'd like to be able to group a spark dataframe, sort the group, and then provide a row number. So

    Group  Date
    A      2000
    A      2002
    A      2007
    B      1999
    B      2015

would become

    Group  Date  row_num
    A      2000  0
    A      2002  1
    A      2007  2
    B      1999  0
    B      2015  1

user8419108: Use a window function:

    from pyspark.sql.window import *
    from pyspark.sql.functions import row_number

    df.withColumn("row_num", row_number().over(Window.partitionBy("Group").orderBy("Date")))

The accepted solution almost has it right. Here is the solution based on the output requested in the question:

    df = spark.createDataFrame([("A", 2000), ("A", 2002), …
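The second answer is cut off above; a sketch of what it presumably continues into, given that the requested row_num starts at 0 while row_number() is 1-based:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    df = spark.createDataFrame(
        [("A", 2000), ("A", 2002), ("A", 2007), ("B", 1999), ("B", 2015)],
        ["Group", "Date"])

    # Subtract 1 so the numbering starts at 0, matching the requested output.
    w = Window.partitionBy("Group").orderBy("Date")
    df.withColumn("row_num", F.row_number().over(w) - 1).show()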

PySpark: compute row maximum of the subset of columns and add to an existing dataframe

烂漫一生 submitted on 2019-11-27 22:33:53
I would like to compute the maximum of a subset of columns for each row and add it as a new column to the existing DataFrame. I managed to do this in a very awkward way:

    def add_colmax(df, subset_columns, colnm):
        '''
        Calculate the maximum of the selected "subset_columns" from dataframe df
        for each row; a new column containing the row-wise maximum is added to
        dataframe df.
        df: dataframe. It must contain subset_columns as a subset of its columns.
        colnm: name of the new column containing the row-wise maximum of subset_columns.
        subset_columns: the subset of columns from w…
        '''
        from pyspark.sql.functions import …
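For reference, the less awkward route is usually greatest(), which computes a row-wise maximum directly. A minimal sketch; the column names below are placeholders, not from the question.

    from pyspark.sql import functions as F

    subset_columns = ["col_a", "col_b", "col_c"]   # hypothetical subset
    # greatest() takes the row-wise maximum across the listed columns.
    df = df.withColumn("row_max", F.greatest(*[F.col(c) for c in subset_columns]))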

Spark 2.0: Redefining SparkSession params through GetOrCreate and NOT seeing changes in WebUI

断了今生、忘了曾经 submitted on 2019-11-27 22:20:23
Question: I'm using Spark 2.0 with PySpark. I am redefining SparkSession parameters through the getOrCreate method that was introduced in 2.0:

    This method first checks whether there is a valid global default SparkSession, and if yes,
    return that one. If no valid global default SparkSession exists, the method creates a new
    SparkSession and assigns the newly created SparkSession as the global default.
    In case an existing SparkSession is returned, the config options specified in this builder
    will be applied …
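A sketch illustrating the behaviour in question, assuming a plain local PySpark session; the config keys and values are placeholders, not from the question.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("first").getOrCreate()

    # Calling getOrCreate() again returns the SAME session object; the builder's
    # config options are applied to it, but static settings such as executor
    # memory cannot change on a running session, so the WebUI keeps showing the
    # original values.
    spark2 = (SparkSession.builder
              .config("spark.sql.shuffle.partitions", "50")
              .getOrCreate())
    assert spark is spark2

    # A common workaround when static settings really must change:
    spark.stop()
    spark = (SparkSession.builder
             .config("spark.executor.memory", "2g")   # placeholder value
             .getOrCreate())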