pyspark-sql

show distinct column values in pyspark dataframe: python

╄→гoц情女王★ submitted on 2019-11-29 21:18:08
Please suggest a PySpark DataFrame alternative to Pandas' df['col'].unique(). I want to list all the unique values in a PySpark DataFrame column, not the SQL-type way (registerTempTable followed by a SQL query for distinct values). I also don't need groupBy -> countDistinct; instead I want to check the distinct values in that column. Let's assume we're working with the following representation of data (two columns, k and v, where k contains three entries, two of them unique):

    +---+---+
    |  k|  v|
    +---+---+
    |foo|  1|
    |bar|  2|
    |foo|  3|
    +---+---+

With a Pandas dataframe:

    import pandas as pd
    p_df = pd.DataFrame([("foo",
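The excerpt is cut off above; a minimal sketch of the DataFrame-native approach (no temp table, no Pandas), assuming the two-column df described in the question:

    # Distinct values of column k, collected back to the driver as a plain list.
    distinct_k = [row.k for row in df.select('k').distinct().collect()]
    print(distinct_k)  # e.g. ['foo', 'bar'] (order is not guaranteed)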

Window function is not working on Pyspark sqlcontext

强颜欢笑 submitted on 2019-11-29 17:29:54
I have a data frame and I want to roll the data up into 7-day windows and run some aggregations on a few of the columns. I have a PySpark SQL dataframe like:

    |Sale_Date |P_1|P_2|P_3|G_1|G_2|G_3|Total_Sale|Sale_Amt|Promo_Disc_Amt|
    |2013-04-10|  1|  9|  1|  1|  1|  1|         1|   295.0|           0.0|
    |2013-04-11|  1|  9|  1|  1|  1|  1|         3|   567.0|           0.0|
    |2013-04-12|  1|  9|  1|  1|  1|  1|         2|   500.0|         200.0|
    |2013-04-13|  1|  9|  1|  1|  1|  1|         1|   245.0|          20.0|
    |2013-04-14|  1|  9|  1|  1|  1|  1|         1|   245.0|           0.0|
    |2013-04-15|  1|  9|  1|  1|  1|  1|         2|   500.0|         200.0|
    |2013-04-16|  1|  9|  1|  1|  1|  1|         1|   250.0|           0.0|

I have applied a window function over the data frame as
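The excerpt stops before the window definition, but a minimal sketch of a 7-day rolling aggregation (assuming the Sale_Date and Total_Sale columns shown above) could look like:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # Range-based window covering the previous 6 days plus the current day,
    # expressed in seconds on the epoch value of Sale_Date.
    day = 86400
    w = (Window
         .orderBy(F.col("Sale_Date").cast("timestamp").cast("long"))
         .rangeBetween(-6 * day, 0))

    df_7d = df.withColumn("Total_Sale_7d", F.sum("Total_Sale").over(w))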

Spark SQL get max & min dynamically from datasource

回眸只為那壹抹淺笑 submitted on 2019-11-29 15:55:08
I am using Spark SQL and want to fetch the whole data set every day from an Oracle table (consisting of more than 1800k records). The application was hanging when I read from Oracle, so I used the partitionColumn, lowerBound and upperBound options. But the problem is: how can I get the lowerBound and upperBound values of the primary key column dynamically? The values of lowerBound and upperBound change every day, so how can I get the boundary values of the primary key column dynamically? Can anyone guide me with a sample example for my problem?

Just fetch the required values from the database:

    url = ...
    properties =
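The answer is cut off above; a minimal sketch of the idea, where url, properties, the table name MY_TABLE, the key column ID and the partition count are placeholders standing in for your own values:

    # Read the current min/max of the primary key with a small JDBC query,
    # then use them as the partitioning bounds for the full read.
    bounds = spark.read.jdbc(
        url=url,
        table="(SELECT MIN(ID) lo, MAX(ID) hi FROM MY_TABLE) t",
        properties=properties
    ).first()

    df = spark.read.jdbc(
        url=url,
        table="MY_TABLE",
        column="ID",
        lowerBound=bounds["lo"],
        upperBound=bounds["hi"],
        numPartitions=20,
        properties=properties
    )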

PySpark Numeric Window Group By

末鹿安然 submitted on 2019-11-29 15:19:43
I'd like to be able to have Spark group by a step size, as opposed to just single values. Is there anything in Spark similar to PySpark 2.x's window function for numeric (non-date) values? Something along the lines of:

    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame([10, 11, 12, 13], "integer").toDF("foo")
    res = df.groupBy(window("foo", step=2, start=10)).count()

hi-zir: You can reuse the timestamp-based one and express the parameters in seconds. Tumbling:

    from pyspark.sql.functions import col, window

    df.withColumn(
        "window",
        window(
            col("foo").cast("timestamp"),
            windowDuration="2 seconds"
        ).cast
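The answer code is cut off at the final cast; a sketch of how the tumbling variant might be completed (the target struct type for the cast is an assumption):

    from pyspark.sql.functions import col, window

    # Bucket the numeric foo column into width-2 tumbling windows by treating
    # the values as epoch seconds, then cast the window struct back to numbers.
    binned = df.withColumn(
        "window",
        window(col("foo").cast("timestamp"), windowDuration="2 seconds")
            .cast("struct<start:bigint,end:bigint>")
    )
    binned.groupBy("window").count().show()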

What does “Correlated scalar subqueries must be Aggregated” mean?

可紊 submitted on 2019-11-29 14:08:18
I use Spark 2.0. I'd like to execute the following SQL query:

    val sqlText = """
    select f.ID as TID,
           f.BldgID as TBldgID,
           f.LeaseID as TLeaseID,
           f.Period as TPeriod,
           coalesce(
             (select f.ChargeAmt from Fact_CMCharges f
              where f.BldgID = Fact_CMCharges.BldgID limit 1), 0) as TChargeAmt1,
           f.ChargeAmt as TChargeAmt2,
           l.EFFDATE as TBreakDate
    from Fact_CMCharges f
    join CMRECC l
      on l.BLDGID = f.BldgID
     and l.LEASID = f.LeaseID
     and l.INCCAT = f.IncomeCat
     and date_format(l.EFFDATE,'D') <> 1
     and f.Period = EFFDateInt(l.EFFDATE)
    where f.ActualProjected = 'Lease'
    except(
      select * from TT1 t2 left semi join Fact
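The error in the title means that Spark requires a correlated scalar subquery to provably return a single value, which it enforces by requiring an aggregate rather than a limit 1. A sketch of that rewrite, trimmed to the relevant columns from the query above (the choice of first() as the aggregate is an assumption):

    # Spark 2.x: wrap the correlated scalar subquery in an aggregate
    # (e.g. first/max) so the optimizer can prove it yields one row.
    fixed_sql = """
        select f.ID as TID,
               coalesce(
                 (select first(f2.ChargeAmt)
                  from Fact_CMCharges f2
                  where f2.BldgID = f.BldgID), 0) as TChargeAmt1
        from Fact_CMCharges f
        where f.ActualProjected = 'Lease'
    """
    result = spark.sql(fixed_sql)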

Pyspark Replicate Row based on column value

折月煮酒 submitted on 2019-11-29 12:18:01
I would like to replicate every row in my DataFrame based on the value of a given column in that row, and then index each new row. Suppose I have:

    Column A | Column B
    T1       | 3
    T2       | 2

I want the result to be:

    Column A | Column B | Index
    T1       | 3        | 1
    T1       | 3        | 2
    T1       | 3        | 3
    T2       | 2        | 1
    T2       | 2        | 2

I was able to do something similar with fixed values, but not by using the information found in the column. My current working code for fixed values is:

    idx = [lit(i) for i in range(1, 10)]
    df = df.withColumn('Index', explode(array(idx)))

I tried to change lit(i) for i in range(1, 10) to lit(i) for i in range(1, df['Column B']) and add
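A range expression cannot take a Column, so the per-row replication has to happen on the DataFrame side. A sketch for Spark 2.4+ (column names taken from the question; older versions would need a UDF to build the range):

    from pyspark.sql import functions as F

    # Build the array [1 .. Column B] for each row, then explode it so every
    # row is repeated Column B times with its own Index value.
    result = df.withColumn("Index", F.explode(F.expr("sequence(1, `Column B`)")))
    result.show()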

Apache spark dealing with case statements

大城市里の小女人 submitted on 2019-11-29 12:07:29
Question: I am transforming SQL code to PySpark code and came across some SQL statements. I don't know how to approach CASE statements in PySpark. I am planning on creating an RDD and then using rdd.map to do some logic checks. Is that the right approach? Please help! Basically I need to go through each line in the RDD or DF and, based on some logic, edit one of the column values.

    case when (e."a" Like 'a%' Or e."b" Like 'b%') And e."aa"='BW' And cast(e."abc" as decimal(10,4))
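Dropping to an RDD is not necessary: CASE WHEN logic maps directly onto when()/otherwise() on DataFrame columns. A sketch using the column names from the fragment above (the comparison value and the output labels are assumptions, since the original statement is cut off):

    from pyspark.sql import functions as F

    # Equivalent of: CASE WHEN (a LIKE 'a%' OR b LIKE 'b%') AND aa = 'BW'
    #                          AND CAST(abc AS decimal(10,4)) > 0 THEN ... END
    df2 = df.withColumn(
        "flag",
        F.when(
            (F.col("a").like("a%") | F.col("b").like("b%"))
            & (F.col("aa") == "BW")
            & (F.col("abc").cast("decimal(10,4)") > 0),
            "matched"
        ).otherwise("not_matched")
    )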

How to cast DataFrame with Vector columns into RDD

被刻印的时光 ゝ submitted on 2019-11-29 11:55:13
I have a DataFrame (called df1) in PySpark in which one of the columns is of type DenseVector. This is the schema of the dataframe:

    DataFrame[prediction: double, probability: vector, label: double]

I try to convert it into an RDD using the df1.rdd method. Then I execute count() on it, but I get the following error message:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/spark/python/pyspark/rdd.py", line 1006, in count
        return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
      File "/usr/lib/spark/python/pyspark/rdd.py", line 997, in sum
        return self
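The traceback is cut off, so the underlying exception is not visible here. One common workaround when a vector column gets in the way of RDD conversion is to flatten it to a plain array<double> column first; a sketch, assuming the schema shown above:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, DoubleType

    # Replace the ml DenseVector column with a plain list of doubles so the
    # resulting RDD only carries built-in Python types.
    to_array = F.udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))
    rdd = df1.withColumn("probability", to_array("probability")).rdd
    print(rdd.count())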

Spark SQL converting string to timestamp

妖精的绣舞 submitted on 2019-11-29 11:36:09
Question: I'm new to Spark SQL and am trying to convert a string to a timestamp in a Spark data frame. I have a string that looks like '2017-08-01T02:26:59.000Z' in a column called time_string. My code to convert this string to a timestamp is:

    CAST (time_string AS Timestamp)

But this gives me a timestamp of 2017-07-31 19:26:59. Why is it changing the time? Is there a way to do this without changing the time? Thanks for any help!

Answer 1: You could use the unix_timestamp function to convert the UTC-formatted date to
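The time is not wrong: the trailing Z marks the string as UTC, and the cast renders the result in the session's local time zone (here, 7 hours behind UTC). The answer above is cut off; a sketch of one way to parse the string and keep the displayed time in UTC, assuming Spark 2.2+:

    from pyspark.sql import functions as F

    # Keep timestamps displayed in UTC rather than the session's local zone,
    # then parse the ISO-8601 string with an explicit pattern.
    spark.conf.set("spark.sql.session.timeZone", "UTC")
    df = df.withColumn(
        "time_ts",
        F.to_timestamp("time_string", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
    )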

PySpark Dataframe from Python Dictionary without Pandas

房东的猫 submitted on 2019-11-29 11:24:32
I am trying to convert the following Python dict into a PySpark DataFrame, but I am not getting the expected output.

    dict_lst = {'letters': ['a', 'b', 'c'], 'numbers': [10, 20, 30]}
    df_dict = sc.parallelize([dict_lst]).toDF()  # Result not as expected
    df_dict.show()

Is there a way to do this without using Pandas?

Quoting myself: I find it's useful to think of the argument to createDataFrame() as a list of tuples where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column. So the easiest thing is to convert your dictionary into this format.
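A sketch following that advice, zipping the dict's value lists into row tuples and supplying the column names explicitly (the column order here is an assumption):

    # Each tuple becomes a row; each position in the tuple becomes a column.
    rows = list(zip(dict_lst['letters'], dict_lst['numbers']))
    df_dict = spark.createDataFrame(rows, schema=['letters', 'numbers'])
    df_dict.show()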