pyspark

PySpark Numeric Window Group By

♀尐吖头ヾ submitted on 2020-01-20 08:20:06
Question: I'd like to be able to have Spark group by a step size, as opposed to just single values. Is there anything in Spark similar to PySpark 2.x's window function for numeric (non-date) values? Something along the lines of:

    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame([10, 11, 12, 13], "integer").toDF("foo")
    res = df.groupBy(window("foo", step=2, start=10)).count()

Answer 1: You can reuse the timestamp window and express its parameters in seconds. Tumbling: from pyspark.sql.functions import col,
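A minimal sketch of that answer's idea (assuming Spark 2.x; the column name foo and the step of 2 come from the question): cast the integer to a timestamp, which Spark interprets as seconds since the epoch, and bucket it with the ordinary tumbling window.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(10,), (11,), (12,), (13,)], ["foo"])

    # Reinterpret the integer as seconds since the epoch, then count per
    # 2-"second" tumbling window, which behaves like a step size of 2.
    res = (df
           .withColumn("ts", col("foo").cast("timestamp"))
           .groupBy(window("ts", "2 seconds"))
           .count())
    res.show(truncate=False)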

Pyspark Split Columns

痴心易碎 submitted on 2020-01-20 06:53:12
Question:

    from pyspark.sql import Row, functions as F
    row = Row("UK_1", "UK_2", "Date", "Cat", "Combined")
    agg = ''
    agg = 'Cat'
    tdf = (sc.parallelize([
        row(1, 1, '12/10/2016', 'A', 'Water^World'),
        row(1, 2, None, 'A', 'Sea^Born'),
        row(2, 1, '14/10/2016', 'B', 'Germ^Any'),
        row(3, 3, '!~2016/2/276', 'B', 'Fin^Land'),
        row(None, 1, '26/09/2016', 'A', 'South^Korea'),
        row(1, 1, '12/10/2016', 'A', 'North^America'),
        row(1, 2, None, 'A', 'South^America'),
        row(2, 1, '14/10/2016', 'B', 'New^Zealand'),
        row(None, None, '!~2016/2/276', 'B', 'South^Africa
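The question is cut off here, but going by the title the goal is to split a column. A hedged, self-contained sketch (the Left and Right column names are hypothetical) that splits the Combined column on the '^' delimiter:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    tdf = spark.createDataFrame(
        [(1, 1, '12/10/2016', 'A', 'Water^World'),
         (1, 2, None, 'A', 'Sea^Born')],
        ["UK_1", "UK_2", "Date", "Cat", "Combined"])

    # split() takes a regex, so the literal '^' has to be escaped.
    parts = F.split(tdf["Combined"], r"\^")
    tdf = (tdf.withColumn("Left", parts.getItem(0))
              .withColumn("Right", parts.getItem(1)))
    tdf.show(truncate=False)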

Sum operation on PySpark DataFrame giving TypeError when type is fine

眉间皱痕 submitted on 2020-01-20 05:53:05
Question: I have the following DataFrame in PySpark (this is the result of take(3); the DataFrame is very big):

    sc = SparkContext()
    df = [Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]

The same owner will have more rows. What I need to do is sum the values of the field a_d per owner after grouping, as

    b = df.groupBy('owner').agg(sum('a_d').alias('a_d_sum'))

but this throws the error TypeError: unsupported operand type(s) for +: 'int' and 'str'. However, the schema contains
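The answer is not included in this excerpt, but the quoted error matches a common cause (an assumption here): sum in that expression resolves to Python's built-in sum, which tries to add the characters of the string 'a_d' to 0. Importing the Spark SQL aggregate explicitly avoids the clash (df being the question's DataFrame):

    from pyspark.sql import functions as F

    # Spark's sum() aggregates the column; Python's built-in sum('a_d') would
    # raise exactly the TypeError quoted above.
    b = df.groupBy('owner').agg(F.sum('a_d').alias('a_d_sum'))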

[Pyspark] Turning one column into many: splitting a list in a row into multiple columns with explode; and turning many columns into one

只谈情不闲聊 submitted on 2020-01-20 01:05:23
[Pyspark] Turning one column into many: splitting a list in a row into multiple columns with explode.

Official examples: Python pyspark.sql.functions.explode() Examples https://www.programcreek.com/python/example/98237/pyspark.sql.functions.explode

To split on the contents of a field and generate multiple rows, use the explode method. E.g.:

    df.explode("c3","c3_"){time: String => time.split(" ")}.show(False)

(See https://blog.csdn.net/anshuai_aw1/article/details/87881079#4.4%C2%A0%E5%88%86%E5%89%B2%EF%BC%9A%E8%A1%8C%E8%BD%AC%E5%88%97)

E.g.:

    from pyspark.sql import Row
    eDF = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b"})])
    eDF.select(explode(eDF.intlist).alias("anInt")).collect()
    Out: [Row(anInt=1
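The output above is truncated; for reference, a self-contained sketch of the same call using the SparkSession API. explode() produces one output row per element of the array column:

    from pyspark.sql import Row, SparkSession
    from pyspark.sql.functions import explode

    spark = SparkSession.builder.getOrCreate()
    eDF = spark.createDataFrame([Row(a=1, intlist=[1, 2, 3], mapfield={"a": "b"})])

    # One output row per element of intlist.
    eDF.select(explode(eDF.intlist).alias("anInt")).collect()
    # -> [Row(anInt=1), Row(anInt=2), Row(anInt=3)]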

I have an issue with regex extract with multiple matches

独自空忆成欢 submitted on 2020-01-19 18:06:05
Question: I am trying to extract 60 ML and 0.5 ML from the string "60 ML of paracetomol and 0.5 ML of XYZ". This string is part of a column X in a Spark DataFrame. Though I am able to test my regex code to extract 60 ML and 0.5 ML in a regex validator, I am not able to extract them using regexp_extract, as it targets only the first match. Hence I am getting only 60 ML. Can you suggest the best way of doing it using a UDF? Answer 1: Here is how you can do it with a Python UDF: from pyspark.sql.types import * from
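The answer is cut off above; a hedged sketch in the same spirit (the UDF name extract_ml and the column name doses are hypothetical), using re.findall so every match is returned rather than only the first:

    import re
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()

    # Return every "<number> ML" occurrence in the string as an array.
    extract_ml = udf(lambda s: re.findall(r"\d+(?:\.\d+)?\s*ML", s or ""),
                     ArrayType(StringType()))

    df = spark.createDataFrame([("60 ML of paracetomol and 0.5 ML of XYZ",)], ["X"])
    df.withColumn("doses", extract_ml("X")).show(truncate=False)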

Pyspark error: input doesn't have the expected number of values required by the schema, and extra trailing comma after columns

我只是一个虾纸丫 submitted on 2020-01-17 18:51:09
Question: First I made two tables (RDDs) using the following commands:

    rdd1 = sc.textFile('checkouts').map(lambda line: line.split(',')).map(lambda fields: ((fields[0], fields[3], fields[5]), 1))
    rdd2 = sc.textFile('inventory2').map(lambda line: line.split(',')).map(lambda fields: ((fields[0], fields[8], fields[10]), 1))

The keys in the first RDD are BibNum, ItemCollection and CheckoutDateTime. And when I checked the values of the first RDD using rdd1.take(2), it shows [((u'BibNum', u'ItemCollection', u'CheckoutDateTime'), 1
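The take(2) output shows the CSV header row ('BibNum', 'ItemCollection', 'CheckoutDateTime') being treated as data. A hedged sketch (assuming the files are CSV with a header line, and reusing the question's sc) that drops the header before building the key tuples:

    raw = sc.textFile('checkouts')
    header = raw.first()

    # Filter out the header line so column names don't appear as keys, then
    # split the remaining lines exactly as before.
    rdd1 = (raw.filter(lambda line: line != header)
               .map(lambda line: line.split(','))
               .map(lambda fields: ((fields[0], fields[3], fields[5]), 1)))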

adding hours to timestamp in pyspark dynamically

夙愿已清 submitted on 2020-01-17 18:37:06
Question:

    import pyspark.sql.functions as F
    from datetime import datetime

    data = [
        (1, datetime(2017, 3, 12, 3, 19, 58), 'Raising', 2),
        (2, datetime(2017, 3, 12, 3, 21, 30), 'sleeping', 1),
        (3, datetime(2017, 3, 12, 3, 29, 40), 'walking', 3),
        (4, datetime(2017, 3, 12, 3, 31, 23), 'talking', 5),
        (5, datetime(2017, 3, 12, 4, 19, 47), 'eating', 6),
        (6, datetime(2017, 3, 12, 4, 33, 51), 'working', 7),
    ]
    df.show()
    | id|       testing_time|test_name|shift|
    |  1|2017-03-12 03:19:58|  Raising|    2|
    |  2|2017-03-12 03:21:30|
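The question is cut off here; going by the title, the goal is to add a per-row number of hours (the shift column) to testing_time. A hedged sketch (the bumped_time column name is hypothetical): convert to epoch seconds, add shift hours, and cast back to a timestamp.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(data, ["id", "testing_time", "test_name", "shift"])

    # shift is in hours, so multiply by 3600 seconds before adding.
    df = df.withColumn(
        "bumped_time",
        (F.unix_timestamp("testing_time") + F.col("shift") * 3600).cast("timestamp"))
    df.show(truncate=False)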

How to implement EXISTS condition as like SQL in spark Dataframe

社会主义新天地 submitted on 2020-01-17 17:15:03
Question: I am curious to know how I can implement a SQL-like EXISTS clause the Spark DataFrame way. Answer 1: LEFT SEMI JOIN is equivalent to the EXISTS function in Spark.

    val cityDF = Seq(("Delhi","India"),("Kolkata","India"),("Mumbai","India"),("Nairobi","Kenya"),("Colombo","Srilanka")).toDF("City","Country")
    val CodeDF = Seq(("011","Delhi"),("022","Mumbai"),("033","Kolkata"),("044","Chennai")).toDF("Code","City")
    val finalDF = cityDF.join(CodeDF, cityDF("City") === CodeDF("City"), "left_semi")

Answer 2: If the
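A hedged PySpark rendering of that Scala answer, for consistency with the rest of this page (the snake_case names mirror the Scala ones): a left semi join keeps only the rows of city_df whose City has a match in code_df, which is what SQL EXISTS expresses.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    city_df = spark.createDataFrame(
        [("Delhi", "India"), ("Kolkata", "India"), ("Mumbai", "India"),
         ("Nairobi", "Kenya"), ("Colombo", "Srilanka")], ["City", "Country"])
    code_df = spark.createDataFrame(
        [("011", "Delhi"), ("022", "Mumbai"), ("033", "Kolkata"), ("044", "Chennai")],
        ["Code", "City"])

    # left_semi: keep city_df rows that have at least one matching City in code_df.
    final_df = city_df.join(code_df, city_df["City"] == code_df["City"], "left_semi")
    final_df.show()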