pyspark

PySpark Numeric Window Group By

♀尐吖头ヾ submitted on 2020-01-20 08:20:06
Question: I'd like to be able to have Spark group by a step size, as opposed to just single values. Is there anything in Spark similar to PySpark 2.x's window function for numeric (non-date) values? Something along the lines of:

    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame([10, 11, 12, 13], "integer").toDF("foo")
    res = df.groupBy(window("foo", step=2, start=10)).count()

Answer 1: You can reuse the timestamp window and express its parameters in seconds. Tumbling: from pyspark.sql.functions import col,
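A minimal sketch of that answer's idea (assuming Spark 2.x; the column name foo and the step of 2 come from the question): cast the integer to a timestamp, which Spark interprets as seconds since the epoch, and bucket it with the ordinary tumbling window.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(10,), (11,), (12,), (13,)], ["foo"])

    # Reinterpret the integer as seconds since the epoch, then count per
    # 2-"second" tumbling window, which behaves like a step size of 2.
    res = (df
           .withColumn("ts", col("foo").cast("timestamp"))
           .groupBy(window("ts", "2 seconds"))
           .count())
    res.show(truncate=False)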

Pyspark Split Columns

痴心易碎 submitted on 2020-01-20 06:53:12
Question:

    from pyspark.sql import Row, functions as F
    row = Row("UK_1", "UK_2", "Date", "Cat", "Combined")
    agg = ''
    agg = 'Cat'
    tdf = (sc.parallelize([
        row(1, 1, '12/10/2016', 'A', 'Water^World'),
        row(1, 2, None, 'A', 'Sea^Born'),
        row(2, 1, '14/10/2016', 'B', 'Germ^Any'),
        row(3, 3, '!~2016/2/276', 'B', 'Fin^Land'),
        row(None, 1, '26/09/2016', 'A', 'South^Korea'),
        row(1, 1, '12/10/2016', 'A', 'North^America'),
        row(1, 2, None, 'A', 'South^America'),
        row(2, 1, '14/10/2016', 'B', 'New^Zealand'),
        row(None, None, '!~2016/2/276', 'B', 'South^Africa
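The question is cut off here, but going by the title the goal is to split a column. A hedged, self-contained sketch (the Left and Right column names are hypothetical) that splits the Combined column on the '^' delimiter:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    tdf = spark.createDataFrame(
        [(1, 1, '12/10/2016', 'A', 'Water^World'),
         (1, 2, None, 'A', 'Sea^Born')],
        ["UK_1", "UK_2", "Date", "Cat", "Combined"])

    # split() takes a regex, so the literal '^' has to be escaped.
    parts = F.split(tdf["Combined"], r"\^")
    tdf = (tdf.withColumn("Left", parts.getItem(0))
              .withColumn("Right", parts.getItem(1)))
    tdf.show(truncate=False)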

Sum operation on PySpark DataFrame giving TypeError when type is fine

眉间皱痕 submitted on 2020-01-20 05:53:05
Question: I have the following DataFrame in PySpark (this is the result of take(3); the DataFrame is very big):

    sc = SparkContext()
    df = [Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]

The same owner will have more rows. What I need to do is sum the values of the field a_d per owner after grouping, as

    b = df.groupBy('owner').agg(sum('a_d').alias('a_d_sum'))

but this throws the error TypeError: unsupported operand type(s) for +: 'int' and 'str'. However, the schema contains
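The answer is not included in this excerpt, but the quoted error matches a common cause (an assumption here): sum in that expression resolves to Python's built-in sum, which tries to add the characters of the string 'a_d' to 0. Importing the Spark SQL aggregate explicitly avoids the clash (df being the question's DataFrame):

    from pyspark.sql import functions as F

    # Spark's sum() aggregates the column; Python's built-in sum('a_d') would
    # raise exactly the TypeError quoted above.
    b = df.groupBy('owner').agg(F.sum('a_d').alias('a_d_sum'))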

[Pyspark] Turning one column into many: splitting a list in a row into multiple columns with explode; and turning many columns into one

只谈情不闲聊 submitted on 2020-01-20 01:05:23
[Pyspark] Turning one column into many: splitting a list in a row into multiple columns with explode.

Official examples: Python pyspark.sql.functions.explode() Examples https://www.programcreek.com/python/example/98237/pyspark.sql.functions.explode

To split on the contents of a field and generate multiple rows, use the explode method. E.g.:

    df.explode("c3","c3_"){time: String => time.split(" ")}.show(False)

(See https://blog.csdn.net/anshuai_aw1/article/details/87881079#4.4%C2%A0%E5%88%86%E5%89%B2%EF%BC%9A%E8%A1%8C%E8%BD%AC%E5%88%97)

E.g.:

    from pyspark.sql import Row
    eDF = sqlContext.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b"})])
    eDF.select(explode(eDF.intlist).alias("anInt")).collect()
    Out: [Row(anInt=1
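The output above is truncated; for reference, a self-contained sketch of the same call using the SparkSession API. explode() produces one output row per element of the array column:

    from pyspark.sql import Row, SparkSession
    from pyspark.sql.functions import explode

    spark = SparkSession.builder.getOrCreate()
    eDF = spark.createDataFrame([Row(a=1, intlist=[1, 2, 3], mapfield={"a": "b"})])

    # One output row per element of intlist.
    eDF.select(explode(eDF.intlist).alias("anInt")).collect()
    # -> [Row(anInt=1), Row(anInt=2), Row(anInt=3)]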

I have an issue with regex extract with multiple matches

独自空忆成欢 submitted on 2020-01-19 18:06:05
Question: I am trying to extract 60 ML and 0.5 ML from the string "60 ML of paracetomol and 0.5 ML of XYZ". This string is part of a column X in a Spark DataFrame. Though I am able to test my regex code to extract 60 ML and 0.5 ML in a regex validator, I am not able to extract them using regexp_extract, as it targets only the first match. Hence I am getting only 60 ML. Can you suggest the best way of doing it using a UDF? Answer 1: Here is how you can do it with a Python UDF: from pyspark.sql.types import * from
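The answer is cut off above; a hedged sketch in the same spirit (the UDF name extract_ml and the column name doses are hypothetical), using re.findall so every match is returned rather than only the first:

    import re
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    spark = SparkSession.builder.getOrCreate()

    # Return every "<number> ML" occurrence in the string as an array.
    extract_ml = udf(lambda s: re.findall(r"\d+(?:\.\d+)?\s*ML", s or ""),
                     ArrayType(StringType()))

    df = spark.createDataFrame([("60 ML of paracetomol and 0.5 ML of XYZ",)], ["X"])
    df.withColumn("doses", extract_ml("X")).show(truncate=False)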

Pyspark error: input doesn't have the expected number of values required by the schema, and extra trailing comma after columns

我只是一个虾纸丫 submitted on 2020-01-17 18:51:09
Question: First I made two tables (RDDs) using the following commands:

    rdd1 = sc.textFile('checkouts').map(lambda line: line.split(',')).map(lambda fields: ((fields[0], fields[3], fields[5]), 1))
    rdd2 = sc.textFile('inventory2').map(lambda line: line.split(',')).map(lambda fields: ((fields[0], fields[8], fields[10]), 1))

The keys in the first RDD are BibNum, ItemCollection and CheckoutDateTime. And when I checked the values of the first RDD using rdd1.take(2), it shows [((u'BibNum', u'ItemCollection', u'CheckoutDateTime'), 1
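The take(2) output shows the CSV header row ('BibNum', 'ItemCollection', 'CheckoutDateTime') being treated as data. A hedged sketch (assuming the files are CSV with a header line, and reusing the question's sc) that drops the header before building the key tuples:

    raw = sc.textFile('checkouts')
    header = raw.first()

    # Filter out the header line so column names don't appear as keys, then
    # split the remaining lines exactly as before.
    rdd1 = (raw.filter(lambda line: line != header)
               .map(lambda line: line.split(','))
               .map(lambda fields: ((fields[0], fields[3], fields[5]), 1)))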

adding hours to timestamp in pyspark dynamically

夙愿已清 submitted on 2020-01-17 18:37:06
Question:

    import pyspark.sql.functions as F
    from datetime import datetime

    data = [
        (1, datetime(2017, 3, 12, 3, 19, 58), 'Raising', 2),
        (2, datetime(2017, 3, 12, 3, 21, 30), 'sleeping', 1),
        (3, datetime(2017, 3, 12, 3, 29, 40), 'walking', 3),
        (4, datetime(2017, 3, 12, 3, 31, 23), 'talking', 5),
        (5, datetime(2017, 3, 12, 4, 19, 47), 'eating', 6),
        (6, datetime(2017, 3, 12, 4, 33, 51), 'working', 7),
    ]
    df.show()
    | id|       testing_time|test_name|shift|
    |  1|2017-03-12 03:19:58|  Raising|    2|
    |  2|2017-03-12 03:21:30|
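The question is cut off here; going by the title, the goal is to add a per-row number of hours (the shift column) to testing_time. A hedged sketch (the bumped_time column name is hypothetical): convert to epoch seconds, add shift hours, and cast back to a timestamp.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(data, ["id", "testing_time", "test_name", "shift"])

    # shift is in hours, so multiply by 3600 seconds before adding.
    df = df.withColumn(
        "bumped_time",
        (F.unix_timestamp("testing_time") + F.col("shift") * 3600).cast("timestamp"))
    df.show(truncate=False)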

How to implement EXISTS condition as like SQL in spark Dataframe

社会主义新天地 submitted on 2020-01-17 17:15:03
Question: I am curious to know how I can implement a SQL-like EXISTS clause the Spark DataFrame way. Answer 1: LEFT SEMI JOIN is equivalent to the EXISTS function in Spark.

    val cityDF = Seq(("Delhi","India"),("Kolkata","India"),("Mumbai","India"),("Nairobi","Kenya"),("Colombo","Srilanka")).toDF("City","Country")
    val CodeDF = Seq(("011","Delhi"),("022","Mumbai"),("033","Kolkata"),("044","Chennai")).toDF("Code","City")
    val finalDF = cityDF.join(CodeDF, cityDF("City") === CodeDF("City"), "left_semi")

Answer 2: If the
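A hedged PySpark rendering of that Scala answer, for consistency with the rest of this page (the snake_case names mirror the Scala ones): a left semi join keeps only the rows of city_df whose City has a match in code_df, which is what SQL EXISTS expresses.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    city_df = spark.createDataFrame(
        [("Delhi", "India"), ("Kolkata", "India"), ("Mumbai", "India"),
         ("Nairobi", "Kenya"), ("Colombo", "Srilanka")], ["City", "Country"])
    code_df = spark.createDataFrame(
        [("011", "Delhi"), ("022", "Mumbai"), ("033", "Kolkata"), ("044", "Chennai")],
        ["Code", "City"])

    # left_semi: keep city_df rows that have at least one matching City in code_df.
    final_df = city_df.join(code_df, city_df["City"] == code_df["City"], "left_semi")
    final_df.show()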