pyspark

What do the pyspark.sql.functions.window function's 'startTime' argument and window.start do?

倖福魔咒の posted on 2020-01-01 19:11:16
Question: The example is as follows:

    df = spark.createDataFrame([
        (1, "2017-05-15 23:12:26", 2.5),
        (1, "2017-05-09 15:26:58", 3.5),
        (1, "2017-05-18 15:26:58", 3.6),
        (2, "2017-05-15 15:24:25", 4.8),
        (3, "2017-05-25 15:14:12", 4.6)],
        ["index", "time", "val"]).orderBy("index", "time")
    df.collect()

    +-----+-------------------+---+
    |index|               time|val|
    +-----+-------------------+---+
    |    1|2017-05-09 15:26:58|3.5|
    |    1|2017-05-15 23:12:26|2.5|
    |    1|2017-05-18 15:26:58|3.6|
    |    2|2017-05-15 15:24:25|4.8|
    |    3|2017-05-25 15:14:12|4.6|
    +-----+-------------------+---+
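Not from the original post, but a minimal sketch of how startTime and window.start can be used on this DataFrame, assuming weekly tumbling windows are the goal; the string 'time' column is cast to timestamp first, and the '5 days' offset is only an illustrative value:

    from pyspark.sql import functions as F

    # startTime shifts window boundaries by an offset relative to
    # 1970-01-01 00:00:00 UTC; window.start / window.end are the fields of the
    # struct column that F.window produces.
    windowed = (df
        .withColumn("ts", F.col("time").cast("timestamp"))
        .groupBy("index", F.window("ts", "7 days", startTime="5 days"))
        .agg(F.avg("val").alias("avg_val"))
        .select("index",
                F.col("window.start").alias("window_start"),
                F.col("window.end").alias("window_end"),
                "avg_val"))
    windowed.show(truncate=False)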

Pyspark: How to transform JSON strings in a dataframe column

徘徊边缘 posted on 2020-01-01 17:49:18
Question: The following is more or less straight Python code which functionally extracts exactly what I want. The data schema for the column I'm filtering out within the dataframe is basically a JSON string. However, I had to greatly bump up the memory requirement for this, and I'm only running on a single node. Using a collect is probably bad, and creating all of this on a single node really isn't taking advantage of the distributed nature of Spark. I'd like a more Spark-centric solution. Can anyone help
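Not from the original post, but a sketch of one Spark-centric direction: parse the JSON column on the executors with from_json (Spark 2.1+) instead of collecting rows to the driver. The column name json_col and the two string fields below are placeholders, since the real schema isn't shown:

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    # Placeholder schema for the JSON payload; replace with the real structure.
    payload_schema = StructType([
        StructField("id", StringType()),
        StructField("value", StringType()),
    ])

    # Parse on the executors instead of pulling rows to the driver with collect().
    parsed = (df
        .withColumn("parsed", F.from_json(F.col("json_col"), payload_schema))
        .select("parsed.*"))
    parsed.show()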

How to store grouped data as JSON in pyspark

瘦欲@ posted on 2020-01-01 17:38:09
Question: I am new to pyspark. I have a dataset which looks like this (just a snapshot of a few columns). I want to group my data by key. My key is CONCAT(a.div_nbr, a.cust_nbr). My ultimate goal is to convert the data into JSON, formatted like this:

    k1[{v1,v2,....},{v1,v2,....}], k2[{v1,v2,....},{v1,v2,....}],....

e.g.

    248138339 [{ PRECIMA_ID:SCP 00248 0000138339, PROD_NBR:5553505, PROD_DESC:Shot and a Beer Battered Onion Rings (5553505 and 9285840) , PROD_BRND:Molly's Kitchen, PACK_SIZE:4/2.5 LB, QTY_UOM:CA } , {
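Not from the original post, but a sketch of one way to get a list of JSON objects per key with DataFrame functions; the column names are taken from the snippet above, df stands for the poster's dataset, and to_json/struct require Spark 2.1+:

    from pyspark.sql import functions as F

    # Build the concatenated key, turn each row into a JSON string, and
    # collect the JSON strings of each group into one array per key.
    grouped = (df
        .withColumn("key", F.concat("div_nbr", "cust_nbr"))
        .groupBy("key")
        .agg(F.collect_list(F.to_json(F.struct(
            "PRECIMA_ID", "PROD_NBR", "PROD_DESC",
            "PROD_BRND", "PACK_SIZE", "QTY_UOM"))).alias("values_json")))
    grouped.show(truncate=False)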

Load external libraries inside pyspark code

China☆狼群 posted on 2020-01-01 17:25:30
Question: I have a Spark cluster that I use in local mode. I want to read a CSV with the Databricks external library spark-csv. I start my app as follows:

    import os
    import sys
    os.environ["SPARK_HOME"] = "/home/mebuddy/Programs/spark-1.6.0-bin-hadoop2.6"
    spark_home = os.environ.get('SPARK_HOME', None)
    sys.path.insert(0, spark_home + "/python")
    sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
    from pyspark import SparkContext, SparkConf, SQLContext
    try:
        sc
    except NameError:
        print
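Not from the original post, but a sketch of one common way to pull the spark-csv package into a context started from plain Python: set PYSPARK_SUBMIT_ARGS before the SparkContext is created. The package coordinates (Scala 2.10, version 1.5.0) and the CSV path are assumptions for this Spark 1.6 setup:

    import os

    # Must be set before SparkContext is created so --packages takes effect.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages com.databricks:spark-csv_2.10:1.5.0 pyspark-shell")

    from pyspark import SparkContext, SQLContext

    sc = SparkContext("local[*]", "csv-example")
    sqlContext = SQLContext(sc)

    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .load("/path/to/file.csv"))   # placeholder path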

Pyspark: calculate a custom distance between all vectors in an RDD

限于喜欢 posted on 2020-01-01 16:45:32
Question: I have an RDD consisting of dense vectors which contain probability distributions, like below:

    [DenseVector([0.0806, 0.0751, 0.0786, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773]),
     DenseVector([0.2252, 0.0422, 0.0864, 0.0441, 0.0592, 0.0439, 0.0433, 0.071, 0.1644, 0.0405, 0.0581, 0.0528, 0.0691]),
     DenseVector([0.0806, 0.0751, 0.0786, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773]),
     DenseVector([0.0924, 0.0699, 0.083, 0.0706, 0.0766,
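Not from the original post, but a sketch of getting all pairwise distances with a custom metric via cartesian(). It assumes the RDD above is bound to a variable named rdd, and the symmetric KL-style divergence is only a stand-in for whatever distance is actually needed (it also assumes strictly positive probabilities):

    import numpy as np

    def custom_distance(u, v):
        # Symmetric KL-style divergence between two probability vectors.
        p, q = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        return float(np.sum((p - q) * np.log(p / q)))

    # Tag each vector with an index, then pair every vector with every other one.
    indexed = rdd.zipWithIndex().map(lambda iv: (iv[1], iv[0]))   # (id, vector)
    pairs = indexed.cartesian(indexed).filter(lambda ab: ab[0][0] < ab[1][0])
    distances = pairs.map(lambda ab: (ab[0][0], ab[1][0],
                                      custom_distance(ab[0][1], ab[1][1])))
    print(distances.take(5))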

How to connect Spark Streaming with Cassandra?

巧了我就是萌 posted on 2020-01-01 15:32:12
Question: I'm using Cassandra v2.1.12, Spark v1.4.1, and Scala 2.10, and Cassandra is listening on rpc_address 127.0.1.1, rpc_port 9160. For example, to connect Kafka and Spark Streaming, polling Kafka every 4 seconds, I have the following Spark job:

    sc = SparkContext(conf=conf)
    stream = StreamingContext(sc, 4)
    map1 = {'topic_name': 1}
    kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", map1)

And Spark Streaming keeps listening to the Kafka broker every 4 seconds and outputs the contents.
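Not from the original post, but a sketch of one way to push each micro-batch into Cassandra through the spark-cassandra-connector's DataFrame source. It assumes a connector version matching Spark 1.4 is on the classpath (e.g. via --packages), spark.cassandra.connection.host is set in the conf, and the keyspace/table/column names are placeholders:

    from pyspark.sql import SQLContext, Row

    def save_to_cassandra(time, rdd):
        if rdd.isEmpty():
            return
        sqlContext = SQLContext(rdd.context)
        # Kafka messages arrive as (key, value) pairs; shape them into rows.
        rows = rdd.map(lambda kv: Row(key=kv[0], value=kv[1]))
        df = sqlContext.createDataFrame(rows)
        (df.write
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="my_keyspace", table="my_table")
           .mode("append")
           .save())

    kafkaStream.foreachRDD(save_to_cassandra)
    stream.start()
    stream.awaitTermination()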

How to convert datetime from string format into datetime format in pyspark?

回眸只為那壹抹淺笑 posted on 2020-01-01 14:42:01
Question: I created a dataframe using sqlContext and I have a problem with the datetime format, as it is identified as a string.

    df2 = sqlContext.createDataFrame(i[1])
    df2.show()
    df2.printSchema()

Result:

    2016-07-05T17:42:55.238544+0900
    2016-07-05T17:17:38.842567+0900
    2016-06-16T19:54:09.546626+0900
    2016-07-05T17:27:29.227750+0900
    2016-07-05T18:44:12.319332+0900

    string (nullable = true)

Since the datetime schema is a string, I want to change it to datetime format as follows:

    df3 = df2.withColumn('_1', df2['
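Not from the original post, but a minimal sketch that assumes the running Spark version's string-to-timestamp cast understands these ISO-8601-style values; if the +0900 offset or the microsecond part is not parsed, an explicit format with unix_timestamp (or a UDF) would be needed instead:

    from pyspark.sql import functions as F

    # Cast the string column to a proper timestamp; '_1' is the column name
    # used in the question's own snippet.
    df3 = df2.withColumn("_1", F.col("_1").cast("timestamp"))
    df3.printSchema()
    df3.show(truncate=False)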

How can I build a CoordinateMatrix in Spark using a DataFrame?

给你一囗甜甜゛ posted on 2020-01-01 11:58:10
Question: I am trying to use the Spark implementation of the ALS algorithm for recommendation systems, so I built the DataFrame depicted below as training data:

    |--------------|--------------|--------------|
    |    userId    |    itemId    |    rating    |
    |--------------|--------------|--------------|

Now, I would like to create a sparse matrix to represent the interactions between every user and every item. The matrix will be sparse because if there is no interaction between a user and an item, the corresponding value
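Not from the original post, but a sketch of building the sparse matrix directly from the DataFrame's rows with MatrixEntry; it assumes userId and itemId are (or can serve as) integer indices and that df is the training DataFrame above:

    from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

    # Each row becomes one (i, j, value) entry; user/item pairs with no
    # interaction are simply absent, which is what makes the matrix sparse.
    entries = df.rdd.map(lambda row: MatrixEntry(row.userId, row.itemId, row.rating))
    mat = CoordinateMatrix(entries)
    print(mat.numRows(), mat.numCols())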