pyspark

What do the pyspark.sql.functions.window function's 'startTime' argument and window.start do?

倖福魔咒の posted on 2020-01-01 19:11:16
Question: The example is as follows:

    df = spark.createDataFrame([
        (1, "2017-05-15 23:12:26", 2.5),
        (1, "2017-05-09 15:26:58", 3.5),
        (1, "2017-05-18 15:26:58", 3.6),
        (2, "2017-05-15 15:24:25", 4.8),
        (3, "2017-05-25 15:14:12", 4.6)],
        ["index", "time", "val"]).orderBy("index", "time")
    df.collect()

    +-----+-------------------+---+
    |index|               time|val|
    +-----+-------------------+---+
    |    1|2017-05-09 15:26:58|3.5|
    |    1|2017-05-15 23:12:26|2.5|
    |    1|2017-05-18 15:26:58|3.6|
    |    2|2017-05-15 15:24:25|4.8|
    |    3|2017-05-25 15:14:12|4.6|
    +-----+-------------------+---+
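Not from the original post, but a minimal sketch of how startTime and window.start can be used on this DataFrame, assuming weekly tumbling windows are the goal; the string 'time' column is cast to timestamp first, and the '5 days' offset is only an illustrative value:

    from pyspark.sql import functions as F

    # startTime shifts window boundaries by an offset relative to
    # 1970-01-01 00:00:00 UTC; window.start / window.end are the fields of the
    # struct column that F.window produces.
    windowed = (df
        .withColumn("ts", F.col("time").cast("timestamp"))
        .groupBy("index", F.window("ts", "7 days", startTime="5 days"))
        .agg(F.avg("val").alias("avg_val"))
        .select("index",
                F.col("window.start").alias("window_start"),
                F.col("window.end").alias("window_end"),
                "avg_val"))
    windowed.show(truncate=False)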

Pyspark: How to transform JSON strings in a dataframe column

徘徊边缘 posted on 2020-01-01 17:49:18
Question: The following is more or less straight Python code which functionally extracts exactly what I want. The data schema for the column I'm filtering out within the dataframe is basically a JSON string. However, I had to greatly bump up the memory requirement for this, and I'm only running on a single node. Using a collect is probably bad, and creating all of this on a single node really isn't taking advantage of the distributed nature of Spark. I'd like a more Spark-centric solution. Can anyone help
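Not from the original post, but a sketch of one Spark-centric direction: parse the JSON column on the executors with from_json (Spark 2.1+) instead of collecting rows to the driver. The column name json_col and the two string fields below are placeholders, since the real schema isn't shown:

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    # Placeholder schema for the JSON payload; replace with the real structure.
    payload_schema = StructType([
        StructField("id", StringType()),
        StructField("value", StringType()),
    ])

    # Parse on the executors instead of pulling rows to the driver with collect().
    parsed = (df
        .withColumn("parsed", F.from_json(F.col("json_col"), payload_schema))
        .select("parsed.*"))
    parsed.show()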

How to store grouped data as JSON in pyspark

瘦欲@ posted on 2020-01-01 17:38:09
Question: I am new to pyspark. I have a dataset which looks like this (just a snapshot of a few columns). I want to group my data by key. My key is CONCAT(a.div_nbr, a.cust_nbr). My ultimate goal is to convert the data into JSON, formatted like this:

    k1[{v1,v2,....},{v1,v2,....}], k2[{v1,v2,....},{v1,v2,....}],....

e.g.

    248138339 [{ PRECIMA_ID:SCP 00248 0000138339, PROD_NBR:5553505, PROD_DESC:Shot and a Beer Battered Onion Rings (5553505 and 9285840) , PROD_BRND:Molly's Kitchen, PACK_SIZE:4/2.5 LB, QTY_UOM:CA } , {
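Not from the original post, but a sketch of one way to get a list of JSON objects per key with DataFrame functions; the column names are taken from the snippet above, df stands for the poster's dataset, and to_json/struct require Spark 2.1+:

    from pyspark.sql import functions as F

    # Build the concatenated key, turn each row into a JSON string, and
    # collect the JSON strings of each group into one array per key.
    grouped = (df
        .withColumn("key", F.concat("div_nbr", "cust_nbr"))
        .groupBy("key")
        .agg(F.collect_list(F.to_json(F.struct(
            "PRECIMA_ID", "PROD_NBR", "PROD_DESC",
            "PROD_BRND", "PACK_SIZE", "QTY_UOM"))).alias("values_json")))
    grouped.show(truncate=False)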

Load external libraries inside pyspark code

China☆狼群 posted on 2020-01-01 17:25:30
Question: I have a Spark cluster that I use in local mode. I want to read a CSV with the Databricks external library spark-csv. I start my app as follows:

    import os
    import sys
    os.environ["SPARK_HOME"] = "/home/mebuddy/Programs/spark-1.6.0-bin-hadoop2.6"
    spark_home = os.environ.get('SPARK_HOME', None)
    sys.path.insert(0, spark_home + "/python")
    sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
    from pyspark import SparkContext, SparkConf, SQLContext
    try:
        sc
    except NameError:
        print
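Not from the original post, but a sketch of one common way to pull the spark-csv package into a context started from plain Python: set PYSPARK_SUBMIT_ARGS before the SparkContext is created. The package coordinates (Scala 2.10, version 1.5.0) and the CSV path are assumptions for this Spark 1.6 setup:

    import os

    # Must be set before SparkContext is created so --packages takes effect.
    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--packages com.databricks:spark-csv_2.10:1.5.0 pyspark-shell")

    from pyspark import SparkContext, SQLContext

    sc = SparkContext("local[*]", "csv-example")
    sqlContext = SQLContext(sc)

    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .load("/path/to/file.csv"))   # placeholder path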

Pyspark: calculate a custom distance between all vectors in an RDD

限于喜欢 posted on 2020-01-01 16:45:32
Question: I have an RDD consisting of dense vectors which contain probability distributions, like below:

    [DenseVector([0.0806, 0.0751, 0.0786, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773]),
     DenseVector([0.2252, 0.0422, 0.0864, 0.0441, 0.0592, 0.0439, 0.0433, 0.071, 0.1644, 0.0405, 0.0581, 0.0528, 0.0691]),
     DenseVector([0.0806, 0.0751, 0.0786, 0.0753, 0.077, 0.0753, 0.0753, 0.0777, 0.0801, 0.0748, 0.0768, 0.0764, 0.0773]),
     DenseVector([0.0924, 0.0699, 0.083, 0.0706, 0.0766,
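Not from the original post, but a sketch of getting all pairwise distances with a custom metric via cartesian(). It assumes the RDD above is bound to a variable named rdd, and the symmetric KL-style divergence is only a stand-in for whatever distance is actually needed (it also assumes strictly positive probabilities):

    import numpy as np

    def custom_distance(u, v):
        # Symmetric KL-style divergence between two probability vectors.
        p, q = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
        return float(np.sum((p - q) * np.log(p / q)))

    # Tag each vector with an index, then pair every vector with every other one.
    indexed = rdd.zipWithIndex().map(lambda iv: (iv[1], iv[0]))   # (id, vector)
    pairs = indexed.cartesian(indexed).filter(lambda ab: ab[0][0] < ab[1][0])
    distances = pairs.map(lambda ab: (ab[0][0], ab[1][0],
                                      custom_distance(ab[0][1], ab[1][1])))
    print(distances.take(5))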

How to connect Spark Streaming with Cassandra?

巧了我就是萌 posted on 2020-01-01 15:32:12
Question: I'm using Cassandra v2.1.12, Spark v1.4.1, and Scala 2.10, and Cassandra is listening on rpc_address 127.0.1.1, rpc_port 9160. For example, to connect Kafka and Spark Streaming, polling Kafka every 4 seconds, I have the following Spark job:

    sc = SparkContext(conf=conf)
    stream = StreamingContext(sc, 4)
    map1 = {'topic_name': 1}
    kafkaStream = KafkaUtils.createStream(stream, 'localhost:2181', "name", map1)

And Spark Streaming keeps listening to the Kafka broker every 4 seconds and outputs the contents.
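Not from the original post, but a sketch of one way to push each micro-batch into Cassandra through the spark-cassandra-connector's DataFrame source. It assumes a connector version matching Spark 1.4 is on the classpath (e.g. via --packages), spark.cassandra.connection.host is set in the conf, and the keyspace/table/column names are placeholders:

    from pyspark.sql import SQLContext, Row

    def save_to_cassandra(time, rdd):
        if rdd.isEmpty():
            return
        sqlContext = SQLContext(rdd.context)
        # Kafka messages arrive as (key, value) pairs; shape them into rows.
        rows = rdd.map(lambda kv: Row(key=kv[0], value=kv[1]))
        df = sqlContext.createDataFrame(rows)
        (df.write
           .format("org.apache.spark.sql.cassandra")
           .options(keyspace="my_keyspace", table="my_table")
           .mode("append")
           .save())

    kafkaStream.foreachRDD(save_to_cassandra)
    stream.start()
    stream.awaitTermination()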

How to convert datetime from string format into datetime format in pyspark?

回眸只為那壹抹淺笑 posted on 2020-01-01 14:42:01
Question: I created a dataframe using sqlContext and I have a problem with the datetime format, as it is identified as a string.

    df2 = sqlContext.createDataFrame(i[1])
    df2.show()
    df2.printSchema()

Result:

    2016-07-05T17:42:55.238544+0900
    2016-07-05T17:17:38.842567+0900
    2016-06-16T19:54:09.546626+0900
    2016-07-05T17:27:29.227750+0900
    2016-07-05T18:44:12.319332+0900

    string (nullable = true)

Since the datetime schema is a string, I want to change it to datetime format as follows:

    df3 = df2.withColumn('_1', df2['
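Not from the original post, but a minimal sketch that assumes the running Spark version's string-to-timestamp cast understands these ISO-8601-style values; if the +0900 offset or the microsecond part is not parsed, an explicit format with unix_timestamp (or a UDF) would be needed instead:

    from pyspark.sql import functions as F

    # Cast the string column to a proper timestamp; '_1' is the column name
    # used in the question's own snippet.
    df3 = df2.withColumn("_1", F.col("_1").cast("timestamp"))
    df3.printSchema()
    df3.show(truncate=False)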

How can I build a CoordinateMatrix in Spark using a DataFrame?

给你一囗甜甜゛ posted on 2020-01-01 11:58:10
Question: I am trying to use the Spark implementation of the ALS algorithm for recommendation systems, so I built the DataFrame depicted below as training data:

    |--------------|--------------|--------------|
    |    userId    |    itemId    |    rating    |
    |--------------|--------------|--------------|

Now, I would like to create a sparse matrix to represent the interactions between every user and every item. The matrix will be sparse because if there is no interaction between a user and an item, the corresponding value
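Not from the original post, but a sketch of building the sparse matrix directly from the DataFrame's rows with MatrixEntry; it assumes userId and itemId are (or can serve as) integer indices and that df is the training DataFrame above:

    from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

    # Each row becomes one (i, j, value) entry; user/item pairs with no
    # interaction are simply absent, which is what makes the matrix sparse.
    entries = df.rdd.map(lambda row: MatrixEntry(row.userId, row.itemId, row.rating))
    mat = CoordinateMatrix(entries)
    print(mat.numRows(), mat.numCols())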