pyspark

Exploding Array in Batches of size 'n'

Question: Looking to explode a nested array with Spark into batches. The column below is a nested array from an XML file. I am now attempting to write the time-series data out in batches in order to write it over to a NoSQL database. For example:

+----+---------------------+
| ID | Example             |
+----+---------------------+
| A  | [[1,2],[3,4],[5,6]] |
+----+---------------------+

Output with batches of size 2:

+----+---------------------+
| ID | Example             |
+----+---------------------+
| A  | [[1,2
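A minimal sketch of one way to produce such batches: slice the nested array into chunks of n with a Python UDF and then explode the result. The DataFrame below is rebuilt from the excerpt above, and the batching UDF is my own illustration rather than the asker's code.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A", [[1, 2], [3, 4], [5, 6]])], ["ID", "Example"])

n = 2  # batch size

# slice the outer array into consecutive chunks of length n
@F.udf(ArrayType(ArrayType(ArrayType(IntegerType()))))
def to_batches(arr):
    return [arr[i:i + n] for i in range(0, len(arr), n)]

# one output row per batch
df.withColumn("Example", F.explode(to_batches("Example"))).show(truncate=False)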

pyspark and reduceByKey: how to make a simple sum

Question: I am trying some code in Spark (pyspark) for an assignment. It is the first time I have used this environment, so I am surely missing something. I have a simple dataset called c_views. If I run c_views.collect() I get:

[…] (u'ABC', 100), (u'DEF', 200), (u'XXX', 50), (u'XXX', 70)] […]

What I need to achieve is the sum across all words, so my guess is that I should get something like:

(u'ABC', 100), (u'DEF', 200), (u'XXX', 120)

So what I am trying to do (following the hints in the assignment) is: first I define
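For reference, a minimal sketch of the reduceByKey sum the question is aiming at, with the RDD contents taken from the excerpt above:

from operator import add
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
c_views = sc.parallelize([(u'ABC', 100), (u'DEF', 200), (u'XXX', 50), (u'XXX', 70)])

# sum the values for each key
totals = c_views.reduceByKey(add)
print(totals.collect())  # e.g. [('ABC', 100), ('XXX', 120), ('DEF', 200)]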

Pyspark UDF column on Dataframe

Question: I'm trying to create a new column on a dataframe based on the values of some columns. It's returning null in all cases. Anyone know what's going wrong with this simple example?

df = pd.DataFrame([[0,1,0],[1,0,0],[1,1,1]], columns=['Foo','Bar','Baz'])
spark_df = spark.createDataFrame(df)

def get_profile():
    if 'Foo' == 1:
        return 'Foo'
    elif 'Bar' == 1:
        return 'Bar'
    elif 'Baz' == 1:
        return 'Baz'

spark_df = spark_df.withColumn('get_profile', lit(get_profile()))
spark_df.show()

Foo Bar Baz get
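For contrast, a sketch of how such a column is usually built, with a UDF that actually receives the column values. This is a generic fix added here for illustration, not the answer from the original thread; the point is that comparing the literal string 'Foo' to 1 is always False, which is why the original returns null.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame([[0, 1, 0], [1, 0, 0], [1, 1, 1]], columns=['Foo', 'Bar', 'Baz'])
spark_df = spark.createDataFrame(pdf)

# the UDF must take the row's values as arguments instead of comparing string literals
@udf(StringType())
def get_profile(foo, bar, baz):
    if foo == 1:
        return 'Foo'
    elif bar == 1:
        return 'Bar'
    elif baz == 1:
        return 'Baz'

spark_df.withColumn('get_profile',
                    get_profile(col('Foo'), col('Bar'), col('Baz'))).show()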

How to build Spark 1.2 with Maven (gives java.io.IOException: Cannot run program “javac”)?

Question: I am trying to build Spark 1.2 with Maven. My goal is to use PySpark with YARN on Hadoop 2.2, and I saw that this is only possible by building Spark with Maven. First, is this true? If it is, what is the problem in the log below, and how do I correct it?

C:\Spark\spark-1.2.0>mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
Picked up _JAVA_OPTIONS: -Djava.net.preferIPv4Stack=true
[INFO] Scanning for projects...
[INFO] ----------------------------------------------------
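One common cause of "Cannot run program javac" is that the build is picking up a JRE rather than a full JDK. The original log is truncated above, so as a hedged, illustrative check only:

import os
import shutil

# if JAVA_HOME points at a JRE, or javac is not on PATH, Maven cannot compile Spark
print("JAVA_HOME =", os.environ.get("JAVA_HOME", "<not set>"))
print("javac found at:", shutil.which("javac"))  # should not be None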

How to classify images using Spark and Caffe

Question: I am using Caffe to do image classification on Mac OS X with Python. Right now I know how to classify a list of images using Caffe in plain Python, but I want to make it faster, so I want to use Spark. Therefore, I tried to apply the image classification to each element of an RDD created from a list of image paths. However, Spark does not allow me to do so. Here is my code.

This is the code for image classification:

# display image name, class number, predicted label
def
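A pattern that often comes up for this kind of task is to load the model once per partition with mapPartitions instead of capturing the Caffe object in the driver. A sketch under that assumption, where load_classifier and classify_image are hypothetical stand-ins for the Caffe-specific calls:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
image_paths = sc.parallelize(["/data/img001.jpg", "/data/img002.jpg"])  # illustrative paths

def load_classifier():
    # hypothetical stub standing in for loading the Caffe model on the executor
    return object()

def classify_image(model, path):
    # hypothetical stub standing in for the actual forward pass
    return "predicted_label"

def classify_partition(paths):
    # build the (non-picklable) model once per partition, not once per image
    model = load_classifier()
    for path in paths:
        yield path, classify_image(model, path)

print(image_paths.mapPartitions(classify_partition).collect())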

Spark: how can I create a local dataframe in each executor

Question: In Spark Scala, is there a way to create a local dataframe in the executors, like a pandas dataframe in PySpark? In the mapPartitions method I want to convert the iterator into a local dataframe (like a pandas dataframe in Python) so that dataframe features can be used instead of hand-coding them on iterators.

Answer 1: That is not possible. A Dataframe is a distributed collection in Spark, and Dataframes can only be created on the driver node (i.e. outside of transformations/actions). Additionally, in Spark you cannot execute
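On the PySpark side, the usual workaround the question alludes to is to build a plain pandas DataFrame inside mapPartitions, since the pandas object lives entirely on the executor. A minimal sketch with illustrative column names:

import pandas as pd
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([(1, "a"), (2, "b"), (3, "c"), (4, "d")], 2)

def per_partition(rows):
    # materialise this partition locally as a pandas DataFrame on the executor
    pdf = pd.DataFrame(list(rows), columns=["id", "value"])
    pdf["id_doubled"] = pdf["id"] * 2
    # hand plain tuples back to Spark
    for row in pdf.itertuples(index=False):
        yield tuple(row)

print(rdd.mapPartitions(per_partition).collect())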

Reading Multiple S3 Folders / Paths Into PySpark

Question: I am conducting a big data analysis using PySpark. I am able to import all CSV files stored in a particular folder of a particular bucket using the following command:

df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('file:///home/path/datafolder/data2014/*.csv')

(where * acts like a wildcard)

The issues I have are the following: what if I want to do my analysis on 2014 and 2015 data, i.e. file 1 is .load('file:///home/path/SFweather
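Two common ways to read several folders in one go are to pass the reader a list of paths, or a single brace-glob path. A sketch using Spark's built-in CSV reader rather than the spark-csv package from the question, with illustrative paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# option 1: an explicit list of paths
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(["file:///home/path/datafolder/data2014/*.csv",
            "file:///home/path/datafolder/data2015/*.csv"]))

# option 2: a single Hadoop-style {a,b} glob covering both years
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("file:///home/path/datafolder/data{2014,2015}/*.csv"))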

Combine two rows in spark based on a condition in pyspark

Question: I have an input record in the following format: [Input data format]. I want the data to be transformed into the following format: [Output data format]. I want to combine my two rows based on the condition type. As far as I know I need to take the composite key of the three data fields and compare the type fields once they are equal. Can someone please help me with the implementation in Spark using Python?

EDIT: Following is my attempt using an RDD in pyspark:

record = spark.read.csv("wasb:///records.csv", header=True)
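Since the input/output screenshots are not reproduced in this excerpt, the exact shape of the data is unknown. Purely as an illustration of merging row pairs on a composite key, a groupBy/pivot sketch with made-up column names:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# made-up schema: three key fields, a 'type' discriminator, and a value to merge
record = spark.createDataFrame(
    [("k1", "k2", "k3", "start", 10),
     ("k1", "k2", "k3", "end", 25)],
    ["f1", "f2", "f3", "type", "value"])

# one row per composite key, with the two typed values placed side by side
combined = (record.groupBy("f1", "f2", "f3")
                  .pivot("type", ["start", "end"])
                  .agg(F.first("value")))
combined.show()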

Broadcast Annoy object in Spark (for nearest neighbors)?

Question: As Spark's mllib doesn't have nearest-neighbors functionality, I'm trying to use Annoy for approximate nearest neighbors. I try to broadcast the Annoy object and pass it to the workers; however, it does not operate as expected. Below is code for reproducibility (to be run in PySpark). The problem is highlighted in the difference seen when using Annoy with vs. without Spark.

from annoy import AnnoyIndex
import random

random.seed(42)

f = 40
t = AnnoyIndex(f)  # Length of item vector that will be
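A workaround that is often suggested for this situation is to save the built index to a file and re-open it inside each task, rather than broadcasting the index object itself. A sketch under that assumption; it presumes the saved file is visible to the executors (e.g. local mode or a shared filesystem), and the dimensions are illustrative:

import random
from annoy import AnnoyIndex
from pyspark import SparkContext

random.seed(42)
sc = SparkContext.getOrCreate()

f = 40
t = AnnoyIndex(f)
for i in range(1000):
    t.add_item(i, [random.gauss(0, 1) for _ in range(f)])
t.build(10)
t.save("/tmp/annoy_index.ann")  # ship the file, not the Python object

def nns_for(item_ids):
    # re-open the index on the executor; the file is memory-mapped, so loading is cheap
    idx = AnnoyIndex(f)
    idx.load("/tmp/annoy_index.ann")
    for i in item_ids:
        yield i, idx.get_nns_by_item(i, 5)

print(sc.parallelize(range(10)).mapPartitions(nns_for).collect())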
