apache-spark-1.4

Cannot start spark-shell

老子叫甜甜 submitted on 2019-12-23 08:57:49
Question: I am using Spark 1.4.1. I can use spark-submit without a problem, but when I run ~/spark/bin/spark-shell I get the error below. I have configured SPARK_HOME and JAVA_HOME. However, it was OK with Spark 1.2.

    15/10/08 02:40:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Failed to initialize compiler: object scala.runtime in compiler mirror not found.
    ** Note that as of 2.8 scala does not assume use of the java

How to load history data when starting Spark Streaming process, and calculate running aggregations

家住魔仙堡 submitted on 2019-12-22 10:47:12
Question: I have some sales-related JSON data in my ElasticSearch cluster, and I would like to use Spark Streaming (with Spark 1.4.1) to dynamically aggregate incoming sales events from my eCommerce website via Kafka, so that I have a current view of each user's total sales (in terms of revenue and products). What is not really clear to me from the docs I have read is how to load the history data from ElasticSearch when the Spark application starts, and how to calculate, for example, the overall revenue per user (based on the history and the incoming sales from Kafka).
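
One commonly suggested pattern, sketched below as an illustration rather than the asker's code: load the per-user totals from ElasticSearch into an RDD once at startup, then use it as the initial state of updateStateByKey so every incoming Kafka event is folded into a running total. The index/type name, document fields, Kafka parameters, and the esRDD call from the elasticsearch-hadoop connector are all assumptions.

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.elasticsearch.spark._   // elasticsearch-hadoop connector (assumed dependency)

    object SalesAggregation {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("sales-aggregation")
          .set("es.nodes", "es-host:9200")          // assumed ES address
        val sc  = new SparkContext(conf)
        val ssc = new StreamingContext(sc, Seconds(10))
        ssc.checkpoint("/tmp/sales-checkpoint")     // required for stateful operations

        // History: (userId, totalRevenue) read once from ES at startup; index and field names are assumptions.
        val history = sc.esRDD("sales/order")
          .map { case (_, doc) => (doc("userId").toString, doc("revenue").toString.toDouble) }
          .reduceByKey(_ + _)

        // Incoming sales events from Kafka, parsed into (userId, revenue).
        val events = KafkaUtils
          .createStream(ssc, "zk-host:2181", "sales-group", Map("sales" -> 1))
          .map(_._2)
          .map(parseSale)

        // Fold new revenue into the running total, seeded with the ES history as the initial state.
        val updateTotals: (Seq[Double], Option[Double]) => Option[Double] =
          (newRevenues, total) => Some(newRevenues.sum + total.getOrElse(0.0))

        val totalsPerUser = events.updateStateByKey(
          updateTotals, new HashPartitioner(sc.defaultParallelism), history)

        totalsPerUser.print()
        ssc.start()
        ssc.awaitTermination()
      }

      // Stand-in parser: assumes a "userId,revenue" payload; real code would parse the JSON event.
      def parseSale(payload: String): (String, Double) = {
        val Array(user, revenue) = payload.split(",")
        (user, revenue.toDouble)
      }
    }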

Find size of data stored in rdd from a text file in apache spark

旧城冷巷雨未停 submitted on 2019-12-12 08:15:37
Question: I am new to Apache Spark (version 1.4.1). I wrote a small piece of code to read a text file and store its data in an RDD. Is there a way I can get the size of the data in the RDD? This is my code:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.util.SizeEstimator
    import org.apache.spark.sql.Row

    object RddSize {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "data size")
        val FILE_LOCATION = "src/main/resources/employees.csv"
        val
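
A sketch of two ways the size is commonly measured, reusing the file path and local master from the question (the rest is illustrative, not the asker's code). The question imports SizeEstimator, but whether that helper is usable from user code depends on the Spark version, so this sketch sticks to public APIs: summing the raw byte length of the lines, and caching the RDD and reading Spark's own storage info.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object RddSizeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("data size"))
        val lines = sc.textFile("src/main/resources/employees.csv")

        // 1) Raw size of the text data in bytes, computed on the executors.
        val rawBytes = lines.map(_.getBytes("UTF-8").length.toLong).reduce(_ + _)
        println(s"raw text size: $rawBytes bytes")

        // 2) Size of the RDD as cached by Spark: persist it, materialize it,
        //    then read the storage info that also backs the "Storage" tab of the UI.
        lines.persist(StorageLevel.MEMORY_ONLY)
        lines.count()
        sc.getRDDStorageInfo
          .find(_.id == lines.id)
          .foreach(info => println(s"cached size: ${info.memSize} bytes in memory"))

        sc.stop()
      }
    }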

Spark 1.4 image for Google Cloud?

只愿长相守 submitted on 2019-12-12 01:40:45
Question: With bdutil, the latest tarball version I can find is Spark 1.3.1: gs://spark-dist/spark-1.3.1-bin-hadoop2.6.tgz. There are a few new DataFrame features in Spark 1.4 that I want to use. Any chance a Spark 1.4 image will be made available for bdutil, or is there any workaround? UPDATE: Following the suggestion from Angus Davis, I downloaded and pointed to spark-1.4.1-bin-hadoop2.6.tgz, and the deployment went well; however, I ran into an error when calling SqlContext.parquetFile(). I cannot explain why this
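
For reference on the parquetFile() call mentioned in the update, a minimal Spark 1.4 Scala sketch (the gs:// path is a placeholder, and this does not explain the deployment error itself): parquetFile is the older call from the question, and read.parquet is the DataFrameReader form available in 1.4.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext

    // Older API referenced in the question:
    val dfOld = sqlContext.parquetFile("gs://my-bucket/path/to/parquet")

    // 1.4-style DataFrameReader equivalent:
    val df = sqlContext.read.parquet("gs://my-bucket/path/to/parquet")
    df.printSchema()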

Slow or incomplete saveAsParquetFile from EMR Spark to S3

若如初见. submitted on 2019-12-11 12:18:54
Question: I have a piece of code that creates a DataFrame and persists it to S3. The code below creates a DataFrame of 1000 rows and 100 columns, populated by math.Random. I'm running this on a cluster with 4 x r3.8xlarge worker nodes and configuring plenty of memory. I've tried with the maximum number of executors, and with one executor per node.

    // create some random data for performance and scalability testing
    val df = sqlContext.range(0,1000).map(x => Row.fromSeq((1 to 100).map(y => math.Random)))
    df
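
A self-contained sketch of the setup the question describes, for Spark 1.4 (the S3 paths are placeholders and the column names are made up). Note that math.random, lower-case, is the function that returns a Double, and that after map the rows need a schema again before they can be written as Parquet.

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

    val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext

    // 1000 rows x 100 columns of random doubles.
    val rows = sqlContext.range(0, 1000).map(_ => Row.fromSeq(Seq.fill(100)(math.random)))
    val schema = StructType((1 to 100).map(i => StructField(s"c$i", DoubleType)))
    val df = sqlContext.createDataFrame(rows, schema)

    // Call used in the question (1.3-style, still available in 1.4):
    df.saveAsParquetFile("s3n://my-bucket/perf-test/")

    // 1.4-style DataFrameWriter equivalent:
    df.write.parquet("s3n://my-bucket/perf-test-writer/")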

How to start a Spark Shell using pyspark in Windows?

倖福魔咒の submitted on 2019-12-02 00:37:27
Question: I am a beginner in Spark and am trying to follow the instructions from here on how to initialize the Spark shell from Python using cmd: http://spark.apache.org/docs/latest/quick-start.html But when I run the following in cmd:

    C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4\>c:\Python27\python bin\pyspark

I receive the following error message:

    File "bin\pyspark", line 21
    export SPARK_HOME="$(cd ="$(cd "`dirname "$0"`"/..; pwd)"
    SyntaxError: invalid syntax

What am I doing wrong here? P.S. When in cmd

How to handle null entries in SparkR

这一生的挚爱 submitted on 2019-11-29 13:33:06
I have a SparkSQL DataFrame. Some entries in this data are empty, but they don't behave like NULL or NA. How could I remove them? Any ideas? In R I can easily remove them, but in SparkR it says that there is a problem with the S4 system/methods. Thanks.

Answer: The SparkR Column class provides a long list of useful methods, including isNull and isNotNull:

    > people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
    > people <- createDataFrame(sqlContext, people_local)
    > head(people)
      Id Age
    1  1  21
    2  2  18
    3  3  30
    4  4  NA
    > filter(people, isNotNull(people$Age)) %>% head()
      Id Age
    1  1  21
    2  2  18
    3  3  30
    > filter(people, isNull
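
For comparison only (the question is about SparkR), the same null filter in the Scala DataFrame API looks like this; df and the Age column simply mirror the example data above.

    // df: a DataFrame with an Age column, as in the SparkR example above
    val withAge    = df.filter(df("Age").isNotNull)
    val missingAge = df.filter(df("Age").isNull)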

How to optimize shuffle spill in Apache Spark application

倖福魔咒の submitted on 2019-11-27 17:15:26
I am running a Spark Streaming application with 2 workers. The application has a join and a union operation. All the batches complete successfully, but I noticed that the shuffle spill metrics are not consistent with the input or output data size (the spilled memory is more than 20 times larger). Please find the Spark stage details in the image below. After researching this, I found that shuffle spill happens when there is not enough memory for the shuffle data:

    Shuffle spill (memory) - size of the deserialized form of the data in memory at the time of spilling
    Shuffle spill (disk) - size of the
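
One commonly suggested direction, shown here as an illustrative sketch rather than a confirmed fix: give the shuffle a larger share of executor memory and switch to Kryo so the shuffled records are smaller (the property names are the Spark 1.x ones; the values are placeholders to tune per workload).

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-join-union")
      // Fraction of executor heap used for shuffle aggregation buffers before spilling (1.x default: 0.2).
      .set("spark.shuffle.memoryFraction", "0.4")
      // Shrink the cache fraction correspondingly if little data is cached (1.x default: 0.6).
      .set("spark.storage.memoryFraction", "0.4")
      // Kryo usually serializes shuffle data much more compactly than Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")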