apache-spark

How to run Spark locally on Windows using Eclipse in Java

Submitted by 无人久伴 on 2021-01-28 03:21:34
Question: I'm trying to test MLlib's implementation of SVM. I want to run their Java example locally on Windows, using Eclipse. I've downloaded Spark 1.3.1 pre-built for Hadoop 2.6. When I try to run the example code, I get:

15/06/11 16:17:09 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

What should I change in order to be able to run the example code in this setup?

Answer 1: Create
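
The common remedy for this error (and presumably what the truncated answer goes on to describe) is to obtain winutils.exe, place it under a folder named bin, and point Hadoop at the parent folder before Spark starts. A minimal sketch in PySpark follows; the path C:\hadoop is an assumption, substitute wherever you placed winutils.exe. In the Java/Eclipse setup from the question, the equivalent is setting the hadoop.home.dir system property or the HADOOP_HOME environment variable before creating the SparkContext.

import os
from pyspark import SparkConf, SparkContext

# Assumed layout: C:\hadoop\bin\winutils.exe (adjust to your machine)
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] += os.pathsep + r"C:\hadoop\bin"

conf = SparkConf().setMaster("local[*]").setAppName("winutils-check")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # simple smoke test
sc.stop()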

Do Spark/Parquet partitions maintain ordering?

Submitted by 流过昼夜 on 2021-01-28 03:09:44
Question: If I partition a data set, will it be in the correct order when I read it back? For example, consider the following PySpark code:

# read a csv
df = sql_context.read.csv(input_filename)

# add a hash column
hash_udf = udf(lambda customer_id: hash(customer_id) % 4, IntegerType())
df = df.withColumn('hash', hash_udf(df['customer_id']))

# write out to parquet
df.write.parquet(output_path, partitionBy=['hash'])

# read back the file
df2 = sql_context.read.parquet(output_path)

I am partitioning on a
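
For context: Spark does not promise a stable global row order when reading back multiple Parquet part files, so the usual defensive pattern is to record the original order in a column before writing and sort on it after reading. A minimal sketch in the spirit of the question's code (column and path names reused from the question; row_id is a hypothetical column added purely for illustration):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("ordering-sketch").getOrCreate()

# Read the CSV and tag each row with its arrival order before doing anything else.
df = spark.read.csv("input_filename", header=True)
df = df.withColumn("row_id", F.monotonically_increasing_id())
df = df.withColumn("hash", (F.hash("customer_id") % 4).cast(IntegerType()))

# Write partitioned, then re-impose the order explicitly on read-back.
df.write.mode("overwrite").parquet("output_path", partitionBy=["hash"])
df2 = spark.read.parquet("output_path").orderBy("row_id")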

How to Split rows to different columns in Spark DataFrame/DataSet?

Submitted by 老子叫甜甜 on 2021-01-28 02:18:31
Question: Suppose I have a data set like:

Name | Subject | Y1   | Y2
A    | math    | 1998 | 2000
B    |         | 1996 | 1999
     | science | 2004 | 2005

I want to split the rows of this data set so that the Y2 column is eliminated, like:

Name | Subject | Y1
A    | math    | 1998
A    | math    | 1999
A    | math    | 2000
B    |         | 1996
B    |         | 1997
B    |         | 1998
B    |         | 1999
     | science | 2004
     | science | 2005

Can someone suggest something here? I hope I have made my query clear. Thanks in advance.

Answer 1: I think you only need to create a UDF to create the
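
One way to get the expected output, in the spirit of the truncated answer (a UDF that builds the year sequence, followed by an explode), is sketched below in PySpark; the question does not specify a language, so this is only an illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.appName("explode-years").getOrCreate()

data = [("A", "math", 1998, 2000), ("B", None, 1996, 1999), (None, "science", 2004, 2005)]
df = spark.createDataFrame(data, ["Name", "Subject", "Y1", "Y2"])

# UDF that expands the inclusive [Y1, Y2] range into a list of years.
year_range = F.udf(lambda y1, y2: list(range(y1, y2 + 1)), ArrayType(IntegerType()))

result = (df
          .withColumn("Y1", F.explode(year_range("Y1", "Y2")))
          .drop("Y2"))
result.show()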

Premature end of Content-Length delimited message body SparkException while reading from S3 using Pyspark

Submitted by 我的梦境 on 2021-01-28 01:42:06
Question: I am using the code below to read an S3 CSV file from my local machine.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
import configparser
import os

conf = SparkConf()
conf.set('spark.jars', '/usr/local/spark/jars/aws-java-sdk-1.7.4.jar,/usr/local/spark/jars/hadoop-aws-2.7.4.jar')
# Tried by setting this, but failed
conf.set('spark.executor.memory', '8g')
conf.set('spark.driver.memory', '8g')

spark_session = SparkSession.builder \
    .config(conf=conf) \
    .appName(
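
The quoted code is cut off before the session is built, so here is a standalone sketch of a typical local s3a read of this shape (a guess, not the original code; the bucket path and environment-variable credentials are placeholders). For the "Premature end of Content-Length" error it is also worth checking that the hadoop-aws and aws-java-sdk jars match the Hadoop version Spark was built against.

import os
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
conf.set("spark.jars", "/usr/local/spark/jars/aws-java-sdk-1.7.4.jar,"
                       "/usr/local/spark/jars/hadoop-aws-2.7.4.jar")

spark = (SparkSession.builder
         .config(conf=conf)
         .appName("s3-read-sketch")
         .getOrCreate())

# s3a filesystem credentials; values here are placeholders.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", os.environ.get("AWS_ACCESS_KEY_ID", ""))
hconf.set("fs.s3a.secret.key", os.environ.get("AWS_SECRET_ACCESS_KEY", ""))
hconf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

df = spark.read.csv("s3a://my-bucket/path/to/file.csv", header=True)
df.show(5)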

Spark: Task not serializable Exception in forEach loop in Java

Submitted by 倖福魔咒の on 2021-01-28 00:56:39
Question: I'm trying to iterate over a JavaPairRDD and perform some calculations with its keys and values, then output the result for each pair into a processedData list. What I have already tried: making the variables used inside the lambda static; making the methods called from the lambda's foreach loop static; adding implements Serializable. Here is my code:

List<String> processedData = new ArrayList<>();
JavaPairRDD<WebLabGroupObject, Iterable<WebLabPurchasesDataObject>> groupedByWebLabData
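
The usual root cause in this situation is that the lambda captures something driver-side that cannot be serialized, for example the enclosing non-serializable class or the processedData list that is mutated inside foreach. A common restructuring is to compute the per-key result with a transformation and bring it back with collect, so the closure captures nothing from the driver. The idea is sketched here in PySpark rather than Java purely for illustration (all names are hypothetical stand-ins for the question's objects):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-instead-of-foreach").getOrCreate()
sc = spark.sparkContext

# Hypothetical grouped pair RDD standing in for groupedByWebLabData.
grouped = sc.parallelize([("groupA", [1, 2, 3]), ("groupB", [4, 5])])

# Transform on the executors, then collect; no driver-side list is mutated inside the closure.
processed_data = (grouped
                  .map(lambda kv: "%s:%d" % (kv[0], sum(kv[1])))
                  .collect())
print(processed_data)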

Spark on Windows 10. 'Files\Spark\bin\..\jars""\' is not recognized as an internal or external command

Submitted by 泪湿孤枕 on 2021-01-28 00:51:00
Question: I am very frustrated by Spark. I wasted an evening thinking I was doing something wrong, but I have uninstalled and reinstalled several times, following multiple guides that all indicate a very similar path. At the cmd prompt, I am trying to run:

pyspark or spark-shell

The steps I followed include downloading a pre-built package from https://spark.apache.org/downloads.html, including Spark 2.0.2 with Hadoop 2.3 and Spark 2.1.0 with Hadoop 2.7. Neither works, and I get this error: 'Files\Spark\bin
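
The 'Files\Spark\bin...' fragment in the error usually means Spark was unpacked under a directory with a space in its name (typically C:\Program Files), which the Windows launcher scripts do not handle. Moving the installation to a space-free path and pointing SPARK_HOME at it generally resolves this. A rough sketch of verifying the setup from Python, assuming the third-party findspark helper and a C:\Spark install (both assumptions):

import os
import findspark  # third-party helper: pip install findspark

# Assumes Spark was re-extracted to a path with no spaces, e.g. C:\Spark (adjust as needed).
os.environ["SPARK_HOME"] = r"C:\Spark"
findspark.init()  # puts Spark's python\ directory and py4j on sys.path using SPARK_HOME

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("win10-check").getOrCreate()
print(spark.range(5).count())
spark.stop()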

Spark with json4s, parse function raise java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z

Submitted by 一个人想着一个人 on 2021-01-27 23:07:54
Question: I wrote a function to process a stream with Spark Streaming, and I encountered:

java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z

I have checked the Spark version (1.6.0) and Scala version (2.10.5); they are consistent with the json4s jar version (json4s-jackson_2.10-3.3.0.jar). I cannot figure out what happened. The function code follows:

import org.json4s._
import org.json4s.jackson.Serialization.{read => JsonRead}
import org.json4s.jackson.JsonMethods._

Find latest file pyspark

Submitted by 我们两清 on 2021-01-27 22:04:16
Question: So I've figured out how to find the latest file using Python. Now I'm wondering if I can find the latest file using PySpark. Currently I specify a path, but I'd like PySpark to get the latest modified file. The current code looks like this:

df = sc.read.csv("Path://to/file", header=True, inferSchema=True)

Thanks in advance for your help.

Answer 1: I copied the code to get the HDFS API to work with PySpark from this answer: Pyspark: get list of files/directories on HDFS path

URI = sc._gateway.jvm.java
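
For completeness, the approach the truncated answer is heading toward, using the Hadoop FileSystem API through the Py4J gateway to pick the most recently modified file, looks roughly like this (a sketch with a placeholder path, not the answer's exact code):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("latest-file").getOrCreate()
sc = spark.sparkContext

# Hadoop FileSystem API through the Py4J gateway (works for HDFS and local paths).
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
conf = sc._jsc.hadoopConfiguration()

statuses = FileSystem.get(conf).listStatus(Path("/path/to/files"))  # placeholder path
latest = max(statuses, key=lambda s: s.getModificationTime())

df = spark.read.csv(latest.getPath().toString(), header=True, inferSchema=True)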

How to write spark dataframe in a single file in local system without using coalesce

Submitted by ↘锁芯ラ on 2021-01-27 21:21:22
Question: I want to generate an Avro file from a PySpark DataFrame, and currently I am doing a coalesce as below:

df = df.coalesce(1)
df.write.format('avro').save('file:///mypath')

But this is now leading to memory issues, as all the data is fetched into memory before writing, and my data size is growing consistently every day. So I want to write the data partition by partition, so that it is written to disk in chunks and does not raise OOM issues. I found that toLocalIterator helps in achieving this.
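
Since the question already identifies toLocalIterator as the direction, one way to produce a single local Avro file without coalescing everything into one Spark partition is to stream rows through the driver and write them with a Python Avro library; toLocalIterator only materializes one partition at a time on the driver, which is what avoids the OOM that coalesce(1) causes. A rough sketch, assuming the fastavro package is available, df is the DataFrame from the question, and the two-field schema is purely illustrative:

from fastavro import writer, parse_schema  # assumption: fastavro is installed

# Hypothetical Avro schema matching the DataFrame's columns; adjust to your data.
schema = parse_schema({
    "name": "Record",
    "type": "record",
    "fields": [{"name": "id", "type": "long"},
               {"name": "value", "type": ["null", "string"]}],
})

# toLocalIterator() pulls one partition at a time to the driver instead of the whole dataset.
with open("/mypath/output.avro", "wb") as out:
    writer(out, schema, (row.asDict() for row in df.toLocalIterator()))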

Application processing time depending on the number of computing nodes

Submitted by ╄→гoц情女王★ on 2021-01-27 21:14:17
Question: Maybe this question is a little bit strange... but I'll try to ask it. I have a Spark application and I test it on different numbers of computing nodes (I vary the count from one to four). All nodes are identical: they have the same CPUs and the same amount of RAM. All application settings (such as the parallelism level and the partition count) are held constant. Here are the processing times depending on the number of computing nodes:

1 node -- 127 minutes
2 nodes -- 71 minutes
3 nodes --