apache-spark

AWS S3 : Spark - java.lang.IllegalArgumentException: URI is not absolute… while saving dataframe to s3 location as json

非 Y 不嫁゛ submitted on 2021-02-07 04:28:21
Question: I am getting a strange error while saving a dataframe to an AWS S3 location as JSON:

df.coalesce(1).write.mode(SaveMode.Overwrite)
  .json(s"s3://myawsacc/results/")

From spark-shell I was able to insert data into the same location, and it works:

spark.sparkContext.parallelize(1 to 4).toDF.write.mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .save(s"s3://myawsacc/results/")

My question is: why does it work in spark-shell but not via spark-submit? Is there any logic/properties…
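The excerpt breaks off before the configuration details, so the exact cause isn't shown. As a hedged sketch in PySpark (bucket name and credentials are placeholders, and the switch to the s3a:// scheme plus explicit S3A settings is an assumption, not the confirmed fix for this post):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-json-write-sketch").getOrCreate()

# Configure the hadoop-aws S3A connector explicitly (values are placeholders).
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hconf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

df = spark.range(4).toDF("value")

# Writing with the s3a:// scheme instead of s3:// often behaves more
# consistently outside of spark-shell / EMR, where s3:// is resolved for you.
df.coalesce(1).write.mode("overwrite").json("s3a://myawsacc/results/")
```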

Replace groupByKey with reduceByKey in Spark

大憨熊 submitted on 2021-02-07 04:28:19
Question: Hello, I often need to use groupByKey in my code, but I know it is a very heavy operation. Since I am working on improving performance, I was wondering whether my approach of removing all groupByKey calls is efficient. I used to create an RDD from another RDD, producing pairs of type (Int, Int):

rdd1 = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]

Since I needed to obtain something like this:

[(1, [2, 3]), (2, [3, 4]), (3, [5])]

what I used was out = rdd1.groupByKey, but since this approach might be…
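A minimal PySpark sketch of the two approaches; the per-key list output matches the example above. Using aggregateByKey here is one common replacement when the goal is a grouped list, though it is an assumption that it fits the poster's full use case:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupbykey-sketch").getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize([(1, 2), (1, 3), (2, 3), (2, 4), (3, 5)])

# groupByKey shuffles every individual value across the network before grouping.
grouped = rdd1.groupByKey().mapValues(list)

# aggregateByKey builds the per-key list on the map side first, so less data
# is shuffled whenever a key repeats within a partition.
aggregated = rdd1.aggregateByKey(
    [],                          # zero value: start with an empty list per key
    lambda acc, v: acc + [v],    # fold a value into the list within a partition
    lambda a, b: a + b           # merge partial lists across partitions
)

print(sorted(aggregated.collect()))  # [(1, [2, 3]), (2, [3, 4]), (3, [5])]
```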

Would S3 Select speed up Spark analyses on Parquet files?

久未见 submitted on 2021-02-07 03:45:38
Question: You can use S3 Select with Spark on Amazon EMR and with Databricks, but only for CSV and JSON files. I am guessing that S3 Select isn't offered for columnar file formats because it wouldn't help that much. Let's say we have a data lake of people with first_name, last_name and country columns. If the data is stored as CSV files and you run a query like peopleDF.select("first_name").distinct().count(), then S3 will transfer all the data for all the columns to the EC2 cluster to run the…
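The excerpt is cut off mid-sentence, but the contrast it is drawing can be sketched as follows (paths are hypothetical; the point is that Parquet already lets Spark read only the projected column, which is roughly what S3 Select would otherwise do for row-oriented formats):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("column-pruning-sketch").getOrCreate()

# With CSV, Spark must pull every column of every row from S3 before
# projecting down to first_name.
csv_people = spark.read.option("header", "true").csv("s3a://my-bucket/people_csv/")
csv_people.select("first_name").distinct().count()

# With Parquet, only the first_name column chunks are read from storage,
# so pushing the projection into S3 itself would save comparatively little.
parquet_people = spark.read.parquet("s3a://my-bucket/people_parquet/")
parquet_people.select("first_name").distinct().count()
```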

Apply custom function to cells of selected columns of a data frame in PySpark

和自甴很熟 submitted on 2021-02-07 03:32:39
Question: Let's say I have a data frame which looks like this:

+---+-----------+-----------+
| id|   address1|   address2|
+---+-----------+-----------+
|  1|address 1.1|address 1.2|
|  2|address 2.1|address 2.2|
+---+-----------+-----------+

I would like to apply a custom function directly to the strings in the address1 and address2 columns, for example:

def example(string1, string2):
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    intersection_count = len(set(name_1) & set(name_2))
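One common way to do this is to wrap the function in a UDF. A hedged sketch, assuming the truncated function is meant to return the word-overlap count (that return value is a guess, since the excerpt cuts off before it):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "address 1.1", "address 1.2"), (2, "address 2.1", "address 2.2")],
    ["id", "address1", "address2"],
)

# Hypothetical completion of the truncated function: count the words
# the two address strings have in common.
def example(string1, string2):
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    return len(set(name_1) & set(name_2))

# Register the Python function as a UDF so it runs on each row's two columns.
example_udf = udf(example, IntegerType())

df.withColumn("common_words", example_udf("address1", "address2")).show()
```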

Spark & Scala: saveAsTextFile() exception

橙三吉。 submitted on 2021-02-07 03:31:45
Question: I'm new to Spark & Scala and I got an exception after calling saveAsTextFile(). Hope someone can help… Here is my input.txt:

Hello World, I'm a programmer
Hello World, I'm a programmer

This is the output after running spark-shell in CMD:

C:\Users\Nhan Tran>spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DLap:4040
Spark context available as 'sc' (master = local[
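The excerpt ends before the exception itself, so the actual error is not shown. As a sketch of the read-then-save flow, written in PySpark for consistency with the other examples (paths are placeholders, and the winutils remark describes a common Windows cause, not a confirmed diagnosis of this post):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-text-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical path; input.txt holds the two "Hello World" lines from the post.
lines = sc.textFile("C:/Users/Nhan Tran/input.txt")

words = lines.flatMap(lambda line: line.split(" "))

# saveAsTextFile writes one part file per partition into a directory that must
# not already exist; on Windows it also needs winutils.exe under HADOOP_HOME,
# a frequent source of exceptions at exactly this call.
words.saveAsTextFile("C:/Users/Nhan Tran/output")
```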

Performance decrease for huge amount of columns. Pyspark

我们两清 submitted on 2021-02-06 20:18:54
Question: I ran into a problem processing a wide Spark dataframe (about 9000 columns, sometimes more). The task:

1. Create the wide DF via groupBy and pivot.
2. Transform the columns into a vector and feed it into KMeans from pyspark.ml.

So I built the wide frame, tried to create the vector with VectorAssembler, cached it, and trained KMeans on it. Assembling took about 11 minutes and KMeans took 2 minutes for 7 different cluster counts, on my PC in standalone mode, for a frame of ~500x9000. On the other hand, this processing…
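A minimal PySpark sketch of the assemble-then-cluster pipeline described above (the tiny dataframe and column names are stand-ins for the real ~500x9000 frame):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("wide-df-kmeans-sketch").getOrCreate()

# Small stand-in for the wide frame produced by groupBy + pivot.
df = spark.createDataFrame(
    [(1, 1.0, 2.0, 3.0), (2, 4.0, 5.0, 6.0), (3, 7.0, 8.0, 9.0)],
    ["id", "c1", "c2", "c3"],
)

feature_cols = [c for c in df.columns if c != "id"]

# VectorAssembler packs the (many) feature columns into one vector column;
# with thousands of columns this step dominates the runtime reported above.
assembled = (VectorAssembler(inputCols=feature_cols, outputCol="features")
             .transform(df)
             .select("id", "features")
             .cache())

kmeans = KMeans(k=2, seed=1, featuresCol="features")
model = kmeans.fit(assembled)
model.transform(assembled).show()
```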
