apache-spark

AWS S3 : Spark - java.lang.IllegalArgumentException: URI is not absolute… while saving dataframe to s3 location as json

非 Y 不嫁゛ submitted on 2021-02-07 04:28:21
Question: I am getting a strange error while saving a dataframe to an AWS S3 location as JSON:

df.coalesce(1).write.mode(SaveMode.Overwrite)
  .json(s"s3://myawsacc/results/")

From spark-shell I was able to insert data into the same location, and it works:

spark.sparkContext.parallelize(1 to 4).toDF.write.mode(SaveMode.Overwrite)
  .format("com.databricks.spark.csv")
  .save(s"s3://myawsacc/results/")

My question is: why does it work in spark-shell but not via spark-submit? Is there any logic/properties…
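The excerpt breaks off before the configuration details, so the exact cause isn't shown. As a hedged sketch in PySpark (bucket name and credentials are placeholders, and the switch to the s3a:// scheme plus explicit S3A settings is an assumption, not the confirmed fix for this post):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-json-write-sketch").getOrCreate()

# Configure the hadoop-aws S3A connector explicitly (values are placeholders).
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hconf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

df = spark.range(4).toDF("value")

# Writing with the s3a:// scheme instead of s3:// often behaves more
# consistently outside of spark-shell / EMR, where s3:// is resolved for you.
df.coalesce(1).write.mode("overwrite").json("s3a://myawsacc/results/")
```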

Replace groupByKey with reduceByKey in Spark

大憨熊 submitted on 2021-02-07 04:28:19
Question: Hello, I often need to use groupByKey in my code, but I know it is a very heavy operation. Since I am working on improving performance, I was wondering whether my approach of removing all groupByKey calls is efficient. I used to create an RDD from another RDD, producing pairs of type (Int, Int):

rdd1 = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]

Since I needed to obtain something like this:

[(1, [2, 3]), (2, [3, 4]), (3, [5])]

what I used was out = rdd1.groupByKey, but since this approach might be…
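A minimal PySpark sketch of the two approaches; the per-key list output matches the example above. Using aggregateByKey here is one common replacement when the goal is a grouped list, though it is an assumption that it fits the poster's full use case:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupbykey-sketch").getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize([(1, 2), (1, 3), (2, 3), (2, 4), (3, 5)])

# groupByKey shuffles every individual value across the network before grouping.
grouped = rdd1.groupByKey().mapValues(list)

# aggregateByKey builds the per-key list on the map side first, so less data
# is shuffled whenever a key repeats within a partition.
aggregated = rdd1.aggregateByKey(
    [],                          # zero value: start with an empty list per key
    lambda acc, v: acc + [v],    # fold a value into the list within a partition
    lambda a, b: a + b           # merge partial lists across partitions
)

print(sorted(aggregated.collect()))  # [(1, [2, 3]), (2, [3, 4]), (3, [5])]
```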

Would S3 Select speed up Spark analyses on Parquet files?

久未见 submitted on 2021-02-07 03:45:38
Question: You can use S3 Select with Spark on Amazon EMR and with Databricks, but only for CSV and JSON files. I am guessing that S3 Select isn't offered for columnar file formats because it wouldn't help that much. Let's say we have a data lake of people with first_name, last_name and country columns. If the data is stored as CSV files and you run a query like peopleDF.select("first_name").distinct().count(), then S3 will transfer all the data for all the columns to the EC2 cluster to run the…
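The excerpt is cut off mid-sentence, but the contrast it is drawing can be sketched as follows (paths are hypothetical; the point is that Parquet already lets Spark read only the projected column, which is roughly what S3 Select would otherwise do for row-oriented formats):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("column-pruning-sketch").getOrCreate()

# With CSV, Spark must pull every column of every row from S3 before
# projecting down to first_name.
csv_people = spark.read.option("header", "true").csv("s3a://my-bucket/people_csv/")
csv_people.select("first_name").distinct().count()

# With Parquet, only the first_name column chunks are read from storage,
# so pushing the projection into S3 itself would save comparatively little.
parquet_people = spark.read.parquet("s3a://my-bucket/people_parquet/")
parquet_people.select("first_name").distinct().count()
```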

Apply custom function to cells of selected columns of a data frame in PySpark

和自甴很熟 submitted on 2021-02-07 03:32:39
Question: Let's say I have a data frame which looks like this:

+---+-----------+-----------+
| id|   address1|   address2|
+---+-----------+-----------+
|  1|address 1.1|address 1.2|
|  2|address 2.1|address 2.2|
+---+-----------+-----------+

I would like to apply a custom function directly to the strings in the address1 and address2 columns, for example:

def example(string1, string2):
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    intersection_count = len(set(name_1) & set(name_2))
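One common way to do this is to wrap the function in a UDF. A hedged sketch, assuming the truncated function is meant to return the word-overlap count (that return value is a guess, since the excerpt cuts off before it):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1, "address 1.1", "address 1.2"), (2, "address 2.1", "address 2.2")],
    ["id", "address1", "address2"],
)

# Hypothetical completion of the truncated function: count the words
# the two address strings have in common.
def example(string1, string2):
    name_1 = string1.lower().split(' ')
    name_2 = string2.lower().split(' ')
    return len(set(name_1) & set(name_2))

# Register the Python function as a UDF so it runs on each row's two columns.
example_udf = udf(example, IntegerType())

df.withColumn("common_words", example_udf("address1", "address2")).show()
```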

Spark & Scala: saveAsTextFile() exception

橙三吉。 submitted on 2021-02-07 03:31:45
Question: I'm new to Spark & Scala and I got an exception after calling saveAsTextFile(). Hope someone can help… Here is my input.txt:

Hello World, I'm a programmer
Hello World, I'm a programmer

This is the output after running spark-shell in CMD:

C:\Users\Nhan Tran>spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DLap:4040
Spark context available as 'sc' (master = local[
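The excerpt ends before the exception itself, so the actual error is not shown. As a sketch of the read-then-save flow, written in PySpark for consistency with the other examples (paths are placeholders, and the winutils remark describes a common Windows cause, not a confirmed diagnosis of this post):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-text-sketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical path; input.txt holds the two "Hello World" lines from the post.
lines = sc.textFile("C:/Users/Nhan Tran/input.txt")

words = lines.flatMap(lambda line: line.split(" "))

# saveAsTextFile writes one part file per partition into a directory that must
# not already exist; on Windows it also needs winutils.exe under HADOOP_HOME,
# a frequent source of exceptions at exactly this call.
words.saveAsTextFile("C:/Users/Nhan Tran/output")
```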

Performance decrease for huge amount of columns. Pyspark

我们两清 submitted on 2021-02-06 20:18:54
Question: I ran into a problem processing a wide Spark dataframe (about 9000 columns, sometimes more). The task:

1. Create the wide DF via groupBy and pivot.
2. Transform the columns into a vector and feed it into KMeans from pyspark.ml.

So I built the wide frame, tried to create the vector with VectorAssembler, cached it, and trained KMeans on it. Assembling took about 11 minutes and KMeans took 2 minutes for 7 different cluster counts, on my PC in standalone mode, for a frame of ~500x9000. On the other hand, this processing…
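A minimal PySpark sketch of the assemble-then-cluster pipeline described above (the tiny dataframe and column names are stand-ins for the real ~500x9000 frame):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("wide-df-kmeans-sketch").getOrCreate()

# Small stand-in for the wide frame produced by groupBy + pivot.
df = spark.createDataFrame(
    [(1, 1.0, 2.0, 3.0), (2, 4.0, 5.0, 6.0), (3, 7.0, 8.0, 9.0)],
    ["id", "c1", "c2", "c3"],
)

feature_cols = [c for c in df.columns if c != "id"]

# VectorAssembler packs the (many) feature columns into one vector column;
# with thousands of columns this step dominates the runtime reported above.
assembled = (VectorAssembler(inputCols=feature_cols, outputCol="features")
             .transform(df)
             .select("id", "features")
             .cache())

kmeans = KMeans(k=2, seed=1, featuresCol="features")
model = kmeans.fit(assembled)
model.transform(assembled).show()
```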
