apache-spark

The <K> class in a groupByKey(…) has a Map among its members. The groupByKey operation fails on an “un-comparable” problem

北城以北 submitted on 2021-01-05 05:57:56
Question: I have a class Entreprise that has primitive data types and a Map of another class, Etablissement, which is made only of primitive data types.

    public class Entreprise implements Comparable<Entreprise> {
        /** Liste des établissements de l'entreprise. */
        private Map<String, Etablissement> etablissements = new HashMap<>();
        /** Sigle de l'entreprise */
        private String sigle;
        /** Nom de naissance */
        private String nomNaissance;
        /** Nom d'usage */
        private String nomUsage;
        ...
        @Override
        public int
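
A sketch of one common workaround, assuming the data is held in a Dataset<Entreprise>: group on a plain String key extracted from the bean instead of on the whole object, so the grouping key itself never needs to be comparable. The getter name getSigle() is only inferred from the field shown above.

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.KeyValueGroupedDataset;

    public class GroupEntreprisesBySimpleKey {
        // Group by a primitive key instead of the full Entreprise bean (with its Map member).
        public static KeyValueGroupedDataset<String, Entreprise> bySigle(Dataset<Entreprise> entreprises) {
            return entreprises.groupByKey(
                    (MapFunction<Entreprise, String>) Entreprise::getSigle, // hypothetical getter
                    Encoders.STRING());
        }
    }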

Why does listing leaf files and directories take so much time to start in PySpark?

十年热恋 submitted on 2021-01-04 07:07:44
Question: I have a Spark application which reads multiple S3 files and does certain transformations. This is how I am reading the files:

    input_df_s3_path = spark.read.csv("s3a://bucket1/s3_path.csv")
    s3_path_list = input_df_s3_path.select('_c0').rdd.map(lambda row : row[0]).collect()
    input_df = sqlContext.read.option("mergeSchema", "false").parquet(*s3_path_list).na.drop()

So I am creating a dataframe from a CSV which contains all the S3 paths, converting those paths into a list, and passing that list to the parquet read call.
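
A minimal sketch of the same read pattern in Spark's Java API (the bucket path is the one from the question, everything else is illustrative). If the slow start is the driver listing files, the spark.sql.sources.parallelPartitionDiscovery.* settings are also worth a look.

    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadManyParquetPaths {
        public static Dataset<Row> read(SparkSession spark) {
            // Read the CSV that lists the S3 paths, one per row in column _c0.
            List<String> paths = spark.read()
                    .csv("s3a://bucket1/s3_path.csv")
                    .select("_c0")
                    .as(Encoders.STRING())
                    .collectAsList();

            // Pass every path to a single parquet() call and drop rows containing nulls.
            return spark.read()
                    .option("mergeSchema", "false")
                    .parquet(paths.toArray(new String[0]))
                    .na().drop();
        }
    }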

Spark no such field METASTORE_CLIENT_FACTORY_CLASS

瘦欲@ submitted on 2021-01-04 07:02:45
Question: I am trying to query a Hive table using Spark in Java. My Hive tables are in an EMR 5.12 cluster. The Spark version is 2.2.1 and Hive is 2.3.2. When I SSH into the machine and connect to the spark-shell, I am able to query the Hive tables with no issues. But when I try to query using a custom jar, I get the following exception:

    java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
        at org.apache.spark.sql.SparkSession$.org$apache$spark
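
Errors like this on EMR are often traced to Hive or Spark classes bundled into the application jar clashing with the cluster's own versions. A minimal sketch of the custom-jar side, assuming the spark-sql and spark-hive dependencies are declared provided at build time so that EMR's classes are used at runtime:

    import org.apache.spark.sql.SparkSession;

    public class HiveQueryJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("hive-query-job")
                    .enableHiveSupport()        // pick up the cluster's Hive metastore configuration
                    .getOrCreate();

            spark.sql("SHOW TABLES").show();    // placeholder query
            spark.stop();
        }
    }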

(Apache Beam) Cannot increase executor memory - it is fixed at 1024M despite using multiple settings

谁说我不能喝 submitted on 2021-01-04 05:39:28
Question: I am running an Apache Beam workload on Spark. I initialized the workers with 32 GB of memory (the slaves are run with -c 2 -m 32G). spark-submit sets driver memory to 30g and executor memory to 16g. However, the executors fail with java.lang.OutOfMemoryError: Java heap space. The master GUI indicates that memory per executor is 1024M. In addition, I see that all Java processes are launched with -Xmx 1024m. This means spark-submit doesn't propagate its executor settings to the executors. Pipeline
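
A hedged sketch of one possible explanation: executors derive their -Xmx from spark.executor.memory, which defaults to 1g, so if the driver builds its own SparkConf instead of inheriting the spark-submit settings, the value has to be set on that conf (or passed explicitly with --conf spark.executor.memory=16g).

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ExecutorMemoryConf {
        public static JavaSparkContext build() {
            SparkConf conf = new SparkConf()
                    .setAppName("beam-on-spark")
                    // Executors derive their -Xmx from this value; without it they fall back to 1g.
                    .set("spark.executor.memory", "16g");
            return new JavaSparkContext(conf);
        }
    }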

What tools to use to visualize logical and physical query plans?

≯℡__Kan透↙ submitted on 2021-01-04 05:38:20
Question: I am familiar with explain() (and the WebUI). I was curious whether there are any tools that generate an image of the tree structure of the logical/physical plan before/after optimizations, i.e. the information returned by explain(), as an image.

Answer 1: A picture like a PNG or JPG? I have never heard of one myself, but you can see the physical plan using the web UI (which you've already mentioned). The other phases of query execution are available using TreeNode methods which (among many methods that could help
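
As a starting point for feeding a plan into an external graphing tool, the text of each phase can be pulled programmatically through the public QueryExecution and TreeNode APIs; a small sketch with a made-up query:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DumpPlans {
        public static void dump(SparkSession spark) {
            Dataset<Row> df = spark.range(10).toDF("id").groupBy("id").count();

            df.explain(true);  // prints parsed, analyzed, optimized and physical plans

            // The same trees as strings, available per phase through QueryExecution.
            String optimized = df.queryExecution().optimizedPlan().treeString();
            String physical  = df.queryExecution().executedPlan().treeString();
            System.out.println(optimized);
            System.out.println(physical);
        }
    }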

How to concatenate multiple columns in PySpark with a separator?

。_饼干妹妹 submitted on 2021-01-04 05:32:46
Question: I have a PySpark DataFrame and I would like to join 3 columns.

    id | column_1 | column_2 | column_3
    ---|----------|----------|---------
     1 |    12    |    34    |    67
     2 |    45    |    78    |    90
     3 |    23    |    93    |    56

I want to join the 3 columns column_1, column_2 and column_3 into a single new column, adding "-" between their values. Expected result:

    id | column_1 | column_2 | column_3 | column_join
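
concat_ws is the usual answer here: it joins several columns with a separator and exists under the same name in pyspark.sql.functions. A minimal sketch in Spark's Java API, using the column names from the example:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.concat_ws;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class JoinColumns {
        public static Dataset<Row> withJoinedColumn(Dataset<Row> df) {
            // "12", "34", "67"  ->  "12-34-67"
            return df.withColumn("column_join",
                    concat_ws("-", col("column_1"), col("column_2"), col("column_3")));
        }
    }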
