apache-spark

The <K> class in a groupByKey(…) has a Map among its members. The groupByKey operation fails on an “un-comparable” problem

北城以北 submitted on 2021-01-05 05:57:56
Question: I have a class Entreprise that has primitive data types and a Map of another class, Etablissement, which is made only of primitive data types.

    public class Entreprise implements Comparable<Entreprise> {
        /** Liste des établissements de l'entreprise. */
        private Map<String, Etablissement> etablissements = new HashMap<>();
        /** Sigle de l'entreprise */
        private String sigle;
        /** Nom de naissance */
        private String nomNaissance;
        /** Nom d'usage */
        private String nomUsage;
        ...
        @Override
        public int
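
A sketch of one common workaround, assuming the data is held in a Dataset<Entreprise>: group on a plain String key extracted from the bean instead of on the whole object, so the grouping key itself never needs to be comparable. The getter name getSigle() is only inferred from the field shown above.

    import org.apache.spark.api.java.function.MapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.KeyValueGroupedDataset;

    public class GroupEntreprisesBySimpleKey {
        // Group by a primitive key instead of the full Entreprise bean (with its Map member).
        public static KeyValueGroupedDataset<String, Entreprise> bySigle(Dataset<Entreprise> entreprises) {
            return entreprises.groupByKey(
                    (MapFunction<Entreprise, String>) Entreprise::getSigle, // hypothetical getter
                    Encoders.STRING());
        }
    }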

Why does listing leaf files and directories take so much time to start in PySpark?

十年热恋 submitted on 2021-01-04 07:07:44
Question: I have a Spark application which reads multiple S3 files and does certain transformations. This is how I am reading the files:

    input_df_s3_path = spark.read.csv("s3a://bucket1/s3_path.csv")
    s3_path_list = input_df_s3_path.select('_c0').rdd.map(lambda row : row[0]).collect()
    input_df = sqlContext.read.option("mergeSchema", "false").parquet(*s3_path_list).na.drop()

So I am creating a dataframe from a CSV which contains all the S3 paths, converting those paths into a list, and passing that list to the parquet read call.
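
A minimal sketch of the same read pattern in Spark's Java API (the bucket path is the one from the question, everything else is illustrative). If the slow start is the driver listing files, the spark.sql.sources.parallelPartitionDiscovery.* settings are also worth a look.

    import java.util.List;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadManyParquetPaths {
        public static Dataset<Row> read(SparkSession spark) {
            // Read the CSV that lists the S3 paths, one per row in column _c0.
            List<String> paths = spark.read()
                    .csv("s3a://bucket1/s3_path.csv")
                    .select("_c0")
                    .as(Encoders.STRING())
                    .collectAsList();

            // Pass every path to a single parquet() call and drop rows containing nulls.
            return spark.read()
                    .option("mergeSchema", "false")
                    .parquet(paths.toArray(new String[0]))
                    .na().drop();
        }
    }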

Spark no such field METASTORE_CLIENT_FACTORY_CLASS

瘦欲@ submitted on 2021-01-04 07:02:45
Question: I am trying to query a Hive table using Spark in Java. My Hive tables are in an EMR 5.12 cluster. The Spark version is 2.2.1 and Hive is 2.3.2. When I SSH into the machine and connect to the spark-shell, I am able to query the Hive tables with no issues. But when I try to query using a custom jar, I get the following exception:

    java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
        at org.apache.spark.sql.SparkSession$.org$apache$spark
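
Errors like this on EMR are often traced to Hive or Spark classes bundled into the application jar clashing with the cluster's own versions. A minimal sketch of the custom-jar side, assuming the spark-sql and spark-hive dependencies are declared provided at build time so that EMR's classes are used at runtime:

    import org.apache.spark.sql.SparkSession;

    public class HiveQueryJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("hive-query-job")
                    .enableHiveSupport()        // pick up the cluster's Hive metastore configuration
                    .getOrCreate();

            spark.sql("SHOW TABLES").show();    // placeholder query
            spark.stop();
        }
    }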

(Apache Beam) Cannot increase executor memory - it is fixed at 1024M despite using multiple settings

谁说我不能喝 submitted on 2021-01-04 05:39:28
Question: I am running an Apache Beam workload on Spark. I initialized the workers with 32 GB of memory (the slaves are run with -c 2 -m 32G). spark-submit sets driver memory to 30g and executor memory to 16g. However, the executors fail with java.lang.OutOfMemoryError: Java heap space. The master GUI indicates that memory per executor is 1024M. In addition, I see that all Java processes are launched with -Xmx 1024m. This means spark-submit doesn't propagate its executor settings to the executors. Pipeline
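
A hedged sketch of one possible explanation: executors derive their -Xmx from spark.executor.memory, which defaults to 1g, so if the driver builds its own SparkConf instead of inheriting the spark-submit settings, the value has to be set on that conf (or passed explicitly with --conf spark.executor.memory=16g).

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ExecutorMemoryConf {
        public static JavaSparkContext build() {
            SparkConf conf = new SparkConf()
                    .setAppName("beam-on-spark")
                    // Executors derive their -Xmx from this value; without it they fall back to 1g.
                    .set("spark.executor.memory", "16g");
            return new JavaSparkContext(conf);
        }
    }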

What tools to use to visualize logical and physical query plans?

≯℡__Kan透↙ submitted on 2021-01-04 05:38:20
Question: I am familiar with explain() (and the WebUI). I was curious whether there are any tools that generate an image of the tree structure of the logical/physical plan before/after optimizations, i.e. the information returned by explain(), as an image.

Answer 1: A picture like a PNG or JPG? I have never heard of one myself, but you can see the physical plan using the web UI (which you've already mentioned). The other phases of query execution are available using TreeNode methods which (among many methods that could help
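
As a starting point for feeding a plan into an external graphing tool, the text of each phase can be pulled programmatically through the public QueryExecution and TreeNode APIs; a small sketch with a made-up query:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DumpPlans {
        public static void dump(SparkSession spark) {
            Dataset<Row> df = spark.range(10).toDF("id").groupBy("id").count();

            df.explain(true);  // prints parsed, analyzed, optimized and physical plans

            // The same trees as strings, available per phase through QueryExecution.
            String optimized = df.queryExecution().optimizedPlan().treeString();
            String physical  = df.queryExecution().executedPlan().treeString();
            System.out.println(optimized);
            System.out.println(physical);
        }
    }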

How to concatenate multiple columns in PySpark with a separator?

。_饼干妹妹 submitted on 2021-01-04 05:32:46
Question: I have a PySpark DataFrame and I would like to join 3 columns.

    id | column_1 | column_2 | column_3
    ---|----------|----------|---------
     1 |    12    |    34    |    67
     2 |    45    |    78    |    90
     3 |    23    |    93    |    56

I want to join the 3 columns column_1, column_2 and column_3 into a single new column, adding "-" between their values. Expected result:

    id | column_1 | column_2 | column_3 | column_join
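
concat_ws is the usual answer here: it joins several columns with a separator and exists under the same name in pyspark.sql.functions. A minimal sketch in Spark's Java API, using the column names from the example:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.concat_ws;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    public class JoinColumns {
        public static Dataset<Row> withJoinedColumn(Dataset<Row> df) {
            // "12", "34", "67"  ->  "12-34-67"
            return df.withColumn("column_join",
                    concat_ws("-", col("column_1"), col("column_2"), col("column_3")));
        }
    }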
