apache-spark

Ordering of rows in JavaRDDs after union

不羁的心 submitted on 2021-01-28 08:08:45
Question: I am trying to find any information on the ordering of rows in an RDD. Here is what I am trying to do: given Rdd1 and Rdd2, Rdd3 = Rdd1.union(Rdd2); In Rdd3, is there any guarantee that Rdd1's records will appear first and Rdd2's afterwards? In my tests I saw this behavior after union, but I wasn't able to find it in any docs. Just FYI, I really do not care about the ordering within each RDD itself (i.e. the order of Rdd2's or Rdd1's data is not a concern), but after the union Rdd1's records must come first.
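
A minimal Scala sketch (not from the question) showing why this holds in practice: union concatenates the two RDDs' partition lists without a shuffle, so the first RDD's partitions, and hence its records on collect(), come first. Note that the docs make no formal guarantee of this.

```scala
import org.apache.spark.sql.SparkSession

object UnionOrderCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("union-order").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    val rdd1 = sc.parallelize(Seq("a1", "a2", "a3"))
    val rdd2 = sc.parallelize(Seq("b1", "b2", "b3"))

    // union concatenates partition lists: rdd1's partitions precede rdd2's,
    // and collect() returns partitions in order, so rdd1's elements come first
    val rdd3 = rdd1.union(rdd2)
    println(rdd3.collect().mkString(", ")) // a1, a2, a3, b1, b2, b3

    spark.stop()
  }
}
```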

Spark HBase/BigTable - Wide/sparse dataframe persistence

不羁的心 submitted on 2021-01-28 08:03:36
Question: I want to persist a very wide Spark DataFrame (>100,000 columns) that is sparsely populated (>99% of values are null) to BigTable, while keeping only non-null values (to avoid storage cost). Is there a way to tell Spark to ignore nulls when writing? Thanks! Source: https://stackoverflow.com/questions/65647574/spark-hbase-bigtable-wide-sparse-dataframe-persistence
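
The excerpt carries no answer, but one hedged approach (a sketch, not a BigTable-specific API) is to melt the wide frame into (rowKey, column, value) triples and drop the nulls before writing, so only populated cells ever reach the sink. The rowKey parameter and output column names below are hypothetical:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: flatten each row into (rowKey, columnName, value) triples,
// keeping only non-null cells, so the ~99% null cells are never written.
def toSparseTriples(spark: SparkSession, df: DataFrame, rowKey: String): DataFrame = {
  import spark.implicits._
  val cols = df.columns
  val keyIdx = cols.indexOf(rowKey)
  df.rdd.flatMap { row =>
    val key = row.get(keyIdx).toString
    cols.indices.collect {
      case i if i != keyIdx && !row.isNullAt(i) =>
        (key, cols(i), row.get(i).toString)
    }
  }.toDF("rowKey", "column", "value")
}
```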

PySpark dataframe with a JSON column: aggregate the JSON elements into a new column and remove duplicates

血红的双手。 submitted on 2021-01-28 08:02:36
Question: I am trying to read a PySpark dataframe with a JSON column on Databricks. The dataframe:

year  month  json_col
2010  09     [{"p_id":"vfdvtbe"}, {"p_id":"cdscs"}, {"p_id":"usdvwq"}]
2010  09     [{"p_id":"ujhbe"}, {"p_id":"cdscs"}, {"p_id":"yjev"}]
2007  10     [{"p_id":"ukerge"}, {"p_id":"ikrtw"}, {"p_id":"ikwca"}]
2007  10     [{"p_id":"unvwq"}, {"p_id":"cqwcq"}, {"p_id":"ikwca"}]

I need a new dataframe in which all duplicated "p_id" values are removed, aggregated by year and month:

year  month  p_id (string)
2010  09     […
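
A hedged sketch of one way to do this (shown in Scala; from_json, explode, and collect_set exist in PySpark too), assuming df is the input frame and json_col holds the JSON array as a string:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Parse the JSON array, explode one row per element, then collect the
// distinct p_id values per (year, month); collect_set drops duplicates.
val itemSchema = ArrayType(StructType(Seq(StructField("p_id", StringType))))
val result = df
  .withColumn("item", explode(from_json(col("json_col"), itemSchema)))
  .groupBy("year", "month")
  .agg(collect_set(col("item.p_id")).as("p_id"))
```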

Apache Spark 2.0 (PySpark) - DataFrame error: Multiple sources found for csv

↘锁芯ラ submitted on 2021-01-28 08:01:10
Question: I am trying to create a dataframe using the following code in Spark 2.0. While executing the code in Jupyter/console, I am facing the error below. Can someone help me get rid of it? Error: Py4JJavaError: An error occurred while calling o34.csv. : java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name. at scala.sys.package$…
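
The error itself names the fix: two CSV data sources (the Spark 2.0 built-in and the old com.databricks:spark-csv package) are on the classpath, so either remove the spark-csv jar or name the source explicitly. A hedged Scala sketch of the latter (the same format() call works in PySpark; the file path is hypothetical):

```scala
// Disambiguate by giving the fully qualified class name of the built-in
// CSV source instead of the ambiguous short name "csv".
val df = spark.read
  .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
  .option("header", "true")
  .load("/path/to/file.csv")
```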

object databricks is not a member of package com

妖精的绣舞 submitted on 2021-01-28 07:52:45
Question: I am trying to use the Stanford NLP library with Spark2 in Zeppelin (HDP 2.6). Apparently there is a wrapper built by Databricks around the Stanford NLP library for Spark. Link: https://github.com/databricks/spark-corenlp I downloaded the jar for the above wrapper from here, and also downloaded the Stanford NLP jars from here. Then I added both sets of jars as dependencies in the Spark2 interpreter settings of Zeppelin and restarted the interpreter. Still, the sample program below gives the error…
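
For reference, a hedged sketch of what should compile once the wrapper jar is actually on the Spark2 interpreter's classpath, based on the spark-corenlp README (the question's compile error means the jar was not picked up; the example text is made up, and in Zeppelin the spark session is predefined):

```scala
import org.apache.spark.sql.functions._
// This import is exactly what fails with "object databricks is not a member
// of package com" when the wrapper jar is missing from the classpath.
import com.databricks.spark.corenlp.functions._

import spark.implicits._

// Split a text column into sentences with the wrapper's ssplit function.
val input = Seq((1, "Stanford University is located in California. It is great.")).toDF("id", "text")
val output = input.select(col("id"), ssplit(col("text")).as("sentences"))
output.show(truncate = false)
```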

How to undo ALTER TABLE … ADD PARTITION without deleting data

时光怂恿深爱的人放手 submitted on 2021-01-28 07:07:30
Question: Let's suppose I have two Hive tables, table_1 and table_2. I use: ALTER TABLE table_2 ADD PARTITION (col=val) LOCATION [table_1_location] Now table_2 has table_1's data in the partition where col = val. What I want to do is reverse this process: I want table_2 to no longer have the partition at col=val, and I want table_1 to keep its original data. How can I do this? Answer 1: Make your table EXTERNAL first: ALTER TABLE table_2 SET TBLPROPERTIES('EXTERNAL'='TRUE'); Then drop the partition; because the table is now external, dropping the partition removes only the Hive metadata and leaves the underlying files (table_1's data) in place.
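
A hedged sketch of the full undo sequence run through Spark SQL (table, column, and value are the question's placeholders):

```scala
// Make the table external so DROP PARTITION won't touch the data files.
spark.sql("ALTER TABLE table_2 SET TBLPROPERTIES('EXTERNAL'='TRUE')")

// Detach the partition; table_1's files stay where they are.
spark.sql("ALTER TABLE table_2 DROP IF EXISTS PARTITION (col='val')")

// Optionally flip the table back to managed afterwards.
spark.sql("ALTER TABLE table_2 SET TBLPROPERTIES('EXTERNAL'='FALSE')")
```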

Spark: create a nested schema

隐身守侯 submitted on 2021-01-28 06:50:41
Question: With Spark,

import spark.implicits._
val data = Seq(
  (1, ("value11", "value12")),
  (2, ("value21", "value22")),
  (3, ("value31", "value32"))
)
val df = data.toDF("id", "v1")
df.printSchema()

The result is the following:

root
 |-- id: integer (nullable = false)
 |-- v1: struct (nullable = true)
 |    |-- _1: string (nullable = true)
 |    |-- _2: string (nullable = true)

Now, if I want to create the schema myself, how should I proceed?

val schema = StructType(Array(
  StructField("id", IntegerType),…
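
A hedged completion of the question's truncated snippet: the nested column is declared as a StructType placed inside the outer StructField, matching the printed schema (field names _1/_2 as above):

```scala
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("id", IntegerType, nullable = false),
  StructField("v1", StructType(Array(
    StructField("_1", StringType),  // nested string field
    StructField("_2", StringType)
  )))
))
```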

Building hierarchy using Spark

非 Y 不嫁゛ submitted on 2021-01-28 06:48:36
Question: Imagine I've got a tree like this:

- One
  - One one
  - One two
    - One two one
    - One two two
    - One two three
      - One two three one
  - One three
    - One three one
    - One three two
    - One three three
  - One four
  - One five

Data-wise it's quite simple too, just a child-parent relationship:

+-------------------+---------------+
| Child             | Parent        |
+-------------------+---------------+
| One               |               |
| One one           | One           |
| One two           | One           |
| One two one       | One two       |
| One two two       | One two       |
| One two three     | One two       |
| One…
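
The excerpt cuts off before any answer; a hedged sketch of one common approach is an iterative self-join that walks each node up one level per pass until every path reaches a root (column names child/parent as in the table; the path separator is arbitrary):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch: resolve each node's full ancestor path by repeatedly joining the
// edge list onto itself until no row still has an unresolved parent.
def buildPaths(edges: DataFrame): DataFrame = {
  // edges: (child, parent); roots have a null or empty parent
  var paths = edges.select(col("child"), col("parent"), col("child").as("path"))
  var unresolved = paths.filter(col("parent").isNotNull && col("parent") =!= "").count()
  while (unresolved > 0) {
    // Each pass prepends the current parent to the path and moves one level up;
    // rows that already reached a root pass through unchanged (left join misses).
    paths = paths.alias("p")
      .join(edges.alias("e"), col("p.parent") === col("e.child"), "left")
      .select(
        col("p.child"),
        col("e.parent").as("parent"),
        concat_ws(" / ", col("e.child"), col("p.path")).as("path"))
    unresolved = paths.filter(col("parent").isNotNull && col("parent") =!= "").count()
  }
  paths.select("child", "path")
}
```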

Spark task runs on only one executor

本小妞迷上赌 submitted on 2021-01-28 06:01:32
Question: Hello everyone. First and foremost, I'm aware of the existence of this thread, "Task is running on only one executor in spark". However, that is not my case, as I'm using repartition(n) on my dataframe. Basically, I'm loading a DataFrame by fetching data from an Elasticsearch index through Spark as follows:

spark = SparkSession.builder \
    .appName("elastic") \
    .master("yarn") \
    .config('spark.submit.deployMode', 'client') \
    .config("spark.jars", pathElkJar) \
    .enableHiveSupport() \
    .getOrCreate()
es…
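
The excerpt ends mid-code, but a hedged sketch of the usual diagnostic (shown in Scala; host and index names are hypothetical): the elasticsearch-hadoop connector creates one Spark partition per index shard, so check the partition count before and after repartition. If the count after repartition is right but work still lands on one executor, the cause is resource allocation (e.g. spark.executor.instances) rather than partitioning.

```scala
// Read from Elasticsearch via the elasticsearch-hadoop connector.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "elastic-host")  // hypothetical host
  .option("es.resource", "my-index")   // hypothetical index
  .load()

println(df.rdd.getNumPartitions)       // often equals the index's shard count
val repartitioned = df.repartition(48) // forces a shuffle across executors
println(repartitioned.rdd.getNumPartitions)
```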

Combine the value part of a Tuple2, which is a map, into a single map, grouping by the key of the Tuple2

↘锁芯ラ submitted on 2021-01-28 05:45:13
Question: I am doing this in Scala and Spark. I have a Dataset of Tuple2, Dataset[(String, Map[String, String])]. Below is an example of the values in the Dataset:

(A, {1->100, 2->200, 3->100})
(B, {1->400, 4->300, 5->900})
(C, {6->100, 4->200, 5->100})
(B, {1->500, 9->300, 11->900})
(C, {7->100, 8->200, 5->800})

If you notice, the key (the first element of the Tuple2) can repeat. Also, the map belonging to the same Tuple2 key can have duplicate keys of its own (the second part of the Tuple2). I…
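
The excerpt truncates before the question finishes, but a hedged sketch of one way to merge the maps per key, assuming that on duplicate inner keys the later value may win (swap the fold for custom logic if another policy is needed):

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

def mergeMaps(ds: Dataset[(String, Map[String, String])])
             (spark: SparkSession): Dataset[(String, Map[String, String])] = {
  import spark.implicits._
  // Group tuples by their first element, then fold the maps together;
  // Map ++ keeps the right-hand value when an inner key repeats.
  ds.groupByKey(_._1).mapGroups { (key, rows) =>
    val merged = rows.foldLeft(Map.empty[String, String])((acc, t) => acc ++ t._2)
    (key, merged)
  }
}
```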