apache-spark-sql

Pyspark: how to add a numeric value to a date in yyyyMMdd format

别等时光非礼了梦想. Submitted on 2020-08-11 09:31:12
Question: I have 2 dataframes that look like the following. First, df1:

    TEST_schema = StructType([StructField("description", StringType(), True),
                              StructField("date", StringType(), True)])
    TEST_data = [('START',20200622),('END',20201018)]
    rdd3 = sc.parallelize(TEST_data)
    df1 = sqlContext.createDataFrame(TEST_data, TEST_schema)
    df1.show()
    +-----------+--------+
    |description|    date|
    +-----------+--------+
    |      START|20200701|
    |        END|20201003|
    +-----------+--------+

And second, df2: TEST_schema = StructType(
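
The excerpt cuts off before the exact requirement, but the title asks about adding a numeric value to a date stored as yyyyMMdd; a minimal PySpark sketch of that operation (the number of days and the new column names are illustrative, not taken from the question):

    from pyspark.sql import functions as F

    # parse the yyyyMMdd value into a DateType, add N days, and format it back
    df_with_new_date = (df1
        .withColumn("date_parsed", F.to_date(F.col("date").cast("string"), "yyyyMMdd"))
        .withColumn("date_plus_7", F.date_format(F.date_add("date_parsed", 7), "yyyyMMdd")))
    df_with_new_date.show()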

how to solve this use case: any way to use array, struct, explode and structured data?

二次信任. Submitted on 2020-08-10 19:44:24
Question: I am using spark-sql 2.4.1 with Java 8. I need to calculate percentiles such as the 25th, 75th and 90th for some given data. Given source dataset:

    val df = Seq(
      (10, "1/15/2018", 0.010680705, 10, 0.619875458, 0.010680705, "east"),
      (10, "1/15/2018", 0.006628853,  4, 0.16039063,  0.01378215,  "west"),
      (10, "1/15/2018", 0.01378215,  20, 0.082049528, 0.010680705, "east"),
      (10, "1/15/2018", 0.810680705,  6, 0.819875458, 0.702228853, "west"),
      (10, "1/15/2018", 0.702228853, 30, 0.916039063, 0.810680705, "east"),
      (11,
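
One common way to get such percentiles in Spark 2.4 is the built-in percentile_approx SQL aggregate; a minimal sketch of the idea, shown in PySpark rather than the question's Scala/Java, with hypothetical group and value column names:

    from pyspark.sql import functions as F

    # df is assumed to have a grouping column "region" and a numeric column "value"
    percentiles = (df.groupBy("region")
                     .agg(F.expr("percentile_approx(value, array(0.25, 0.75, 0.90))")
                           .alias("p25_p75_p90")))
    percentiles.show()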

otherwise clause not working as expected, what's wrong here?

两盒软妹~` Submitted on 2020-08-10 19:33:07
Question: I am using spark-sql 2.4.1. How do I do various joins depending on the value of a column? I need to get multiple lookup values of the map_val column for the given value columns, as shown below. Sample data:

    val data = List(
      ("20", "score", "school", "2018-03-31", 14, 12),
      ("21", "score", "school", "2018-03-31", 13, 13),
      ("22", "rate",  "school", "2018-03-31", 11, 14),
      ("21", "rate",  "school", "2018-03-31", 13, 12)
    )
    val df = data.toDF("id", "code", "entity", "date", "value1", "value2")
    df.show
    +---+-----+----
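
A common pitfall with when/otherwise is attaching otherwise to only the last when in a chain. A rough sketch of a correctly chained conditional lookup, written in PySpark rather than the question's Scala, with a made-up mapping that stands in for the poster's map_val logic:

    from pyspark.sql import functions as F

    # df is assumed to be the PySpark equivalent of the dataframe above
    result = df.withColumn(
        "map_val",
        F.when(F.col("code") == "score", F.col("value1") * 2)
         .when(F.col("code") == "rate", F.col("value2") * 10)
         .otherwise(F.lit(None)))
    result.show()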

Spark structured streaming - Filter Phoenix table by streaming dataset

青春壹個敷衍的年華 Submitted on 2020-08-10 19:04:06
Question: I am building a Spark structured streaming job that does the following. Streaming source:

    val small_df = spark.readStream
      .format("kafka")
      .load()
    small_df.createOrReplaceTempView("small_df")

A dataframe loaded from Phoenix:

    val phoenixDF = spark.read.format("org.apache.phoenix.spark")
      .option("table", "my_table")
      .option("zkUrl", "zk")
      .load()
    phoenixDF.createOrReplaceTempView("phoenix_tbl")

Then, a Spark SQL statement to join (on primary_key) with another small dataframe to filter records. val
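
One way to filter the Phoenix side by each micro-batch of the stream is to do the join inside foreachBatch, so the static table is read (or a cached copy reused) per batch. A rough PySpark sketch of that pattern (the question uses Scala; the Kafka options, join key and output path below are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    small_df = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "host:9092")  # placeholder brokers
                .option("subscribe", "my_topic")                 # placeholder topic
                .load())

    def join_with_phoenix(batch_df, batch_id):
        # load the Phoenix table and join it against this micro-batch on the primary key
        phoenix_df = (spark.read.format("org.apache.phoenix.spark")
                      .option("table", "my_table")
                      .option("zkUrl", "zk")
                      .load())
        joined = batch_df.join(phoenix_df, "primary_key")        # placeholder join key
        joined.write.mode("append").parquet("/tmp/joined_out")   # placeholder sink

    query = small_df.writeStream.foreachBatch(join_with_phoenix).start()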

Cross Join for calculation in Spark SQL

余生长醉 Submitted on 2020-08-10 18:56:29
Question: I have a temporary view with only 1 record/value, and I want to use that value to calculate the age of the customers present in another big table (with 100M rows). I used a CROSS JOIN clause, which is resulting in a performance issue. Is there a better approach to implement this requirement that will perform better? Would a broadcast hint be suitable in this scenario? What is the recommended approach to tackle such scenarios? Reference table (contains only 1 value): create temporary
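
When one side of a cross join is a single row, broadcasting that side is usually the right move, since it avoids shuffling the 100M-row table. A minimal PySpark sketch (the question is phrased in SQL/Scala; big_df, ref_df and the date columns here are hypothetical names):

    from pyspark.sql import functions as F

    # ref_df holds the single reference row (e.g. a snapshot date); big_df has ~100M rows
    result = (big_df
              .crossJoin(F.broadcast(ref_df))
              .withColumn("age", F.floor(
                  F.datediff(F.col("snapshot_date"), F.col("birth_date")) / 365.25)))
    result.show()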

All executors dead: MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster

。_饼干妹妹 Submitted on 2020-08-09 13:35:23
Question: I run into problems when calling Spark's MinHashLSH approxSimilarityJoin on a dataframe of (name_id, name) combinations. A summary of the problem I am trying to solve: I have a dataframe of around 30 million unique (name_id, name) combinations for company names. Some of those names refer to the same company but are either (i) misspelled and/or (ii) include additional names. Performing fuzzy string matching for every combination is not possible. To reduce the number of fuzzy string matching
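
For reference, a minimal PySpark sketch of the kind of pipeline involved here (tokenize names, hash tokens into sparse vectors, build MinHash signatures, then a self approxSimilarityJoin); the column names, feature sizes and the 0.4 distance threshold are illustrative, not the poster's actual settings:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import RegexTokenizer, HashingTF, MinHashLSH

    # names_df is assumed to have columns (name_id, name)
    pipeline = Pipeline(stages=[
        RegexTokenizer(inputCol="name", outputCol="tokens", pattern="\\W+", toLowercase=True),
        HashingTF(inputCol="tokens", outputCol="features", numFeatures=1 << 18),
        MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=3),
    ])
    model = pipeline.fit(names_df)
    featurized = model.transform(names_df)

    # self-join: keep candidate pairs whose Jaccard distance is below 0.4,
    # and drop self/mirror matches
    lsh_model = model.stages[-1]
    pairs = (lsh_model.approxSimilarityJoin(featurized, featurized, 0.4, distCol="jaccard_dist")
             .filter("datasetA.name_id < datasetB.name_id"))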

Pyspark: how to code a complicated dataframe calculation (lead sum)

本秂侑毒 Submitted on 2020-08-09 08:54:07
Question: I have a given dataframe that looks like this. The dataframe is sorted by date, and col1 is just some random value.

    TEST_schema = StructType([StructField("date", StringType(), True),
                              StructField("col1", IntegerType(), True)])
    TEST_data = [('2020-08-01',3),('2020-08-02',1),('2020-08-03',-1),('2020-08-04',-1),('2020-08-05',3),
                 ('2020-08-06',-1),('2020-08-07',6),('2020-08-08',4),('2020-08-09',5)]
    rdd3 = sc.parallelize(TEST_data)
    TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
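
The excerpt is cut off before the actual requirement, but the title suggests combining lead with a running sum over a date-ordered window; a generic PySpark sketch of those two window operations on TEST_df (the particular window and output columns below are illustrative, not the poster's exact calculation):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.orderBy("date")   # single-partition window; fine for a small example
    result = (TEST_df
              .withColumn("next_col1", F.lead("col1", 1).over(w))
              .withColumn("running_sum",
                          F.sum("col1").over(w.rowsBetween(Window.unboundedPreceding, 0))))
    result.show()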

Speed up InMemoryFileIndex for Spark SQL job with large number of input files

徘徊边缘 Submitted on 2020-08-07 07:47:46
Question: I have an Apache Spark SQL job (using Datasets), coded in Java, that gets its input from between 70,000 and 150,000 files. It appears to take anywhere from 45 minutes to 1.5 hours to build the InMemoryFileIndex. There are no logs, very low network usage, and almost no CPU usage during this time. Here's a sample of what I see in the standard output:

    24698 [main] INFO org.spark_project.jetty.server.handler.ContextHandler - Started o.s.j.s.ServletContextHandler@32ec9c90{/static/sql,null,AVAILABLE,
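
Two settings that commonly affect file-listing time are the parallel partition discovery configs, which control when and how widely Spark distributes the listing across the cluster. A small sketch (in PySpark rather than the question's Java, with illustrative values and a placeholder path/schema) of setting them and supplying an explicit schema so no extra pass over the files is needed for inference:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # distribute directory listing once more than this many paths are involved
             .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
             # number of tasks used for that distributed file listing
             .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")
             .getOrCreate())

    # my_schema and the input path are placeholders for the job's actual values
    df = spark.read.schema(my_schema).parquet("/path/to/input")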