apache-spark-sql

Pyspark: how to add a numeric value to a date in yyyyMMdd format

别等时光非礼了梦想. Submitted on 2020-08-11 09:31:12
Question: I have 2 dataframes that look like the following. First, df1:

    TEST_schema = StructType([StructField("description", StringType(), True),
                              StructField("date", StringType(), True)])
    TEST_data = [('START',20200622),('END',20201018)]
    rdd3 = sc.parallelize(TEST_data)
    df1 = sqlContext.createDataFrame(TEST_data, TEST_schema)
    df1.show()
    +-----------+--------+
    |description|    date|
    +-----------+--------+
    |      START|20200701|
    |        END|20201003|
    +-----------+--------+

And second, df2: TEST_schema = StructType(
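
The excerpt cuts off before the exact requirement, but the title asks about adding a numeric value to a date stored as yyyyMMdd; a minimal PySpark sketch of that operation (the number of days and the new column names are illustrative, not taken from the question):

    from pyspark.sql import functions as F

    # parse the yyyyMMdd value into a DateType, add N days, and format it back
    df_with_new_date = (df1
        .withColumn("date_parsed", F.to_date(F.col("date").cast("string"), "yyyyMMdd"))
        .withColumn("date_plus_7", F.date_format(F.date_add("date_parsed", 7), "yyyyMMdd")))
    df_with_new_date.show()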

how to solve this use case: any way to use array, struct, explode and structured data?

二次信任. Submitted on 2020-08-10 19:44:24
Question: I am using spark-sql 2.4.1 with Java 8. I need to calculate percentiles such as the 25th, 75th and 90th for some given data. Given source dataset:

    val df = Seq(
      (10, "1/15/2018", 0.010680705, 10, 0.619875458, 0.010680705, "east"),
      (10, "1/15/2018", 0.006628853,  4, 0.16039063,  0.01378215,  "west"),
      (10, "1/15/2018", 0.01378215,  20, 0.082049528, 0.010680705, "east"),
      (10, "1/15/2018", 0.810680705,  6, 0.819875458, 0.702228853, "west"),
      (10, "1/15/2018", 0.702228853, 30, 0.916039063, 0.810680705, "east"),
      (11,
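
One common way to get such percentiles in Spark 2.4 is the built-in percentile_approx SQL aggregate; a minimal sketch of the idea, shown in PySpark rather than the question's Scala/Java, with hypothetical group and value column names:

    from pyspark.sql import functions as F

    # df is assumed to have a grouping column "region" and a numeric column "value"
    percentiles = (df.groupBy("region")
                     .agg(F.expr("percentile_approx(value, array(0.25, 0.75, 0.90))")
                           .alias("p25_p75_p90")))
    percentiles.show()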

otherwise clause not working as expected, what's wrong here?

两盒软妹~` Submitted on 2020-08-10 19:33:07
Question: I am using spark-sql 2.4.1. How do I do various joins depending on the value of a column? I need to get multiple lookup values of the map_val column for the given value columns, as shown below. Sample data:

    val data = List(
      ("20", "score", "school", "2018-03-31", 14, 12),
      ("21", "score", "school", "2018-03-31", 13, 13),
      ("22", "rate",  "school", "2018-03-31", 11, 14),
      ("21", "rate",  "school", "2018-03-31", 13, 12)
    )
    val df = data.toDF("id", "code", "entity", "date", "value1", "value2")
    df.show
    +---+-----+----
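
A common pitfall with when/otherwise is attaching otherwise to only the last when in a chain. A rough sketch of a correctly chained conditional lookup, written in PySpark rather than the question's Scala, with a made-up mapping that stands in for the poster's map_val logic:

    from pyspark.sql import functions as F

    # df is assumed to be the PySpark equivalent of the dataframe above
    result = df.withColumn(
        "map_val",
        F.when(F.col("code") == "score", F.col("value1") * 2)
         .when(F.col("code") == "rate", F.col("value2") * 10)
         .otherwise(F.lit(None)))
    result.show()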

Spark structured streaming - Filter Phoenix table by streaming dataset

青春壹個敷衍的年華 Submitted on 2020-08-10 19:04:06
Question: I am building a Spark structured streaming job that does the following. Streaming source:

    val small_df = spark.readStream
      .format("kafka")
      .load()
    small_df.createOrReplaceTempView("small_df")

A dataframe loaded from Phoenix:

    val phoenixDF = spark.read.format("org.apache.phoenix.spark")
      .option("table", "my_table")
      .option("zkUrl", "zk")
      .load()
    phoenixDF.createOrReplaceTempView("phoenix_tbl")

Then, a Spark SQL statement to join (on primary_key) with another small dataframe to filter records. val
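
One way to filter the Phoenix side by each micro-batch of the stream is to do the join inside foreachBatch, so the static table is read (or a cached copy reused) per batch. A rough PySpark sketch of that pattern (the question uses Scala; the Kafka options, join key and output path below are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    small_df = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "host:9092")  # placeholder brokers
                .option("subscribe", "my_topic")                 # placeholder topic
                .load())

    def join_with_phoenix(batch_df, batch_id):
        # load the Phoenix table and join it against this micro-batch on the primary key
        phoenix_df = (spark.read.format("org.apache.phoenix.spark")
                      .option("table", "my_table")
                      .option("zkUrl", "zk")
                      .load())
        joined = batch_df.join(phoenix_df, "primary_key")        # placeholder join key
        joined.write.mode("append").parquet("/tmp/joined_out")   # placeholder sink

    query = small_df.writeStream.foreachBatch(join_with_phoenix).start()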

Cross Join for calculation in Spark SQL

余生长醉 Submitted on 2020-08-10 18:56:29
Question: I have a temporary view with only 1 record/value, and I want to use that value to calculate the age of the customers present in another big table (with 100M rows). I used a CROSS JOIN clause, which is resulting in a performance issue. Is there a better approach to implement this requirement that will perform better? Would a broadcast hint be suitable in this scenario? What is the recommended approach to tackle such scenarios? Reference table (contains only 1 value): create temporary
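
When one side of a cross join is a single row, broadcasting that side is usually the right move, since it avoids shuffling the 100M-row table. A minimal PySpark sketch (the question is phrased in SQL/Scala; big_df, ref_df and the date columns here are hypothetical names):

    from pyspark.sql import functions as F

    # ref_df holds the single reference row (e.g. a snapshot date); big_df has ~100M rows
    result = (big_df
              .crossJoin(F.broadcast(ref_df))
              .withColumn("age", F.floor(
                  F.datediff(F.col("snapshot_date"), F.col("birth_date")) / 365.25)))
    result.show()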

All executors dead: MinHash LSH PySpark approxSimilarityJoin self-join on EMR cluster

。_饼干妹妹 Submitted on 2020-08-09 13:35:23
Question: I run into problems when calling Spark's MinHashLSH approxSimilarityJoin on a dataframe of (name_id, name) combinations. A summary of the problem I am trying to solve: I have a dataframe of around 30 million unique (name_id, name) combinations for company names. Some of those names refer to the same company but are either (i) misspelled and/or (ii) include additional names. Performing fuzzy string matching for every combination is not possible. To reduce the number of fuzzy string matching
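
For reference, a minimal PySpark sketch of the kind of pipeline involved here (tokenize names, hash tokens into sparse vectors, build MinHash signatures, then a self approxSimilarityJoin); the column names, feature sizes and the 0.4 distance threshold are illustrative, not the poster's actual settings:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import RegexTokenizer, HashingTF, MinHashLSH

    # names_df is assumed to have columns (name_id, name)
    pipeline = Pipeline(stages=[
        RegexTokenizer(inputCol="name", outputCol="tokens", pattern="\\W+", toLowercase=True),
        HashingTF(inputCol="tokens", outputCol="features", numFeatures=1 << 18),
        MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=3),
    ])
    model = pipeline.fit(names_df)
    featurized = model.transform(names_df)

    # self-join: keep candidate pairs whose Jaccard distance is below 0.4,
    # and drop self/mirror matches
    lsh_model = model.stages[-1]
    pairs = (lsh_model.approxSimilarityJoin(featurized, featurized, 0.4, distCol="jaccard_dist")
             .filter("datasetA.name_id < datasetB.name_id"))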

Pyspark: how to code a complicated dataframe calculation (lead sum)

本秂侑毒 Submitted on 2020-08-09 08:54:07
Question: I have a given dataframe that looks like this. The dataframe is sorted by date, and col1 is just some random value.

    TEST_schema = StructType([StructField("date", StringType(), True),
                              StructField("col1", IntegerType(), True)])
    TEST_data = [('2020-08-01',3),('2020-08-02',1),('2020-08-03',-1),('2020-08-04',-1),('2020-08-05',3),
                 ('2020-08-06',-1),('2020-08-07',6),('2020-08-08',4),('2020-08-09',5)]
    rdd3 = sc.parallelize(TEST_data)
    TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
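
The excerpt is cut off before the actual requirement, but the title suggests combining lead with a running sum over a date-ordered window; a generic PySpark sketch of those two window operations on TEST_df (the particular window and output columns below are illustrative, not the poster's exact calculation):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.orderBy("date")   # single-partition window; fine for a small example
    result = (TEST_df
              .withColumn("next_col1", F.lead("col1", 1).over(w))
              .withColumn("running_sum",
                          F.sum("col1").over(w.rowsBetween(Window.unboundedPreceding, 0))))
    result.show()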

Speed up InMemoryFileIndex for Spark SQL job with large number of input files

徘徊边缘 Submitted on 2020-08-07 07:47:46
Question: I have an Apache Spark SQL job (using Datasets), coded in Java, that gets its input from between 70,000 and 150,000 files. It appears to take anywhere from 45 minutes to 1.5 hours to build the InMemoryFileIndex. There are no logs, very low network usage, and almost no CPU usage during this time. Here's a sample of what I see in the standard output:

    24698 [main] INFO org.spark_project.jetty.server.handler.ContextHandler - Started o.s.j.s.ServletContextHandler@32ec9c90{/static/sql,null,AVAILABLE,
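
Two settings that commonly affect file-listing time are the parallel partition discovery configs, which control when and how widely Spark distributes the listing across the cluster. A small sketch (in PySpark rather than the question's Java, with illustrative values and a placeholder path/schema) of setting them and supplying an explicit schema so no extra pass over the files is needed for inference:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             # distribute directory listing once more than this many paths are involved
             .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
             # number of tasks used for that distributed file listing
             .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")
             .getOrCreate())

    # my_schema and the input path are placeholders for the job's actual values
    df = spark.read.schema(my_schema).parquet("/path/to/input")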