join

Joining Spark DataFrames on a nearest key condition

安稳与你 submitted on 2020-05-24 20:39:02
Question: What's a performant way to do fuzzy joins in PySpark? I am looking for the community's views on a scalable approach to joining large Spark DataFrames on a nearest-key condition. Allow me to illustrate the problem with a representative example. Suppose we have the following Spark DataFrame containing events occurring at some point in time:
    ddf_event = spark.createDataFrame(
        data=[
            [1, 'A'],
            [5, 'A'],
            [10, 'B'],
            [15, 'A'],
            [20, 'B'],
            [25, 'B'],
            [30, 'A']
        ],
        schema=['ts_event', 'event']
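As a rough illustration of what a nearest-key join is expected to return, here is a minimal plain-Python sketch (not PySpark, and not a scalable solution). The event rows come from the question; the `readings` table and the `nearest_join` helper are hypothetical names introduced only for this example, since the excerpt is cut off before the second DataFrame is shown. Ties are resolved toward the smaller key, which is an assumption.

```python
from bisect import bisect_left

# Event rows taken from the question: (ts_event, event).
events = [(1, 'A'), (5, 'A'), (10, 'B'), (15, 'A'), (20, 'B'), (25, 'B'), (30, 'A')]

# Hypothetical right-hand table: (ts_reading, value). Not from the question.
readings = [(4, 'x'), (14, 'y'), (27, 'z')]


def nearest_join(left, right):
    """Pair each left row with the right row whose key is closest.

    Assumes `right` is non-empty. On a tie, the smaller right key wins.
    """
    right_sorted = sorted(right)
    keys = [key for key, _ in right_sorted]
    joined = []
    for ts, event in left:
        i = bisect_left(keys, ts)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = right_sorted[max(i - 1, 0):i + 1]
        best_key, best_value = min(candidates, key=lambda row: abs(row[0] - ts))
        joined.append((ts, event, best_key, best_value))
    return joined


result = nearest_join(events, readings)
```

Sorting one side and doing a binary search per row keeps the lookup logarithmic in the size of the right-hand table. How to distribute this across Spark partitions is the open part of the question and is not addressed by this sketch.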

Joining Spark DataFrames on a nearest key condition

我与影子孤独终老i submitted on 2020-05-24 20:38:29
Question: What's a performant way to do fuzzy joins in PySpark? I am looking for the community's views on a scalable approach to joining large Spark DataFrames on a nearest-key condition. Allow me to illustrate the problem with a representative example. Suppose we have the following Spark DataFrame containing events occurring at some point in time:
    ddf_event = spark.createDataFrame(
        data=[
            [1, 'A'],
            [5, 'A'],
            [10, 'B'],
            [15, 'A'],
            [20, 'B'],
            [25, 'B'],
            [30, 'A']
        ],
        schema=['ts_event', 'event']

Calculate average using Spark Scala

喜你入骨 submitted on 2020-05-24 04:54:04
Question: How do I calculate the average salary per location in Spark Scala with the two data sets below?
File1.csv (column 4 is salary):
    Ram, 30, Engineer, 40000
    Bala, 27, Doctor, 30000
    Hari, 33, Engineer, 50000
    Siva, 35, Doctor, 60000
File2.csv (column 2 is location):
    Hari, Bangalore
    Ram, Chennai
    Bala, Bangalore
    Siva, Chennai
The files above are not sorted. I need to join these two files and find the average salary per location. I tried the code below but could not get it to work.
    val salary = sc.textFile("File1.csv")
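To make the intended result concrete, the following is a minimal plain-Python sketch of the join-then-average logic (it is not Spark Scala, which is what the question asks for). The file contents are inlined as strings copied from the question; the helper `parse` and all variable names are assumptions made for this illustration.

```python
from collections import defaultdict

# Contents of the two files as shown in the question, inlined for the sketch.
file1 = """Ram, 30, Engineer, 40000
Bala, 27, Doctor, 30000
Hari, 33, Engineer, 50000
Siva, 35, Doctor, 60000"""

file2 = """Hari, Bangalore
Ram, Chennai
Bala, Bangalore
Siva, Chennai"""


def parse(text):
    """Split each non-empty line on commas and strip surrounding spaces."""
    return [[field.strip() for field in line.split(",")]
            for line in text.splitlines() if line.strip()]


# Key both data sets by name (column 1) so they can be joined.
salary_by_name = {row[0]: float(row[3]) for row in parse(file1)}
location_by_name = {row[0]: row[1] for row in parse(file2)}

# Inner join on name, then accumulate (sum, count) per location.
totals = defaultdict(lambda: [0.0, 0])
for name, salary in salary_by_name.items():
    location = location_by_name.get(name)
    if location is None:
        continue  # no matching row in File2.csv
    totals[location][0] += salary
    totals[location][1] += 1

avg_salary = {loc: total / count for loc, (total, count) in totals.items()}
```

The same shape (key both inputs by name, join, then aggregate sum and count per location) is what a Spark version would need to express; the unsorted input order does not matter because the join is by key.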

How to do join on multiple criteria, returning all combinations of both criteria

谁说胖子不能爱 submitted on 2020-05-22 12:07:05
Question: I am willing to bet that this has a really simple answer, as I am a noob to SQL. Table 1 has column 1 (criteria 1), column 2 (criteria 2), and column 3 (metric 1). Table 2 has column 1 (criteria 1), column 2 (criteria 2), and column 3 (metric 2, specific to table2.criteria2). There can be anywhere from 1 to 5 values of criteria 2 for each criteria 1 on the table. When I use the join statement here (assuming I identify table 1 as One prior to this): Select WeddingTable, TableSeat, TableSeatID, Name, Two.Meal

How to do join on multiple criteria, returning all combinations of both criteria

纵饮孤独 submitted on 2020-05-22 12:06:22
Question: I am willing to bet that this has a really simple answer, as I am a noob to SQL. Table 1 has column 1 (criteria 1), column 2 (criteria 2), and column 3 (metric 1). Table 2 has column 1 (criteria 1), column 2 (criteria 2), and column 3 (metric 2, specific to table2.criteria2). There can be anywhere from 1 to 5 values of criteria 2 for each criteria 1 on the table. When I use the join statement here (assuming I identify table 1 as One prior to this): Select WeddingTable, TableSeat, TableSeatID, Name, Two.Meal

SQL query to get a joined table

北城余情 submitted on 2020-05-17 08:19:46
Question: I have two tables that I need to join to get data that I can plot. Sample data for the two tables:
table1
    mon_pjt  month       planned_hours
    pjt1     01-10-2019  24
    pjt2     01-01-2020  67
    pjt3     01-02-2019  12
table2
    date        project  hrs_consumed
    07-12-2019  pjt1     7
    09-09-2019  pjt2     3
    12-10-2019  pjt1     4
    01-02-2019  pjt3     5
    11-10-2019  pjt1     4
In the sample output, the actual hours are the sum of column hrs_consumed in table2. The sample output is: project label planned_hours actual
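One possible reading of this request is a LEFT JOIN followed by a per-project SUM. The sketch below runs that query against an in-memory SQLite database through Python's `sqlite3` module, using the sample rows from the question. It assumes that all hrs_consumed rows for a project are summed regardless of month, and it omits the label column because the excerpt does not say how it is derived; the alias actual_hours is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (mon_pjt TEXT, month TEXT, planned_hours INTEGER);
CREATE TABLE table2 (date TEXT, project TEXT, hrs_consumed INTEGER);

INSERT INTO table1 VALUES
    ('pjt1', '01-10-2019', 24),
    ('pjt2', '01-01-2020', 67),
    ('pjt3', '01-02-2019', 12);

INSERT INTO table2 VALUES
    ('07-12-2019', 'pjt1', 7),
    ('09-09-2019', 'pjt2', 3),
    ('12-10-2019', 'pjt1', 4),
    ('01-02-2019', 'pjt3', 5),
    ('11-10-2019', 'pjt1', 4);
""")

# LEFT JOIN keeps projects with no consumed hours; COALESCE turns their NULL sum into 0.
rows = conn.execute("""
    SELECT t1.mon_pjt AS project,
           t1.planned_hours,
           COALESCE(SUM(t2.hrs_consumed), 0) AS actual_hours
    FROM table1 AS t1
    LEFT JOIN table2 AS t2 ON t2.project = t1.mon_pjt
    GROUP BY t1.mon_pjt, t1.planned_hours
    ORDER BY t1.mon_pjt
""").fetchall()

for project, planned, actual in rows:
    print(project, planned, actual)
```

If the actual hours should instead be restricted to the month stored in table1, the join condition would need an additional date comparison, which depends on details the excerpt does not show.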

how to return alternative columns on join

不问归期 submitted on 2020-05-17 06:58:25
Question: I have a table with a list of functionalities for my site. Say it has three columns:
    id_usr  url    landing_page
    1       a.php  a.html
    2       b.php  b.html
    3       c.php  c.html
    4       d.php  d.html
Then I have a table that lists, for each user, the functionalities he can display:
    id_usr  func
    1       1
    1       3
My query selects the functionalities that the user is allowed to see and returns their URLs. With the sample data it returns a.php and c.php, which is correct. The query is: SELECT titolo, descr1,descr2, url, url

how to return alternative columns on join

三世轮回 submitted on 2020-05-17 06:58:05
Question: I have a table with a list of functionalities for my site. Say it has three columns:
    id_usr  url    landing_page
    1       a.php  a.html
    2       b.php  b.html
    3       c.php  c.html
    4       d.php  d.html
Then I have a table that lists, for each user, the functionalities he can display:
    id_usr  func
    1       1
    1       3
My query selects the functionalities that the user is allowed to see and returns their URLs. With the sample data it returns a.php and c.php, which is correct. The query is: SELECT titolo, descr1,descr2, url, url

CASE statement options split across two output columns

狂风中的少年 submitted on 2020-05-17 06:14:05
Question: I have a table with a list of functionalities for my site. Say it has three columns:
    id_usr  url    landing_page
    1       a.php  a.html
    2       b.php  b.html
    3       c.php  c.html
    4       d.php  d.html
Then I have a table that lists, for each user, the functionalities he can display:
    id_usr  func
    1       1
    1       3
This query (from this question of mine)
    SELECT f.id, CASE WHEN id_user IS NOT NULL THEN url ELSE landing_page END
    FROM funzioni f
    LEFT JOIN funz_abilitate fa ON fa.id_funzione = f.id AND fa.id_user = $id
is returning
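The quoted query can be tried out on the sample data with Python's `sqlite3` module, as in the sketch below. The table and column names (funzioni, funz_abilitate, id, id_funzione, id_user) follow the query text, not the sample-data headers, so the exact schema here is an assumption. The `$id` placeholder is replaced with a bound `?` parameter, and the helper `visible_links` and the alias link are hypothetical names for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE funzioni (id INTEGER PRIMARY KEY, url TEXT, landing_page TEXT);
CREATE TABLE funz_abilitate (id_user INTEGER, id_funzione INTEGER);

INSERT INTO funzioni VALUES
    (1, 'a.php', 'a.html'),
    (2, 'b.php', 'b.html'),
    (3, 'c.php', 'c.html'),
    (4, 'd.php', 'd.html');

-- User 1 is enabled for functionalities 1 and 3.
INSERT INTO funz_abilitate VALUES (1, 1), (1, 3);
""")


def visible_links(conn, user_id):
    """Return (id, link) per functionality: url if enabled for the user, else landing_page."""
    query = """
        SELECT f.id,
               CASE WHEN fa.id_user IS NOT NULL THEN f.url
                    ELSE f.landing_page
               END AS link
        FROM funzioni AS f
        LEFT JOIN funz_abilitate AS fa
               ON fa.id_funzione = f.id AND fa.id_user = ?
        ORDER BY f.id
    """
    return conn.execute(query, (user_id,)).fetchall()


links = visible_links(conn, 1)
```

Keeping the user filter in the ON clause instead of the WHERE clause is what preserves the unmatched rows, so every functionality appears once with either its url or its landing_page.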

join columns separated by delimiter in same table

僤鯓⒐⒋嵵緔 submitted on 2020-05-17 05:55:06
Question: I have the following data set:
    color_code  fav_color_code  color_code_name  fav_color_name
    1|2         5               blue|white       black
    3|4         7|9             green|red        pink|yellow
I need to join the first value of color_code to the first value of color_code_name, the second value of color_code to the second value of color_code_name, and so on:
    code  color
    1     blue
    2     white
    5     black
    3     green
    4     red
    7     pink
    9     yellow
I am using the code below, but it does a cross join since I don't have an id to join on. This code works if I am mapping 2 columns but not multiple
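The positional pairing described here can be sketched in plain Python by splitting each delimited column on `|` and zipping the code list with the matching name list. The question's own code and target engine are not visible in the excerpt, so this only illustrates the pairing logic; the names `rows`, `column_pairs`, and `explode_pairs` are assumptions for this example.

```python
# Sample rows from the question, one dict per row.
rows = [
    {"color_code": "1|2", "fav_color_code": "5",
     "color_code_name": "blue|white", "fav_color_name": "black"},
    {"color_code": "3|4", "fav_color_code": "7|9",
     "color_code_name": "green|red", "fav_color_name": "pink|yellow"},
]

# Each code column is paired with the column holding its names.
column_pairs = [
    ("color_code", "color_code_name"),
    ("fav_color_code", "fav_color_name"),
]


def explode_pairs(rows, column_pairs, delimiter="|"):
    """Pair the i-th code with the i-th name within each row and column pair."""
    result = []
    for row in rows:
        for code_col, name_col in column_pairs:
            codes = row[code_col].split(delimiter)
            names = row[name_col].split(delimiter)
            # zip pairs values by position, which avoids the cross join.
            result.extend(zip(codes, names))
    return result


mapping = explode_pairs(rows, column_pairs)
for code, color in mapping:
    print(code, color)
```

Pairing by position within the same row is what replaces the missing join id. Note that `zip` silently drops trailing values if a code list and its name list have different lengths, so a length check may be worth adding for real data.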