apache-spark-sql

Scala/Spark - How to get first elements of all sub-arrays

Submitted by 旧街凉风 on 2021-01-05 09:10:04
Question: I have the following DataFrame in Spark (I'm using Scala): [[1003014, 0.95266926], [15, 0.9484202], [754, 0.94236785], [1029530, 0.880922], [3066, 0.7085166], [1066440, 0.69400793], [1045811, 0.663178], [1020059, 0.6274495], [1233982, 0.6112905], [1007801, 0.60937023], [1239278, 0.60044676], [1000088, 0.5789191], [1056268, 0.5747936], [1307569, 0.5676605], [10334513, 0.56592846], [930, 0.5446228], [1170206, 0.52525467], [300, 0.52473146], [2105178, 0.4972785], [1088572, 0.4815367]] I want …
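
The question is cut off here, so the following is only a sketch of one common approach: Spark's SQL higher-order function transform (Spark 2.4+) can pull the first element out of every sub-array. The column name recs and the array<array<double>> schema are assumptions (ids are written as doubles just so the nested array has a single element type); the question asks for Scala, but the identical expr(...) call works there too.

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
# Hypothetical input shaped like the excerpt above, shortened to three sub-arrays.
df = spark.createDataFrame(
    [([[1003014.0, 0.95266926], [15.0, 0.9484202], [754.0, 0.94236785]],)],
    ["recs"],
)
# transform applies x -> x[0] to each sub-array, keeping only its first element.
df.select(expr("transform(recs, x -> x[0])").alias("first_elems")).show(truncate=False)
# -> [1003014.0, 15.0, 754.0]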

What tools to use to visualize logical and physical query plans?

Submitted by ≯℡__Kan透↙ on 2021-01-04 05:38:20
Question: I am familiar with explain() (and the Web UI). I was curious whether there are any tools that generate an image of the tree structure of the logical/physical plan before/after optimizations, i.e. the information returned by explain() rendered as an image. Answer 1: A picture like a PNG or JPG? I've never heard of one myself, but you can see the physical plan using the web UI (which you've already mentioned). The other phases of query execution are available through TreeNode methods, which (among many methods that could help …
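
As a complement to the (truncated) answer, these are the text-based options explain() itself offers; none of them produces an image, but the formatted mode (Spark 3.0+) prints a more readable tree. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).selectExpr("id", "id * 2 AS doubled").filter("doubled > 4")

df.explain()                  # physical plan only
df.explain(extended=True)     # parsed, analyzed, optimized and physical plans
df.explain(mode="formatted")  # Spark 3.0+: plan overview plus per-node details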

How to concatenate multiple columns in PySpark with a separator?

Submitted by 。_饼干妹妹 on 2021-01-04 05:32:46
Question: I have a PySpark DataFrame and I would like to join 3 columns.

id | column_1 | column_2 | column_3
 1 |       12 |       34 |       67
 2 |       45 |       78 |       90
 3 |       23 |       93 |       56

I want to join the 3 columns column_1, column_2, column_3 into a single column, adding "-" between their values. Expected result:

id | column_1 | column_2 | column_3 | column_join …
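
The excerpt is truncated, but one standard way to do this in PySpark is concat_ws, which joins columns with a separator. A minimal sketch using the sample values above (the output name column_join is taken from the expected result shown in the question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 12, 34, 67), (2, 45, 78, 90), (3, 23, 93, 56)],
    ["id", "column_1", "column_2", "column_3"],
)
# concat_ws casts the numeric columns to string and joins them with "-".
df.withColumn(
    "column_join", F.concat_ws("-", "column_1", "column_2", "column_3")
).show()
# column_join: 12-34-67, 45-78-90, 23-93-56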

How to yield pandas dataframe rows to spark dataframe

Submitted by 泄露秘密 on 2021-01-01 08:10:36
Question: Hi, I'm doing a transformation. I have created a some_function(iter) generator to yield Row(id=index, api=row['api'], A=row['A'], B=row['B']) in order to turn rows of a pandas DataFrame into an RDD and then into a Spark DataFrame, but I'm getting errors. (I must use pandas to transform the data, as there is a large amount of legacy code.) The input Spark DataFrame, respond_sdf.show(), has a single column named content …
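
The question is cut off before the error message, so the following is only a hypothetical sketch of the pattern it describes: a generator yielding Row objects from a pandas DataFrame, collected into a Spark DataFrame. The names some_function, api, A and B come from the question; the sample data is invented.

import pandas as pd
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def some_function(pdf):
    # Yield one Row per pandas row; plain Python types keep Spark's schema
    # inference from tripping over numpy scalars.
    for index, row in pdf.iterrows():
        yield Row(id=int(index), api=row["api"], A=int(row["A"]), B=float(row["B"]))

pdf = pd.DataFrame({"api": ["a", "b"], "A": [1, 2], "B": [3.0, 4.0]})
# Collect the generated Rows and let Spark infer the schema from them.
sdf = spark.createDataFrame(list(some_function(pdf)))
sdf.show()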

How to calculate daily basis in pyspark dataframe (time series)

Submitted by 核能气质少年 on 2021-01-01 06:27:25
Question: So I have a dataframe and I want to calculate some quantity, say on a daily basis. Let's say we have 10 columns col1, col2, col3, col4, ..., coln, where each column depends on the values of col1, col2, col3, col4, and so on, and the date resets based on the id.

date       | col1 | id | col2 | … | coln
2020-08-01 |    0 | M1 |    … |   |    3
2020-08-02 |    4 | M1 |   10 |   |
2020-08-03 |    3 | M1 |    … |   |    9
2020-08-04 |    2 | M1 |    … |   |    8
2020-08-05 |    1 | M1 |    … |   |    7
2020-08…
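
The excerpt stops before the actual formula, so as a stand-in this sketch shows the usual building block for per-id, date-ordered daily calculations in PySpark: a window partitioned by id and ordered by date, here computing a running sum of col1 over the sample rows above.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2020-08-01", 0, "M1"), ("2020-08-02", 4, "M1"),
     ("2020-08-03", 3, "M1"), ("2020-08-04", 2, "M1")],
    ["date", "col1", "id"],
)
# Per-id window from the first date up to the current row.
w = (Window.partitionBy("id").orderBy("date")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df.withColumn("col1_running_total", F.sum("col1").over(w)).show()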

Select spark dataframe column with special character in it using selectExpr

Submitted by 夙愿已清 on 2021-01-01 04:29:11
Question: I am in a scenario where my column name is Município, with an accent on the letter í. My selectExpr command is failing because of it. Is there a way to fix it? Basically, I have something like the following expression: .selectExpr("...CAST (Município as string) as Município...") What I really want is to be able to keep the column with the same name it came with, so that in the future I won't have this kind of problem on different tables/files. How can I make a Spark dataframe accept accents or other …
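
The excerpt is cut off here; a common workaround (an assumption, not quoted from an answer in this excerpt) is to quote the accented identifier with backticks so the SQL parser behind selectExpr accepts it. A minimal sketch with invented data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "3550308")], ["id", "Município"])
# Backticks quote the accented identifier, both in the CAST and in the alias,
# so the column keeps its original name.
df.selectExpr("id", "CAST(`Município` AS string) AS `Município`").show()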

How to get Last 1 hour data, every 5 minutes, without grouping?

Submitted by 僤鯓⒐⒋嵵緔 on 2020-12-30 03:13:27
Question: How to trigger every 5 minutes and get data for the last 1 hour? I came up with this, but it does not seem to give me all the rows in the last hour. My reasoning is: read the stream, filter the data for the last hour based on the timestamp column, write/print using foreachBatch, and watermark it so that it does not hold on to all the past data.

spark.readStream.format("delta").table("xxx")
  .withWatermark("ts", "60 minutes")
  .filter($"ts" > current_timestamp - expr("INTERVAL 60 minutes"))
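
The snippet above is Scala and stops before the sink; as a sketch only, the same read-and-filter shape in PySpark with a foreachBatch sink on a 5-minute processing-time trigger could look as follows. The table name "xxx" and the column "ts" come from the question; whether this actually returns all rows of the last hour still hinges on the watermark semantics the question is asking about.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream.format("delta").table("xxx")
    .withWatermark("ts", "60 minutes")
    .filter(F.col("ts") > F.current_timestamp() - F.expr("INTERVAL 60 minutes"))
)

def process_batch(batch_df, batch_id):
    # Placeholder sink: just report the row count of each micro-batch.
    print(batch_id, batch_df.count())

query = (
    stream.writeStream
    .foreachBatch(process_batch)
    .trigger(processingTime="5 minutes")  # fire every 5 minutes
    .start()
)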