apache-spark-sql

Scala/Spark - How to get first elements of all sub-arrays

Submitted by 旧街凉风 on 2021-01-05 09:10:04
Question: I have the following DataFrame in Spark (I'm using Scala): [[1003014, 0.95266926], [15, 0.9484202], [754, 0.94236785], [1029530, 0.880922], [3066, 0.7085166], [1066440, 0.69400793], [1045811, 0.663178], [1020059, 0.6274495], [1233982, 0.6112905], [1007801, 0.60937023], [1239278, 0.60044676], [1000088, 0.5789191], [1056268, 0.5747936], [1307569, 0.5676605], [10334513, 0.56592846], [930, 0.5446228], [1170206, 0.52525467], [300, 0.52473146], [2105178, 0.4972785], [1088572, 0.4815367]] I want …
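
The question is cut off here, so the following is only a sketch of one common approach: Spark's SQL higher-order function transform (Spark 2.4+) can pull the first element out of every sub-array. The column name recs and the array<array<double>> schema are assumptions (ids are written as doubles just so the nested array has a single element type); the question asks for Scala, but the identical expr(...) call works there too.

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
# Hypothetical input shaped like the excerpt above, shortened to three sub-arrays.
df = spark.createDataFrame(
    [([[1003014.0, 0.95266926], [15.0, 0.9484202], [754.0, 0.94236785]],)],
    ["recs"],
)
# transform applies x -> x[0] to each sub-array, keeping only its first element.
df.select(expr("transform(recs, x -> x[0])").alias("first_elems")).show(truncate=False)
# -> [1003014.0, 15.0, 754.0]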

What tools to use to visualize logical and physical query plans?

Submitted by ≯℡__Kan透↙ on 2021-01-04 05:38:20
Question: I am familiar with explain() (and the Web UI). I was curious whether there are any tools that generate an image of the tree structure of the logical/physical plan before/after optimizations, i.e. the information returned by explain() rendered as an image. Answer 1: A picture like a PNG or JPG? I've never heard of one myself, but you can see the physical plan using the web UI (which you've already mentioned). The other phases of query execution are available through TreeNode methods, which (among many methods that could help …
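
As a complement to the (truncated) answer, these are the text-based options explain() itself offers; none of them produces an image, but the formatted mode (Spark 3.0+) prints a more readable tree. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).selectExpr("id", "id * 2 AS doubled").filter("doubled > 4")

df.explain()                  # physical plan only
df.explain(extended=True)     # parsed, analyzed, optimized and physical plans
df.explain(mode="formatted")  # Spark 3.0+: plan overview plus per-node details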

How to concatenate multiple columns in PySpark with a separator?

Submitted by 。_饼干妹妹 on 2021-01-04 05:32:46
Question: I have a PySpark DataFrame and I would like to join 3 columns.

id | column_1 | column_2 | column_3
 1 |       12 |       34 |       67
 2 |       45 |       78 |       90
 3 |       23 |       93 |       56

I want to join the 3 columns column_1, column_2, column_3 into a single column, adding "-" between their values. Expected result:

id | column_1 | column_2 | column_3 | column_join …
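
The excerpt is truncated, but one standard way to do this in PySpark is concat_ws, which joins columns with a separator. A minimal sketch using the sample values above (the output name column_join is taken from the expected result shown in the question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 12, 34, 67), (2, 45, 78, 90), (3, 23, 93, 56)],
    ["id", "column_1", "column_2", "column_3"],
)
# concat_ws casts the numeric columns to string and joins them with "-".
df.withColumn(
    "column_join", F.concat_ws("-", "column_1", "column_2", "column_3")
).show()
# column_join: 12-34-67, 45-78-90, 23-93-56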

How to yield pandas dataframe rows to spark dataframe

Submitted by 泄露秘密 on 2021-01-01 08:10:36
Question: Hi, I'm doing a transformation. I have created a some_function(iter) generator to yield Row(id=index, api=row['api'], A=row['A'], B=row['B']) in order to turn rows of a pandas DataFrame into an RDD and then into a Spark DataFrame, but I'm getting errors. (I must use pandas to transform the data, as there is a large amount of legacy code.) The input Spark DataFrame, respond_sdf.show(), has a single column named content …
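
The question is cut off before the error message, so the following is only a hypothetical sketch of the pattern it describes: a generator yielding Row objects from a pandas DataFrame, collected into a Spark DataFrame. The names some_function, api, A and B come from the question; the sample data is invented.

import pandas as pd
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

def some_function(pdf):
    # Yield one Row per pandas row; plain Python types keep Spark's schema
    # inference from tripping over numpy scalars.
    for index, row in pdf.iterrows():
        yield Row(id=int(index), api=row["api"], A=int(row["A"]), B=float(row["B"]))

pdf = pd.DataFrame({"api": ["a", "b"], "A": [1, 2], "B": [3.0, 4.0]})
# Collect the generated Rows and let Spark infer the schema from them.
sdf = spark.createDataFrame(list(some_function(pdf)))
sdf.show()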

How to calculate daily basis in pyspark dataframe (time series)

Submitted by 核能气质少年 on 2021-01-01 06:27:25
Question: So I have a dataframe and I want to calculate some quantity, say on a daily basis. Let's say we have 10 columns col1, col2, col3, col4, ..., coln, where each column depends on the values of col1, col2, col3, col4, and so on, and the date resets based on the id.

date       | col1 | id | col2 | … | coln
2020-08-01 |    0 | M1 |    … |   |    3
2020-08-02 |    4 | M1 |   10 |   |
2020-08-03 |    3 | M1 |    … |   |    9
2020-08-04 |    2 | M1 |    … |   |    8
2020-08-05 |    1 | M1 |    … |   |    7
2020-08…
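
The excerpt stops before the actual formula, so as a stand-in this sketch shows the usual building block for per-id, date-ordered daily calculations in PySpark: a window partitioned by id and ordered by date, here computing a running sum of col1 over the sample rows above.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("2020-08-01", 0, "M1"), ("2020-08-02", 4, "M1"),
     ("2020-08-03", 3, "M1"), ("2020-08-04", 2, "M1")],
    ["date", "col1", "id"],
)
# Per-id window from the first date up to the current row.
w = (Window.partitionBy("id").orderBy("date")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df.withColumn("col1_running_total", F.sum("col1").over(w)).show()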

Select spark dataframe column with special character in it using selectExpr

Submitted by 夙愿已清 on 2021-01-01 04:29:11
Question: I am in a scenario where my column name is Município, with an accent on the letter í. My selectExpr command is failing because of it. Is there a way to fix it? Basically, I have something like the following expression: .selectExpr("...CAST (Município as string) as Município...") What I really want is to be able to keep the column with the same name it came with, so that in the future I won't have this kind of problem on different tables/files. How can I make a Spark dataframe accept accents or other …
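
The excerpt is cut off here; a common workaround (an assumption, not quoted from an answer in this excerpt) is to quote the accented identifier with backticks so the SQL parser behind selectExpr accepts it. A minimal sketch with invented data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "3550308")], ["id", "Município"])
# Backticks quote the accented identifier, both in the CAST and in the alias,
# so the column keeps its original name.
df.selectExpr("id", "CAST(`Município` AS string) AS `Município`").show()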

How to get Last 1 hour data, every 5 minutes, without grouping?

Submitted by 僤鯓⒐⒋嵵緔 on 2020-12-30 03:13:27
Question: How to trigger every 5 minutes and get data for the last 1 hour? I came up with this, but it does not seem to give me all the rows in the last hour. My reasoning is: read the stream, filter the data for the last hour based on the timestamp column, write/print using foreachBatch, and watermark it so that it does not hold on to all the past data.

spark.readStream.format("delta").table("xxx")
  .withWatermark("ts", "60 minutes")
  .filter($"ts" > current_timestamp - expr("INTERVAL 60 minutes"))
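
The snippet above is Scala and stops before the sink; as a sketch only, the same read-and-filter shape in PySpark with a foreachBatch sink on a 5-minute processing-time trigger could look as follows. The table name "xxx" and the column "ts" come from the question; whether this actually returns all rows of the last hour still hinges on the watermark semantics the question is asking about.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream.format("delta").table("xxx")
    .withWatermark("ts", "60 minutes")
    .filter(F.col("ts") > F.current_timestamp() - F.expr("INTERVAL 60 minutes"))
)

def process_batch(batch_df, batch_id):
    # Placeholder sink: just report the row count of each micro-batch.
    print(batch_id, batch_df.count())

query = (
    stream.writeStream
    .foreachBatch(process_batch)
    .trigger(processingTime="5 minutes")  # fire every 5 minutes
    .start()
)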