apache-spark-sql

Load dataframe from pyspark

好久不见. Submitted on 2020-12-15 05:23:06
Question: I am trying to connect to an MS SQL DB from PySpark using spark.read.jdbc:

import os
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

df = spark.read \
    .format('jdbc') \
    .option('url', 'jdbc:sqlserver://local:1433') \
    .option('user', 'sa') \
    .option('password', '12345') \
    .option('dbtable', '(select COL1, COL2 from tbl1 WHERE COL1 = 2)')

then I do df
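
The excerpt stops short of the error, but two things commonly bite in this setup: a subquery passed as dbtable needs an alias, and .load() has to be called to actually produce a DataFrame. A minimal sketch, assuming the Microsoft JDBC driver is on the classpath and treating the host, port, credentials and databaseName as placeholders:

```python
from pyspark.sql import SparkSession

# Host, port, credentials and database name are placeholders taken from the
# question; adjust them for a real environment.
spark = SparkSession.builder.appName("mssql-read").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://local:1433;databaseName=mydb")  # assumed database name
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("user", "sa")
    .option("password", "12345")
    # A subquery used as dbtable must be wrapped and aliased.
    .option("dbtable", "(SELECT COL1, COL2 FROM tbl1 WHERE COL1 = 2) AS t")
    .load()  # without load() no DataFrame is actually returned
)

df.show()
```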

select best record possible

限于喜欢 Submitted on 2020-12-15 00:43:21
Question: Have different files in a directory as below:

f1.txt
id FName Lname Adrress sex levelId
t1 Girish Hm 10oak m 1111
t2 Kiran Kumar 5wren m 2222
t3 sara chauhan 15nvi f 6666

f2.txt
t4 girish hm 11oak m 1111
t5 Kiran Kumar 5wren f 2222
t6 Prakash Jha 18nvi f 3333

f3.txt
t7 Kiran Kumar 5wren f 2222
t8 Girish Hm 10oak m 1111
t9 Prakash Jha 18nvi m 3333

f4.txt
t10 Kiran Kumar 5wren f 2222
t11 girish hm 10oak m 1111
t12 Prakash Jha 18nvi f 3333

only first name and last name constant here and case
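
The excerpt cuts off before the selection rule, but a common shape for "keep one best record per person" is a window keyed on the case-normalized names. A sketch under assumed details: single-space-delimited files with a header row, and the highest id kept as the tie-breaker (the real rule may differ):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("best-record").getOrCreate()

# Assumed layout: single-space-delimited text files with a header row,
# all readable with one glob (the path is hypothetical).
df = (
    spark.read.option("header", "true")
    .option("delimiter", " ")
    .csv("/path/to/dir/f*.txt")
)

# First and last name are the stable key, but their case varies across files,
# so partition on the lower-cased names. Tie-breaking by id is an assumption.
w = Window.partitionBy(F.lower("FName"), F.lower("Lname")).orderBy(F.col("id").desc())

best = (
    df.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
best.show()
```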

Why does the Spark (Scala API) agg function take expr and exprs arguments?

会有一股神秘感。 Submitted on 2020-12-13 03:39:23
Question: The Spark API RelationalGroupedDataset has a function agg:

@scala.annotation.varargs
def agg(expr: Column, exprs: Column*): DataFrame = {
  toDF((expr +: exprs).map {
    case typed: TypedColumn[_, _] =>
      typed.withInputType(df.exprEnc, df.logicalPlan.output).expr
    case c => c.expr
  })
}

Why does it take two separate arguments? Why can't it take just exprs: Column*? Is there an implicit function that takes one argument?

Answer 1: This is to make sure that you specify at least one argument. Pure varargs
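
The (expr, exprs*) split is the usual way to express "at least one argument" with varargs: a plain exprs: Column* would also accept an empty agg() call. A small Python analogue of the same pattern (a hypothetical function, shown only to illustrate the idea, since a single trailing *args parameter cannot by itself enforce a minimum of one):

```python
def agg(expr, *exprs):
    """Accepts one or more expressions; calling agg() with no arguments is a
    TypeError, mirroring the intent of Spark's agg(expr: Column, exprs: Column*)."""
    return [expr, *exprs]

print(agg("sum(x)"))             # ['sum(x)']
print(agg("sum(x)", "avg(y)"))   # ['sum(x)', 'avg(y)']
# agg()  # TypeError: agg() missing 1 required positional argument: 'expr'
```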

How to compare two StructTypes sharing the same contents?

半城伤御伤魂 Submitted on 2020-12-13 03:31:25
Question: It seems like StructType preserves order, so two StructType instances containing the same StructFields are not considered equivalent. For example:

val st1 = StructType(
  StructField("ii", StringType, true) ::
  StructField("i", StringType, true) :: Nil)

val st2 = StructType(
  StructField("i", StringType, true) ::
  StructField("ii", StringType, true) :: Nil)

println(st1 == st2)

returns false even though they both have StructField("i",StringType,true) and StructField("ii",StringType,true), just in a different order. I
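
One workaround is to compare the schemas with their fields sorted by name instead of relying on ==. The question is about the Scala API, but PySpark's StructType equality behaves the same way, so here is a sketch in Python; nested structs and field metadata are not handled:

```python
from pyspark.sql.types import StructType, StructField, StringType

st1 = StructType([StructField("ii", StringType(), True),
                  StructField("i", StringType(), True)])
st2 = StructType([StructField("i", StringType(), True),
                  StructField("ii", StringType(), True)])

def same_fields(a: StructType, b: StructType) -> bool:
    # Order-insensitive comparison: sort the top-level fields by name first.
    return sorted(a.fields, key=lambda f: f.name) == sorted(b.fields, key=lambda f: f.name)

print(st1 == st2)             # False: StructType equality respects field order
print(same_fields(st1, st2))  # True
```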

PySpark get related records from its array object values

巧了我就是萌 Submitted on 2020-12-13 03:12:44
Question: I have a Spark dataframe that has an ID column and, along with other columns, an array column that contains the IDs of its related records as its value. An example dataframe:

ID  | NAME | RELATED_IDLIST
---------------------------
123 | mike | [345,456]
345 | alen | [789]
456 | sam  | [789,999]
789 | marc | [111]
555 | dan  | [333]

From the above, I need to append all the related child IDs to the array column of the parent ID. The resultant DF should be like ID | NAME | RELATED
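
The excerpt is cut off, but one way to start (assuming Spark 2.4+ for the array functions) is to explode the array, join each child back to pick up its own list, and re-aggregate. The sketch below expands only one level; a fully transitive closure would repeat the join until the result stops changing, or use GraphFrames:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("related-ids").getOrCreate()

df = spark.createDataFrame(
    [(123, "mike", [345, 456]), (345, "alen", [789]), (456, "sam", [789, 999]),
     (789, "marc", [111]), (555, "dan", [333])],
    ["ID", "NAME", "RELATED_IDLIST"],
)

# Parent -> child edges, plus each child's own list so it can be pulled in.
edges = df.select("ID", F.explode("RELATED_IDLIST").alias("CHILD_ID"))
child_lists = df.select(F.col("ID").alias("CHILD_ID"),
                        F.col("RELATED_IDLIST").alias("CHILD_LIST"))

# One level of expansion: each parent's list now also contains its grandchildren.
expanded = (
    edges.join(child_lists, "CHILD_ID", "left")
    .groupBy("ID")
    .agg(F.array_distinct(F.concat(
        F.collect_list("CHILD_ID"),                # direct children
        F.flatten(F.collect_list("CHILD_LIST")),   # their children (nulls are skipped)
    )).alias("RELATED_IDLIST"))
)

result = df.select("ID", "NAME").join(expanded, "ID", "left")
result.show(truncate=False)
```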

How to use windowing functions efficiently to decide the next N rows based on the previous N values

爷，独闯天下 Submitted on 2020-12-09 06:14:17
Question: Hi, I have the following data.

+----------+----+-------+----------+
|      date|item|avg_val|conditions|
+----------+----+-------+----------+
|01-10-2020|   x|     10|         0|
|02-10-2020|   x|     10|         0|
|03-10-2020|   x|     15|         1|
|04-10-2020|   x|     15|         1|
|05-10-2020|   x|      5|         0|
|06-10-2020|   x|     13|         1|
|07-10-2020|   x|     10|         1|
|08-10-2020|   x|     10|         0|
|09-10-2020|   x|     15|         1|
|01-10-2020|   y|     10|         0|
|02-10-2020|   y|     18|         0|
|03-10-2020|   y|      6|         1|
|04-10-2020|   y|     10|         0|
|05-10-2020|   y|     20|         0|
+----------+----+-------+----------+
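
The question is truncated here, but the general shape of "look at the previous N rows per item" is a window ordered by the parsed date with a rowsBetween frame. A sketch assuming N = 3 and a dd-MM-yyyy date format; the asker's actual decision rule is not reproduced:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("prev-n-rows").getOrCreate()

# First few rows of the sample data; dates are parsed so ordering is by date,
# not by string.
df = spark.createDataFrame(
    [("01-10-2020", "x", 10, 0), ("02-10-2020", "x", 10, 0), ("03-10-2020", "x", 15, 1)],
    ["date", "item", "avg_val", "conditions"],
).withColumn("date", F.to_date("date", "dd-MM-yyyy"))

N = 3  # assumed look-back size
w = Window.partitionBy("item").orderBy("date").rowsBetween(-N, -1)

# Average of the previous N avg_val values for each row (null when there are none).
out = df.withColumn("prev_n_avg", F.avg("avg_val").over(w))
out.show()
```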

Apply window function over multiple columns

|▌冷眼眸甩不掉的悲伤 Submitted on 2020-12-08 07:22:51
Question: I would like to perform a window function (concretely, a moving average) over all columns of a dataframe. I can do it this way:

from pyspark.sql import SparkSession, functions as func

df = ...
df.select([func.avg(df[col]).over(windowSpec).alias(col) for col in df.columns])

but I'm afraid this isn't very efficient. Is there a better way to do it?

Answer 1: An alternative which may be better is to create a new df where you Group By the columns in Window function and apply average on the remaining
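
For reference, a self-contained version of the comprehension from the question, with the windowSpec it assumes spelled out; the grouping column, ordering column, and 3-row frame are assumptions:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("moving-avg-all-cols").getOrCreate()

df = spark.createDataFrame(
    [("a", 1, 10.0, 1.0), ("a", 2, 12.0, 2.0), ("a", 3, 11.0, 3.0)],
    ["group", "time", "val1", "val2"],
)

# Assumed window: per group, ordered by time, current row plus the two before it.
windowSpec = Window.partitionBy("group").orderBy("time").rowsBetween(-2, 0)

numeric_cols = ["val1", "val2"]
out = df.select(
    "group", "time",
    *[F.avg(F.col(c)).over(windowSpec).alias(c) for c in numeric_cols],
)
out.show()
```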