apache-spark-sql

Load dataframe from pyspark

好久不见. Submitted on 2020-12-15 05:23:06
Question: I am trying to connect to an MS SQL DB from PySpark using spark.read.jdbc:

import os
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

df = spark.read \
    .format('jdbc') \
    .option('url', 'jdbc:sqlserver://local:1433') \
    .option('user', 'sa') \
    .option('password', '12345') \
    .option('dbtable', '(select COL1, COL2 from tbl1 WHERE COL1 = 2)')

then I do df
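
The excerpt stops short of the error, but two things commonly bite in this setup: a subquery passed as dbtable needs an alias, and .load() has to be called to actually produce a DataFrame. A minimal sketch, assuming the Microsoft JDBC driver is on the classpath and treating the host, port, credentials and databaseName as placeholders:

```python
from pyspark.sql import SparkSession

# Host, port, credentials and database name are placeholders taken from the
# question; adjust them for a real environment.
spark = SparkSession.builder.appName("mssql-read").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://local:1433;databaseName=mydb")  # assumed database name
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("user", "sa")
    .option("password", "12345")
    # A subquery used as dbtable must be wrapped and aliased.
    .option("dbtable", "(SELECT COL1, COL2 FROM tbl1 WHERE COL1 = 2) AS t")
    .load()  # without load() no DataFrame is actually returned
)

df.show()
```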

select best record possible

限于喜欢 Submitted on 2020-12-15 00:43:21
Question: Have different files in a directory as below:

f1.txt
id FName Lname Adrress sex levelId
t1 Girish Hm 10oak m 1111
t2 Kiran Kumar 5wren m 2222
t3 sara chauhan 15nvi f 6666

f2.txt
t4 girish hm 11oak m 1111
t5 Kiran Kumar 5wren f 2222
t6 Prakash Jha 18nvi f 3333

f3.txt
t7 Kiran Kumar 5wren f 2222
t8 Girish Hm 10oak m 1111
t9 Prakash Jha 18nvi m 3333

f4.txt
t10 Kiran Kumar 5wren f 2222
t11 girish hm 10oak m 1111
t12 Prakash Jha 18nvi f 3333

only first name and last name constant here and case
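
The excerpt cuts off before the selection rule, but a common shape for "keep one best record per person" is a window keyed on the case-normalized names. A sketch under assumed details: single-space-delimited files with a header row, and the highest id kept as the tie-breaker (the real rule may differ):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("best-record").getOrCreate()

# Assumed layout: single-space-delimited text files with a header row,
# all readable with one glob (the path is hypothetical).
df = (
    spark.read.option("header", "true")
    .option("delimiter", " ")
    .csv("/path/to/dir/f*.txt")
)

# First and last name are the stable key, but their case varies across files,
# so partition on the lower-cased names. Tie-breaking by id is an assumption.
w = Window.partitionBy(F.lower("FName"), F.lower("Lname")).orderBy(F.col("id").desc())

best = (
    df.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .drop("rn")
)
best.show()
```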

Why does the Spark (Scala API) agg function take expr and exprs arguments?

会有一股神秘感。 Submitted on 2020-12-13 03:39:23
Question: The Spark API RelationalGroupedDataset has a function agg:

@scala.annotation.varargs
def agg(expr: Column, exprs: Column*): DataFrame = {
  toDF((expr +: exprs).map {
    case typed: TypedColumn[_, _] =>
      typed.withInputType(df.exprEnc, df.logicalPlan.output).expr
    case c => c.expr
  })
}

Why does it take two separate arguments? Why can't it take just exprs: Column*? Is there an implicit function that takes one argument?

Answer 1: This is to make sure that you specify at least one argument. Pure varargs
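
The (expr, exprs*) split is the usual way to express "at least one argument" with varargs: a plain exprs: Column* would also accept an empty agg() call. A small Python analogue of the same pattern (a hypothetical function, shown only to illustrate the idea, since a single trailing *args parameter cannot by itself enforce a minimum of one):

```python
def agg(expr, *exprs):
    """Accepts one or more expressions; calling agg() with no arguments is a
    TypeError, mirroring the intent of Spark's agg(expr: Column, exprs: Column*)."""
    return [expr, *exprs]

print(agg("sum(x)"))             # ['sum(x)']
print(agg("sum(x)", "avg(y)"))   # ['sum(x)', 'avg(y)']
# agg()  # TypeError: agg() missing 1 required positional argument: 'expr'
```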

How to compare two StructTypes sharing the same contents?

半城伤御伤魂 Submitted on 2020-12-13 03:31:25
Question: It seems like StructType preserves order, so two StructType instances containing the same StructFields are not considered equivalent. For example:

val st1 = StructType(
  StructField("ii", StringType, true) ::
  StructField("i", StringType, true) :: Nil)

val st2 = StructType(
  StructField("i", StringType, true) ::
  StructField("ii", StringType, true) :: Nil)

println(st1 == st2)

returns false even though they both have StructField("i",StringType,true) and StructField("ii",StringType,true), just in a different order. I
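
One workaround is to compare the schemas with their fields sorted by name instead of relying on ==. The question is about the Scala API, but PySpark's StructType equality behaves the same way, so here is a sketch in Python; nested structs and field metadata are not handled:

```python
from pyspark.sql.types import StructType, StructField, StringType

st1 = StructType([StructField("ii", StringType(), True),
                  StructField("i", StringType(), True)])
st2 = StructType([StructField("i", StringType(), True),
                  StructField("ii", StringType(), True)])

def same_fields(a: StructType, b: StructType) -> bool:
    # Order-insensitive comparison: sort the top-level fields by name first.
    return sorted(a.fields, key=lambda f: f.name) == sorted(b.fields, key=lambda f: f.name)

print(st1 == st2)             # False: StructType equality respects field order
print(same_fields(st1, st2))  # True
```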

PySpark get related records from its array object values

巧了我就是萌 Submitted on 2020-12-13 03:12:44
Question: I have a Spark dataframe that has an ID column and, along with other columns, an array column that contains the IDs of its related records as its value. An example dataframe:

ID  | NAME | RELATED_IDLIST
---------------------------
123 | mike | [345,456]
345 | alen | [789]
456 | sam  | [789,999]
789 | marc | [111]
555 | dan  | [333]

From the above, I need to append all the related child IDs to the array column of the parent ID. The resultant DF should be like ID | NAME | RELATED
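
The excerpt is cut off, but one way to start (assuming Spark 2.4+ for the array functions) is to explode the array, join each child back to pick up its own list, and re-aggregate. The sketch below expands only one level; a fully transitive closure would repeat the join until the result stops changing, or use GraphFrames:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("related-ids").getOrCreate()

df = spark.createDataFrame(
    [(123, "mike", [345, 456]), (345, "alen", [789]), (456, "sam", [789, 999]),
     (789, "marc", [111]), (555, "dan", [333])],
    ["ID", "NAME", "RELATED_IDLIST"],
)

# Parent -> child edges, plus each child's own list so it can be pulled in.
edges = df.select("ID", F.explode("RELATED_IDLIST").alias("CHILD_ID"))
child_lists = df.select(F.col("ID").alias("CHILD_ID"),
                        F.col("RELATED_IDLIST").alias("CHILD_LIST"))

# One level of expansion: each parent's list now also contains its grandchildren.
expanded = (
    edges.join(child_lists, "CHILD_ID", "left")
    .groupBy("ID")
    .agg(F.array_distinct(F.concat(
        F.collect_list("CHILD_ID"),                # direct children
        F.flatten(F.collect_list("CHILD_LIST")),   # their children (nulls are skipped)
    )).alias("RELATED_IDLIST"))
)

result = df.select("ID", "NAME").join(expanded, "ID", "left")
result.show(truncate=False)
```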

How to use windowing functions efficiently to decide the next N rows based on the previous N values

爷，独闯天下 Submitted on 2020-12-09 06:14:17
Question: Hi, I have the following data.

+----------+----+-------+----------+
|      date|item|avg_val|conditions|
+----------+----+-------+----------+
|01-10-2020|   x|     10|         0|
|02-10-2020|   x|     10|         0|
|03-10-2020|   x|     15|         1|
|04-10-2020|   x|     15|         1|
|05-10-2020|   x|      5|         0|
|06-10-2020|   x|     13|         1|
|07-10-2020|   x|     10|         1|
|08-10-2020|   x|     10|         0|
|09-10-2020|   x|     15|         1|
|01-10-2020|   y|     10|         0|
|02-10-2020|   y|     18|         0|
|03-10-2020|   y|      6|         1|
|04-10-2020|   y|     10|         0|
|05-10-2020|   y|     20|         0|
+----------+----+-------+----------+
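
The question is truncated here, but the general shape of "look at the previous N rows per item" is a window ordered by the parsed date with a rowsBetween frame. A sketch assuming N = 3 and a dd-MM-yyyy date format; the asker's actual decision rule is not reproduced:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("prev-n-rows").getOrCreate()

# First few rows of the sample data; dates are parsed so ordering is by date,
# not by string.
df = spark.createDataFrame(
    [("01-10-2020", "x", 10, 0), ("02-10-2020", "x", 10, 0), ("03-10-2020", "x", 15, 1)],
    ["date", "item", "avg_val", "conditions"],
).withColumn("date", F.to_date("date", "dd-MM-yyyy"))

N = 3  # assumed look-back size
w = Window.partitionBy("item").orderBy("date").rowsBetween(-N, -1)

# Average of the previous N avg_val values for each row (null when there are none).
out = df.withColumn("prev_n_avg", F.avg("avg_val").over(w))
out.show()
```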

Apply window function over multiple columns

|▌冷眼眸甩不掉的悲伤 Submitted on 2020-12-08 07:22:51
Question: I would like to perform a window function (concretely, a moving average) over all columns of a dataframe. I can do it this way:

from pyspark.sql import SparkSession, functions as func

df = ...
df.select([func.avg(df[col]).over(windowSpec).alias(col) for col in df.columns])

but I'm afraid this isn't very efficient. Is there a better way to do it?

Answer 1: An alternative which may be better is to create a new df where you Group By the columns in Window function and apply average on the remaining
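
For reference, a self-contained version of the comprehension from the question, with the windowSpec it assumes spelled out; the grouping column, ordering column, and 3-row frame are assumptions:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("moving-avg-all-cols").getOrCreate()

df = spark.createDataFrame(
    [("a", 1, 10.0, 1.0), ("a", 2, 12.0, 2.0), ("a", 3, 11.0, 3.0)],
    ["group", "time", "val1", "val2"],
)

# Assumed window: per group, ordered by time, current row plus the two before it.
windowSpec = Window.partitionBy("group").orderBy("time").rowsBetween(-2, 0)

numeric_cols = ["val1", "val2"]
out = df.select(
    "group", "time",
    *[F.avg(F.col(c)).over(windowSpec).alias(c) for c in numeric_cols],
)
out.show()
```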