spark-dataframe

Spark treating null values in csv column as null datatype

点点圈 posted on 2021-02-04 18:07:22
Question: My Spark application reads a CSV file, transforms it to a different format with SQL, and writes the resulting dataframe to a different CSV file. For example, I have an input CSV as follows:

Id|FirstName|LastName|LocationId
1|John|Doe|123
2|Alex|Doe|234

My transformation is:

Select Id, FirstName, LastName, LocationId as PrimaryLocationId, null as SecondaryLocationId from Input

(I can't answer why null is being used as SecondaryLocationId; it is a business use case.) Now Spark can't figure out the…
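
A minimal PySpark sketch of the usual workaround, assuming the transformation runs through spark.sql and that the file paths and options below are illustrative: a bare null literal gets NullType, so casting it gives the column a concrete string type the CSV writer can handle.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the pipe-delimited input and register it for SQL (paths are illustrative).
spark.read.option("header", True).option("sep", "|").csv("input.csv") \
    .createOrReplaceTempView("Input")

# CAST the null literal so the column has a concrete STRING type instead of NullType.
out = spark.sql("""
    SELECT Id, FirstName, LastName,
           LocationId AS PrimaryLocationId,
           CAST(null AS STRING) AS SecondaryLocationId
    FROM Input
""")

out.write.option("header", True).option("sep", "|").csv("output_dir")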

Flatten Nested Struct in PySpark Array

跟風遠走 posted on 2021-02-04 16:37:26
Question: Given a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisors: struct
 |    |    |    |-- advisor1: string
 |    |    |    |-- advisor2: string

how can I get a schema like:

root
 |-- first_name: string
 |-- last_name: string
 |-- degrees: array
 |    |-- element: struct
 |    |    |-- school: string
 |    |    |-- advisor1: string
 |    |    |-- advisor2: string

Currently, I explode the array, flatten the structure by selecting advisor.*, and then…
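
One way to do this without the explode step, assuming Spark 2.4+ and a dataframe df with the schema above (both assumptions), is to rewrite each array element with the transform higher-order function:

from pyspark.sql import functions as F

# Rebuild every element of `degrees`, pulling the advisor fields up one level.
flattened = df.withColumn(
    "degrees",
    F.expr(
        "transform(degrees, d -> named_struct("
        "'school', d.school, "
        "'advisor1', d.advisors.advisor1, "
        "'advisor2', d.advisors.advisor2))"
    ),
)
flattened.printSchema()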

Is there any size limit for Spark-Dataframe to process/hold columns at a time?

♀尐吖头ヾ posted on 2021-01-29 03:30:53
Question: I would like to know whether a Spark DataFrame has a limit on column count, i.e. whether the maximum number of columns a DataFrame can process/hold at a time is less than 500. I am asking because, while parsing an XML with fewer than 500 tags, I can process it and generate the corresponding Parquet file successfully, but with more than 500 tags the generated Parquet file is empty. Any idea why?

Source: https://stackoverflow.com/questions/38696047/is-there-any-size-limit-for-spark-dataframe-to
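
Spark itself does not document any 500-column cap, so the empty Parquet output more likely points at the XML-parsing step; a quick standalone check along these lines (column count and output path are arbitrary) can rule DataFrame width out:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Build a 600-column DataFrame and round-trip it through Parquet.
wide = spark.range(10).select(*[F.lit(i).alias(f"c{i}") for i in range(600)])
wide.write.mode("overwrite").parquet("/tmp/wide_check")
print(len(spark.read.parquet("/tmp/wide_check").columns))  # expected: 600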

Checking whether a column has proper decimal number

♀尐吖头ヾ posted on 2021-01-28 08:55:14
Question: I have a dataframe (input_dataframe) which looks like this:

id  test_column
1   0.25
2   1.1
3   12
4   test
5   1.3334
6   .11

I want to add a column result, which holds 1 if test_column has a decimal value and 0 if test_column has any other value. The data type of test_column is string. Below is the expected output:

id  test_column  result
1   0.25         1
2   1.1          1
3   12           0
4   test         0
5   1.3334       1
6   .11          1

Can we achieve this using PySpark code?

Answer 1: You can parse the decimal token with decimal.Decimal(). Here we are…
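
A sketch of the decimal.Decimal() idea the answer refers to, written as a PySpark UDF; the extra "." check makes plain integers such as 12 land in the 0 bucket, matching the expected output above.

from decimal import Decimal, InvalidOperation
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def is_decimal(s):
    # 1 only for strings that contain a decimal point and parse as a number.
    if s is None or "." not in s:
        return 0
    try:
        Decimal(s)
        return 1
    except InvalidOperation:
        return 0

is_decimal_udf = F.udf(is_decimal, IntegerType())
result_df = input_dataframe.withColumn("result", is_decimal_udf(F.col("test_column")))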

How to make VectorAssembler not compress data?

僤鯓⒐⒋嵵緔 posted on 2021-01-28 05:32:51
Question: I want to combine multiple columns into one column using VectorAssembler, but the data is compressed by default and there is no option to change that.

val arr2 = Array((1,2,0,0,0), (1,2,3,0,0), (1,2,4,5,0), (1,2,2,5,6))
val df = sc.parallelize(arr2).toDF("a", "b", "c", "e", "f")
val colNames = Array("a", "b", "c", "e", "f")
val assembler = new VectorAssembler()
  .setInputCols(colNames)
  .setOutputCol("newCol")
val transDF = assembler.transform(df).select(col("newCol"))
transDF.show(false)

The input is:

+---+---+---+---+---+
|…
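
The "compression" is VectorAssembler choosing a SparseVector whenever that representation is smaller; the assembler itself has no switch for it. A PySpark sketch of one workaround (the Scala API is analogous; vector_to_array needs Spark 3.0+, and df is assumed to be the frame built above):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

assembler = VectorAssembler(inputCols=["a", "b", "c", "e", "f"], outputCol="newCol")

# vector_to_array always yields a plain (dense) array, regardless of how the
# assembler chose to store the vector internally.
trans_df = assembler.transform(df).select(vector_to_array(F.col("newCol")).alias("newCol"))
trans_df.show(truncate=False)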

How to Split rows to different columns in Spark DataFrame/DataSet?

老子叫甜甜 posted on 2021-01-28 02:18:31
Question: Suppose I have a data set like:

Name | Subject | Y1   | Y2
A    | math    | 1998 | 2000
B    |         | 1996 | 1999
     | science | 2004 | 2005

I want to split the rows of this data set so that the Y2 column is eliminated, like:

Name | Subject | Y1
A    | math    | 1998
A    | math    | 1999
A    | math    | 2000
B    |         | 1996
B    |         | 1997
B    |         | 1998
B    |         | 1999
     | science | 2004
     | science | 2005

Can someone suggest something here? I hope I have made my query clear. Thanks in advance.

Answer 1: I think you only need to create a UDF to create the…
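
If Spark 2.4+ is available, the year range can be generated without the UDF the answer mentions, by combining sequence with explode; a PySpark sketch, assuming df holds the four columns above with Y1 and Y2 as integers:

from pyspark.sql import functions as F

# One output row per year in the inclusive Y1..Y2 range, then drop Y2.
expanded = (
    df.withColumn("Y1", F.explode(F.sequence(F.col("Y1"), F.col("Y2"))))
      .drop("Y2")
)
expanded.show()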

How do I pass parameters to selectExpr? SparkSQL-Scala

和自甴很熟 posted on 2021-01-27 22:23:32
Question: :) When you have a data frame, you can add columns and fill their rows with the method selectExpr. Something like this:

scala> table.show
+------+--------+---------+--------+--------+
|idempr|tipperrd| codperrd|tipperrt|codperrt|
+------+--------+---------+--------+--------+
|  OlcM|       h|999999999|       J|       0|
|  zOcQ|       r|777777777|       J|       1|
|  kyGp|       t|333333333|       J|       2|
|  BEuX|       A|999999999|       F|       3|

scala> var table2 = table.selectExpr("idempr", "tipperrd", "codperrd", "tipperrt", "codperrt", "'hola' as Saludo"…
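
The usual approach is ordinary string interpolation before calling selectExpr; a PySpark sketch of the idea (a Scala version would build the same expression with an s-interpolated string; saludo is an illustrative parameter name):

saludo = "hola"

# Build the literal-with-alias expression from the parameter, then pass it
# along with the other column names.
table2 = table.selectExpr(
    "idempr", "tipperrd", "codperrd", "tipperrt", "codperrt",
    f"'{saludo}' as Saludo",
)
table2.show()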

Spark 2.0.0: SparkR CSV Import

依然范特西╮ posted on 2021-01-27 06:48:37
Question: I am trying to read a CSV file into SparkR (running Spark 2.0.0) and to experiment with the newly added features. I am using RStudio here. I am getting an error while "reading" the source file. My code:

Sys.setenv(SPARK_HOME = "C:/spark-2.0.0-bin-hadoop2.6")
library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(master = "local[*]", appName = "SparkR")
df <- loadDF("F:/file.csv", "csv", header = "true")

I get an error at the loadDF function. The…
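
The excerpt cuts off before the error text, so only a comparison sketch is possible here: the same read expressed in PySpark with explicit format and header options (in SparkR the analogous call would be read.df("F:/file.csv", source = "csv", header = "true")).

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("SparkR").getOrCreate()

# Explicit csv format and header option; the path is the one from the question.
df = spark.read.format("csv").option("header", "true").load("F:/file.csv")
df.printSchema()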
