apache-spark-sql

How to compute the numerical difference between columns of different dataframes?

你。 submitted on 2021-02-08 09:16:09
Question: Given two Spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between the two dataframes and store it in another dataframe (or optionally in another data structure). For instance, take the following datasets.

DataFrame A:
+----+---+
|  A | B |
+----+---+
|   1|  0|
|   1|  0|
+----+---+

DataFrame B:
+----+---+
|  A | B |
+----+---+
|   1|  0|
|   0|  0|
+----+---+

How do I obtain B-A, i.e.
+----+---+
| c1 | c2|
+----+---+
|   0|  0|
|  -1|  0|
+----+---+
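One possible approach, sketched here only as an illustration (it is not part of the question): since DataFrames have no guaranteed row order, attach a positional index to each frame with zipWithIndex, join on that index, and subtract the matching columns. The names dfA, dfB and row_idx are invented for the example, and a SparkSession named spark is assumed, as in spark-shell.

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach a positional index so rows of the two frames can be matched up.
def withRowIndex(df: DataFrame): DataFrame = {
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  val schema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
  df.sparkSession.createDataFrame(indexed, schema)
}

import spark.implicits._
val dfA = Seq((1, 0), (1, 0)).toDF("A", "B")   // DataFrame A from the question
val dfB = Seq((1, 0), (0, 0)).toDF("A", "B")   // DataFrame B from the question

val a = withRowIndex(dfA)
val b = withRowIndex(dfB)

// Join on the index, then compute B - A column by column.
val diff = a.join(b, "row_idx")
  .select(dfA.columns.map(c => (b(c) - a(c)).alias(c)): _*)

diff.show()   // expected rows: (0, 0) and (-1, 0)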

How to get the size of a data frame before doing the broadcast join in pyspark

时光总嘲笑我的痴心妄想 submitted on 2021-02-08 09:14:02
Question: I am new to Spark. I want to do a broadcast join, and before that I am trying to get the size of the data frame that I want to broadcast. Is there any way to find the size of a data frame? I am using Python as my programming language for Spark. Any help is much appreciated.

Answer 1: If you are looking for the size in bytes as well as the size in row count, follow this.

Alternative 1:
// ### Alternative -1
/**
 * file content
 * spark-test-data.json
 * --------------------
 * {"id":1,"name":"abc1"}
 * {"id":2,"name":
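The answer above is cut off. For completeness, here is a hedged sketch in Scala of one commonly cited way to get both numbers. It assumes Spark 2.4 or later, where LogicalPlan.stats takes no arguments (the statistics API has changed across versions), so treat the byte figure as an estimate; the file name spark-test-data.json is taken from the comment block in the answer.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-size").master("local[*]").getOrCreate()

// Read the sample file mentioned in the answer's comment block.
val df = spark.read.json("spark-test-data.json")

val rowCount = df.count()                                              // size in rows
val estimatedBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes // estimated size in bytes from plan statistics

println(s"rows = $rowCount, estimated bytes = $estimatedBytes")

// Spark broadcasts a join side automatically when this estimate falls below
// spark.sql.autoBroadcastJoinThreshold (10 MB by default).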

add columns in dataframes dynamically with column names as elements in List

青春壹個敷衍的年華 submitted on 2021-02-08 08:06:42
Question: I have a List[N] like the one below, where N can be any number of elements:

val check = List("a", "b", "c", "d")

I have a dataframe with only one column, called "value". Based on the contents of value I need to create N columns, with the column names taken from the elements of the list and the column contents as substring(x, y). I have tried all possible ways, like withColumn and selectExpr; nothing works. Please consider substring(X, Y), where X and Y are some numbers based on some metadata. Below are my different codes which I tried,
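The asker's attempts are cut off above. As a sketch of one common pattern (not the asker's code): fold over the list with withColumn, pulling each column's substring bounds from a lookup. The positions map and the substring offsets below are invented placeholders for whatever the real metadata provides, and a SparkSession named spark is assumed.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.substring

val check = List("a", "b", "c", "d")

// Hypothetical metadata: column name -> (substring start, length).
val positions: Map[String, (Int, Int)] =
  Map("a" -> (1, 2), "b" -> (3, 2), "c" -> (5, 2), "d" -> (7, 2))

import spark.implicits._
val df = Seq("12345678", "abcdefgh").toDF("value")

// Fold over the list, adding one column per element of `check`.
val result: DataFrame = check.foldLeft(df) { (acc, name) =>
  val (start, len) = positions(name)
  acc.withColumn(name, substring($"value", start, len))
}

result.show()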

How to read many tables from the same database and save them to their own CSV file?

人盡茶涼 submitted on 2021-02-08 08:01:32
Question: Below is working code to connect to a SQL Server and save one table to a CSV file.

conf = new SparkConf().setAppName("test").setMaster("local").set("spark.driver.allowMultipleContexts", "true");
sc = new SparkContext(conf)
sqlContext = new SQLContext(sc)
df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:sqlserver://DBServer:PORT")
  .option("databaseName", "xxx")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("dbtable", "xxx")
  .option("user", "xxx")
  .option(
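One possible extension of that snippet to many tables (a sketch, not from the question): loop over a list of table names and write each result to its own directory. The table names, output path, and credentials below are placeholders, and a SparkSession named spark is used in place of the older SQLContext.

val jdbcUrl = "jdbc:sqlserver://DBServer:PORT"
val tables = Seq("table1", "table2", "table3")   // hypothetical table names

tables.foreach { table =>
  val df = spark.read.format("jdbc")
    .option("url", jdbcUrl)
    .option("databaseName", "xxx")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table)
    .option("user", "xxx")
    .option("password", "xxx")
    .load()

  // One CSV directory per table, named after the table.
  df.write.option("header", "true").mode("overwrite").csv(s"/output/$table")
}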

generating join condition dynamically in spark/scala

*爱你&永不变心* submitted on 2021-02-08 07:56:37
Question: I want to be able to pass the join condition for two data frames as an input string. The idea is to make the join generic enough that the user can pass in any condition they like. Here's how I am doing it right now. Although it works, I think it's not clean.

val testInput = Array("a=b", "c=d")
val condition: Column = testInput.map(x => testMethod(x)).reduce((a, b) => a.and(b))
firstDataFrame.join(secondDataFrame, condition, "fullouter")

Here's the testMethod:

def testMethod(inputString:
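The definition of testMethod is cut off above. A guess at what such a helper might look like (a sketch only, not the asker's actual code): split each "column=column" string and turn it into an equality between a column of each frame. firstDataFrame and secondDataFrame are the two frames from the snippet above.

import org.apache.spark.sql.Column

// Hypothetical completion: "a=b" becomes firstDataFrame("a") === secondDataFrame("b").
def testMethod(inputString: String): Column = {
  val Array(left, right) = inputString.split("=")
  firstDataFrame(left) === secondDataFrame(right)
}

val testInput = Array("a=b", "c=d")
val condition: Column = testInput.map(testMethod).reduce(_ and _)
firstDataFrame.join(secondDataFrame, condition, "fullouter")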

Change Decimal Precision of all 'Double type' Columns in a Spark Dataframe

最后都变了- submitted on 2021-02-08 07:24:35
Question: I have a Spark DataFrame, let's say 'df'. I do the following simple aggregation on this DataFrame:

df.groupBy().sum()

Upon doing so, I get the following exception:

java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38

Is there any way I can fix this? My guess is that if I can decrease the decimal precision of all the columns of double type in df, it would solve the problem.

Source: https://stackoverflow.com/questions/46462377/change-decimal-precision
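No answer is recorded here, but the asker's own idea can be sketched: walk the schema and cast every DoubleType or DecimalType column down to a smaller precision before aggregating. The target DecimalType(18, 6) below is an arbitrary illustration, not a recommendation, and df is the DataFrame from the question.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, DoubleType}

// Cast every double/decimal column to a lower-precision decimal.
val lowered = df.schema.fields.foldLeft(df) { (acc, field) =>
  field.dataType match {
    case _: DecimalType => acc.withColumn(field.name, col(field.name).cast(DecimalType(18, 6)))
    case DoubleType     => acc.withColumn(field.name, col(field.name).cast(DecimalType(18, 6)))
    case _              => acc
  }
}

// The aggregation that previously overflowed precision 38.
lowered.groupBy().sum()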
