apache-spark-sql

How to compute the numerical difference between columns of different dataframes?

你。 submitted on 2021-02-08 09:16:09
Question: Given two Spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between the two dataframes and store it in another dataframe (or optionally in another data structure). For instance, take the following datasets.

DataFrame A:
+----+---+
|  A | B |
+----+---+
|   1|  0|
|   1|  0|
+----+---+

DataFrame B:
+----+---+
|  A | B |
+----+---+
|   1|  0|
|   0|  0|
+----+---+

How do I obtain B-A, i.e.
+----+---+
| c1 | c2|
+----+---+
|   0|  0|
|  -1|  0|
+----+---+
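One possible approach, sketched here only as an illustration (it is not part of the question): since DataFrames have no guaranteed row order, attach a positional index to each frame with zipWithIndex, join on that index, and subtract the matching columns. The names dfA, dfB and row_idx are invented for the example, and a SparkSession named spark is assumed, as in spark-shell.

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach a positional index so rows of the two frames can be matched up.
def withRowIndex(df: DataFrame): DataFrame = {
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  val schema = StructType(df.schema.fields :+ StructField("row_idx", LongType, nullable = false))
  df.sparkSession.createDataFrame(indexed, schema)
}

import spark.implicits._
val dfA = Seq((1, 0), (1, 0)).toDF("A", "B")   // DataFrame A from the question
val dfB = Seq((1, 0), (0, 0)).toDF("A", "B")   // DataFrame B from the question

val a = withRowIndex(dfA)
val b = withRowIndex(dfB)

// Join on the index, then compute B - A column by column.
val diff = a.join(b, "row_idx")
  .select(dfA.columns.map(c => (b(c) - a(c)).alias(c)): _*)

diff.show()   // expected rows: (0, 0) and (-1, 0)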

How to get the size of a data frame before doing the broadcast join in pyspark

时光总嘲笑我的痴心妄想 submitted on 2021-02-08 09:14:02
Question: I am new to Spark. I want to do a broadcast join, and before that I am trying to get the size of the data frame that I want to broadcast. Is there any way to find the size of a data frame? I am using Python as my programming language for Spark. Any help is much appreciated.

Answer 1: If you are looking for the size in bytes as well as the size in row count, follow this.

Alternative 1:
// ### Alternative -1
/**
 * file content
 * spark-test-data.json
 * --------------------
 * {"id":1,"name":"abc1"}
 * {"id":2,"name":
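The answer above is cut off. For completeness, here is a hedged sketch in Scala of one commonly cited way to get both numbers. It assumes Spark 2.4 or later, where LogicalPlan.stats takes no arguments (the statistics API has changed across versions), so treat the byte figure as an estimate; the file name spark-test-data.json is taken from the comment block in the answer.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-size").master("local[*]").getOrCreate()

// Read the sample file mentioned in the answer's comment block.
val df = spark.read.json("spark-test-data.json")

val rowCount = df.count()                                              // size in rows
val estimatedBytes = df.queryExecution.optimizedPlan.stats.sizeInBytes // estimated size in bytes from plan statistics

println(s"rows = $rowCount, estimated bytes = $estimatedBytes")

// Spark broadcasts a join side automatically when this estimate falls below
// spark.sql.autoBroadcastJoinThreshold (10 MB by default).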

add columns in dataframes dynamically with column names as elements in List

青春壹個敷衍的年華 submitted on 2021-02-08 08:06:42
Question: I have a List[N] like the one below, where N can be any number of elements:

val check = List("a", "b", "c", "d")

I have a dataframe with only one column, called "value". Based on the contents of value I need to create N columns, with the column names taken from the elements of the list and the column contents as substring(x, y). I have tried all possible ways, like withColumn and selectExpr; nothing works. Please consider substring(X, Y), where X and Y are some numbers based on some metadata. Below are my different codes which I tried,
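The asker's attempts are cut off above. As a sketch of one common pattern (not the asker's code): fold over the list with withColumn, pulling each column's substring bounds from a lookup. The positions map and the substring offsets below are invented placeholders for whatever the real metadata provides, and a SparkSession named spark is assumed.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.substring

val check = List("a", "b", "c", "d")

// Hypothetical metadata: column name -> (substring start, length).
val positions: Map[String, (Int, Int)] =
  Map("a" -> (1, 2), "b" -> (3, 2), "c" -> (5, 2), "d" -> (7, 2))

import spark.implicits._
val df = Seq("12345678", "abcdefgh").toDF("value")

// Fold over the list, adding one column per element of `check`.
val result: DataFrame = check.foldLeft(df) { (acc, name) =>
  val (start, len) = positions(name)
  acc.withColumn(name, substring($"value", start, len))
}

result.show()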

How to read many tables from the same database and save them to their own CSV file?

人盡茶涼 submitted on 2021-02-08 08:01:32
Question: Below is working code to connect to a SQL Server and save one table to a CSV file.

conf = new SparkConf().setAppName("test").setMaster("local").set("spark.driver.allowMultipleContexts", "true");
sc = new SparkContext(conf)
sqlContext = new SQLContext(sc)
df = sqlContext.read.format("jdbc")
  .option("url", "jdbc:sqlserver://DBServer:PORT")
  .option("databaseName", "xxx")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("dbtable", "xxx")
  .option("user", "xxx")
  .option(
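One possible extension of that snippet to many tables (a sketch, not from the question): loop over a list of table names and write each result to its own directory. The table names, output path, and credentials below are placeholders, and a SparkSession named spark is used in place of the older SQLContext.

val jdbcUrl = "jdbc:sqlserver://DBServer:PORT"
val tables = Seq("table1", "table2", "table3")   // hypothetical table names

tables.foreach { table =>
  val df = spark.read.format("jdbc")
    .option("url", jdbcUrl)
    .option("databaseName", "xxx")
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("dbtable", table)
    .option("user", "xxx")
    .option("password", "xxx")
    .load()

  // One CSV directory per table, named after the table.
  df.write.option("header", "true").mode("overwrite").csv(s"/output/$table")
}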

generating join condition dynamically in spark/scala

*爱你&永不变心* submitted on 2021-02-08 07:56:37
Question: I want to be able to pass the join condition for two data frames as an input string. The idea is to make the join generic enough that the user can pass in any condition they like. Here's how I am doing it right now. Although it works, I think it's not clean.

val testInput = Array("a=b", "c=d")
val condition: Column = testInput.map(x => testMethod(x)).reduce((a, b) => a.and(b))
firstDataFrame.join(secondDataFrame, condition, "fullouter")

Here's the testMethod:

def testMethod(inputString:
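The definition of testMethod is cut off above. A guess at what such a helper might look like (a sketch only, not the asker's actual code): split each "column=column" string and turn it into an equality between a column of each frame. firstDataFrame and secondDataFrame are the two frames from the snippet above.

import org.apache.spark.sql.Column

// Hypothetical completion: "a=b" becomes firstDataFrame("a") === secondDataFrame("b").
def testMethod(inputString: String): Column = {
  val Array(left, right) = inputString.split("=")
  firstDataFrame(left) === secondDataFrame(right)
}

val testInput = Array("a=b", "c=d")
val condition: Column = testInput.map(testMethod).reduce(_ and _)
firstDataFrame.join(secondDataFrame, condition, "fullouter")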

Change Decimal Precision of all 'Double type' Columns in a Spark Dataframe

最后都变了- submitted on 2021-02-08 07:24:35
Question: I have a Spark DataFrame, let's say 'df'. I do the following simple aggregation on this DataFrame:

df.groupBy().sum()

Upon doing so, I get the following exception:

java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38

Is there any way I can fix this? My guess is that if I can decrease the decimal precision of all the columns of double type in df, it would solve the problem.

Source: https://stackoverflow.com/questions/46462377/change-decimal-precision
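No answer is recorded here, but the asker's own idea can be sketched: walk the schema and cast every DoubleType or DecimalType column down to a smaller precision before aggregating. The target DecimalType(18, 6) below is an arbitrary illustration, not a recommendation, and df is the DataFrame from the question.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DecimalType, DoubleType}

// Cast every double/decimal column to a lower-precision decimal.
val lowered = df.schema.fields.foldLeft(df) { (acc, field) =>
  field.dataType match {
    case _: DecimalType => acc.withColumn(field.name, col(field.name).cast(DecimalType(18, 6)))
    case DoubleType     => acc.withColumn(field.name, col(field.name).cast(DecimalType(18, 6)))
    case _              => acc
  }
}

// The aggregation that previously overflowed precision 38.
lowered.groupBy().sum()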
