spark-dataframe

How to change case of whole pyspark dataframe to lower or upper

Submitted by 和自甴很熟 on 2019-12-13 14:18:21
Question: I am trying to apply the PySpark SQL hash function to every row of two dataframes to identify differences. The hash is case sensitive, i.e. if a column contains 'APPLE' and 'Apple' they are treated as two different values, so I want to change the case of both dataframes to either upper or lower. I can do this for the dataframe headers but not for the dataframe values. Please help.

#Code for Dataframe column headers
self.df_db1 = self.df_db1.toDF(*[c.lower() for c in self.df
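A minimal sketch of the usual approach, shown here in Scala to match the rest of this page (the PySpark version follows the same pattern with pyspark.sql.functions.lower): wrap every column in lower() inside a single select. The dataframe name df is hypothetical and all columns are assumed to be strings.

```scala
import org.apache.spark.sql.functions.{col, lower}

// lower-case every column value in one pass; non-string columns would need a cast first
val lowered = df.select(df.columns.map(c => lower(col(c)).alias(c)): _*)
```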

How to generate a DataFrame with random content and N rows?

Submitted by 吃可爱长大的小学妹 on 2019-12-13 12:32:17
Question: How can I create a Spark DataFrame in Scala with 100 rows and 3 columns that have random integer values in the range (1, 100)? I know how to create a DataFrame manually, but I cannot automate it:

val df = sc.parallelize(Seq((1, 20, 40), (60, 10, 80), (30, 15, 30))).toDF("col1", "col2", "col3")

Answer 1: Here you go, Seq.fill is your friend:

def randomInt1to100 = scala.util.Random.nextInt(100) + 1
val df = sc.parallelize( Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)} ).toDF("col1",
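The answer above is cut off; a self-contained sketch of the same Seq.fill idea (the SparkSession setup and app name are assumptions) would look like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("random-df").getOrCreate()
import spark.implicits._

// a def, not a val, so every call draws a fresh value in the range 1..100
def randomInt1to100 = scala.util.Random.nextInt(100) + 1

// Seq.fill evaluates its argument once per element, giving 100 independent random rows
val df = Seq.fill(100)((randomInt1to100, randomInt1to100, randomInt1to100))
  .toDF("col1", "col2", "col3")

df.show(5)
```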

Change column value in a dataframe spark scala

Submitted by 元气小坏坏 on 2019-12-13 10:33:51
Question: This is how my dataframe looks at the moment:

+------------+
|       DATE |
+------------+
|   19931001 |
|   19930404 |
|   19930603 |
|   19930805 |
+------------+

I am trying to reformat this string value to yyyy-mm-dd hh:mm:ss.fff and keep it as a string, not a date type or a timestamp. How would I do that using the withColumn method?

Answer 1: Here is a solution using a UDF and withColumn. I have assumed that you have a string date field in the Dataframe.

//Create dfList dataframe
val dfList = spark
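The answer's UDF is truncated above. A shorter sketch that reaches the same result with built-in functions (the dataframe name df is hypothetical; the DATE column and formats come from the question):

```scala
import org.apache.spark.sql.functions.{col, date_format, to_date}

// parse the yyyyMMdd string, then render it back as a string in the target layout;
// date_format returns a string column, so no date or timestamp type is kept
val reformatted = df.withColumn(
  "DATE",
  date_format(to_date(col("DATE"), "yyyyMMdd"), "yyyy-MM-dd HH:mm:ss.SSS")
)
```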

Apache Spark : how to insert data in a column with empty values in dataFrame using Java

Submitted by £可爱£侵袭症+ on 2019-12-13 09:06:31
Question: I have to insert values available in DataFrame1 into a column of DataFrame2 that has empty values, basically updating a column in DataFrame2. Both DataFrames have 2 common columns. Is there a way to do this using Java? Or is there a different approach? Sample input:

1) File1.csv
BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR,VERSION,PRIM_SW
0501841898,BIN ,404154,1000,Y
0681220958,BIN ,735332,1000,Y
5992410180,BIN ,454680,1000,Y
6995270884,SREBIN ,1000252750295575,1000,Y

Here BILL_ID is system
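A common way to fill one DataFrame's empty column from another is a left join on the shared key followed by coalesce. The sketch below is in Scala (the Java Dataset API exposes the same methods); the choice of BILL_ID as the join key and BILL_NBR as the column to fill is an assumption based on the truncated sample input.

```scala
import org.apache.spark.sql.functions.{col, coalesce}

// bring the source value in under a temporary name, then prefer the existing value when present
val lookup  = df1.select(col("BILL_ID"), col("BILL_NBR").as("BILL_NBR_SRC"))

val updated = df2.join(lookup, Seq("BILL_ID"), "left")
  .withColumn("BILL_NBR", coalesce(col("BILL_NBR"), col("BILL_NBR_SRC")))
  .drop("BILL_NBR_SRC")
```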

How to call a method based on a JSON object in Scala/Spark?

Submitted by 倖福魔咒の on 2019-12-13 09:06:19
Question: I have two functions like the ones below:

def method1(ip: String, r: Double, op: String) = {
  val data = spark.read.option("header", true).csv(ip).toDF()
  val r3 = data.select("c", "S").dropDuplicates("C", "S").withColumn("R", lit(r))
  r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
}

def method2(ip: String, op: String) = {
  val data = spark.read.option("header", true).csv(ip).toDF()
  val r3 = data.select("c", "S").dropDuplicates("C", "StockCode")
  r3.coalesce(1).write.format("com
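The question title asks how to pick one of these methods based on a JSON object. Below is a minimal sketch of one way to do that with json4s (which ships with Spark); the field names "method", "ip", "r" and "op", and the inline JSON string, are assumptions for illustration.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

implicit val formats: Formats = DefaultFormats

// hypothetical config; in practice this JSON would come from a file or a message
val config = parse("""{"method": "method1", "ip": "input.csv", "r": 0.5, "op": "out"}""")

// dispatch to the matching function based on the "method" field
(config \ "method").extract[String] match {
  case "method1" => method1((config \ "ip").extract[String], (config \ "r").extract[Double], (config \ "op").extract[String])
  case "method2" => method2((config \ "ip").extract[String], (config \ "op").extract[String])
  case other     => sys.error(s"unknown method: $other")
}
```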

Spark: Process multiline input blob

Submitted by 徘徊边缘 on 2019-12-13 08:10:31
Question: I'm new to Hadoop/Spark and I am trying to process a multi-line input blob into a CSV or tab-delimited format for further processing. Example input:

------------------------------------------------------------------------
AAA=someValueAAA1
BBB=someValueBBB1
CCC=someValueCCC1
DDD=someValueDDD1
EEE=someValueEEE1
FFF=someValueFFF1
ENDOFRECORD
------------------------------------------------------------------------
AAA=someValueAAA2
BBB=someValueBBB2
CCC=someValueCCC2
DDD=someValueDDD2
EEE
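One common approach is to make Hadoop's text reader split on the ENDOFRECORD marker instead of on newlines, so each blob arrives as a single RDD element. A minimal sketch under that assumption (the input path and the fixed field order are hypothetical):

```scala
// read whole blocks by overriding the record delimiter used by TextInputFormat
spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "ENDOFRECORD")

val records = spark.sparkContext.textFile("hdfs:///data/blobs.txt")   // hypothetical path
  .map { block =>
    block.split("\n")
      .map(_.trim)
      .filter(_.contains("="))                              // drops the dashed separator lines
      .map { kv => val Array(k, v) = kv.split("=", 2); k -> v }
      .toMap
  }

// emit tab-delimited lines in a fixed field order
val fields = Seq("AAA", "BBB", "CCC", "DDD", "EEE", "FFF")
val tsv = records.map(m => fields.map(f => m.getOrElse(f, "")).mkString("\t"))
```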

How to output multiple (key, value) pairs in a Spark map function

Submitted by 佐手、 on 2019-12-13 08:05:02
Question: The format of the input data looks like this:

+--------------------+-------------+--------------------+
| StudentID          | Right       | Wrong              |
+--------------------+-------------+--------------------+
| studentNo01        | a,b,c       | x,y,z              |
+--------------------+-------------+--------------------+
| studentNo02        | c,d         | v,w                |
+--------------------+-------------+--------------------+

And the format of the output looks like this:

+--------------------+---------+
| key                | value   |
+--------------------+---------+
| studentNo01,a      |
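The usual way to emit several (key, value) pairs per input row is flatMap rather than map. The sample output above is truncated, so the value written for each pair ("right"/"wrong") is a guess; column names follow the input table, and df and spark are the question's hypothetical dataframe and session.

```scala
import spark.implicits._

case class KV(key: String, value: String)

// one input row fans out into one pair per comma-separated answer
val pairs = df.flatMap { row =>
  val id     = row.getAs[String]("StudentID")
  val rights = row.getAs[String]("Right").split(",").map(a => KV(s"$id,$a", "right"))
  val wrongs = row.getAs[String]("Wrong").split(",").map(a => KV(s"$id,$a", "wrong"))
  rights ++ wrongs
}
```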

Feasibility of Hive to Netezza data export using spark

Submitted by 北城以北 on 2019-12-13 07:08:20
Question: This is about a use case my team is working on: exporting metadata and data from a Hive server to an RDBMS. Export to MySQL and Oracle works fine, but export to Netezza fails with this error message:

17/02/09 16:03:07 INFO DAGScheduler: Job 1 finished: json at RdbmsSandboxExecution.java:80, took 0.433405 s
17/02/09 16:03:07 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 143 ms on localhost (1/1)
17/02/09 16:03:07 INFO TaskSchedulerImpl
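For reference, a minimal sketch of the JDBC write such an export typically performs, independent of the truncated error above; the Netezza URL, driver class, table and credentials are placeholders, not values from the question:

```scala
// write a Hive table to Netezza over JDBC (connection details are hypothetical)
val hiveDf = spark.table("mydb.source_table")

val props = new java.util.Properties()
props.setProperty("user", "nz_user")
props.setProperty("password", "nz_password")
props.setProperty("driver", "org.netezza.Driver")

hiveDf.write
  .mode("append")
  .jdbc("jdbc:netezza://netezza-host:5480/TARGETDB", "TARGET_TABLE", props)
```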

ExceptionInInitializer Error while Reading Data from teradata table using Spark

Submitted by 守給你的承諾、 on 2019-12-13 03:45:47
Question: I am using the code below to read data from Teradata but I am getting an error:

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", s"jdbc:teradata://${TeradataDBHost}/database=${TeradataDBDatabase}")
  .option("dbtable", TeradataDBDatabase + "." + TeradataDBTable)
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .option("user", TeradataDBUsername)
  .option("password", TeradataDBPassword)
  .load()

Error Stack Trace:

Exception in thread "main" java.lang.ExceptionInInitializerError
  at com.teradata.jdbc.jdbc
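One possible cause, offered here only as an assumption since the stack trace is truncated: older Teradata JDBC drivers ship as two jars (terajdbc4.jar plus tdgssconfig.jar), and the driver can fail to initialize if only one of them is on the classpath. A sketch of shipping both jars when building the session (paths are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// make both Teradata driver jars available to the driver and the executors
val spark = SparkSession.builder()
  .appName("teradata-read")
  .config("spark.jars", "/path/to/terajdbc4.jar,/path/to/tdgssconfig.jar")
  .getOrCreate()
```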

Spark 2 iterating over a partition to create a new partition

Submitted by 家住魔仙堡 on 2019-12-13 03:42:40
Question: I have been scratching my head trying to come up with a way to reduce a dataframe in Spark to a frame that records gaps in the dataframe, preferably without completely killing parallelism. Here is a much-simplified example (it's a bit lengthy because I wanted it to be able to run):

import org.apache.spark.sql.SparkSession

case class Record(typ: String, start: Int, end: Int);

object Sample {
  def main(argv: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .master("local")
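The example above is cut off, but one way to derive gaps without collecting everything to the driver is a window partitioned by typ: each type's records are ordered and compared with their predecessor independently, so parallelism across types is preserved. Column names follow the Record case class; the dataframe df and the definition of a gap (start strictly after the previous end) are assumptions:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

// a gap exists where a record starts after the previous record of the same typ ended
val w = Window.partitionBy("typ").orderBy("start")

val gaps = df
  .withColumn("prevEnd", lag(col("end"), 1).over(w))
  .where(col("prevEnd").isNotNull && col("start") > col("prevEnd"))
  .select(col("typ"), col("prevEnd").as("gapStart"), col("start").as("gapEnd"))
```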