spark-dataframe

How to change case of whole pyspark dataframe to lower or upper

Submitted by 和自甴很熟 on 2019-12-13 14:18:21
Question: I am trying to apply the PySpark SQL hash function to every row of two dataframes to identify differences. The hash is case sensitive, i.e. if a column contains 'APPLE' and 'Apple' they are treated as two different values, so I want to change the case of both dataframes to either upper or lower. I can do this for the dataframe headers but not for the dataframe values. Please help.

#Code for Dataframe column headers
self.df_db1 = self.df_db1.toDF(*[c.lower() for c in self.df
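A minimal sketch of the usual approach, shown here in Scala to match the rest of this page (the PySpark version follows the same pattern with pyspark.sql.functions.lower): wrap every column in lower() inside a single select. The dataframe name df is hypothetical and all columns are assumed to be strings.

```scala
import org.apache.spark.sql.functions.{col, lower}

// lower-case every column value in one pass; non-string columns would need a cast first
val lowered = df.select(df.columns.map(c => lower(col(c)).alias(c)): _*)
```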

How to generate a DataFrame with random content and N rows?

Submitted by 吃可爱长大的小学妹 on 2019-12-13 12:32:17
Question: How can I create a Spark DataFrame in Scala with 100 rows and 3 columns that have random integer values in the range (1, 100)? I know how to create a DataFrame manually, but I cannot automate it:

val df = sc.parallelize(Seq((1, 20, 40), (60, 10, 80), (30, 15, 30))).toDF("col1", "col2", "col3")

Answer 1: Here you go, Seq.fill is your friend:

def randomInt1to100 = scala.util.Random.nextInt(100) + 1
val df = sc.parallelize( Seq.fill(100){(randomInt1to100,randomInt1to100,randomInt1to100)} ).toDF("col1",
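The answer above is cut off; a self-contained sketch of the same Seq.fill idea (the SparkSession setup and app name are assumptions) would look like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("random-df").getOrCreate()
import spark.implicits._

// a def, not a val, so every call draws a fresh value in the range 1..100
def randomInt1to100 = scala.util.Random.nextInt(100) + 1

// Seq.fill evaluates its argument once per element, giving 100 independent random rows
val df = Seq.fill(100)((randomInt1to100, randomInt1to100, randomInt1to100))
  .toDF("col1", "col2", "col3")

df.show(5)
```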

Change column value in a dataframe spark scala

Submitted by 元气小坏坏 on 2019-12-13 10:33:51
Question: This is how my dataframe looks at the moment:

+------------+
|       DATE |
+------------+
|   19931001 |
|   19930404 |
|   19930603 |
|   19930805 |
+------------+

I am trying to reformat this string value to yyyy-mm-dd hh:mm:ss.fff and keep it as a string, not a date type or a timestamp. How would I do that using the withColumn method?

Answer 1: Here is a solution using a UDF and withColumn. I have assumed that you have a string date field in the Dataframe.

//Create dfList dataframe
val dfList = spark
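The answer's UDF is truncated above. A shorter sketch that reaches the same result with built-in functions (the dataframe name df is hypothetical; the DATE column and formats come from the question):

```scala
import org.apache.spark.sql.functions.{col, date_format, to_date}

// parse the yyyyMMdd string, then render it back as a string in the target layout;
// date_format returns a string column, so no date or timestamp type is kept
val reformatted = df.withColumn(
  "DATE",
  date_format(to_date(col("DATE"), "yyyyMMdd"), "yyyy-MM-dd HH:mm:ss.SSS")
)
```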

Apache Spark : how to insert data in a column with empty values in dataFrame using Java

Submitted by £可爱£侵袭症+ on 2019-12-13 09:06:31
Question: I have to insert values available in DataFrame1 into a column of DataFrame2 that has empty values, basically updating a column in DataFrame2. Both DataFrames have 2 common columns. Is there a way to do this using Java? Or is there a different approach? Sample input:

1) File1.csv
BILL_ID,BILL_NBR_TYPE_CD,BILL_NBR,VERSION,PRIM_SW
0501841898,BIN ,404154,1000,Y
0681220958,BIN ,735332,1000,Y
5992410180,BIN ,454680,1000,Y
6995270884,SREBIN ,1000252750295575,1000,Y

Here BILL_ID is system
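A common way to fill one DataFrame's empty column from another is a left join on the shared key followed by coalesce. The sketch below is in Scala (the Java Dataset API exposes the same methods); the choice of BILL_ID as the join key and BILL_NBR as the column to fill is an assumption based on the truncated sample input.

```scala
import org.apache.spark.sql.functions.{col, coalesce}

// bring the source value in under a temporary name, then prefer the existing value when present
val lookup  = df1.select(col("BILL_ID"), col("BILL_NBR").as("BILL_NBR_SRC"))

val updated = df2.join(lookup, Seq("BILL_ID"), "left")
  .withColumn("BILL_NBR", coalesce(col("BILL_NBR"), col("BILL_NBR_SRC")))
  .drop("BILL_NBR_SRC")
```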

How to call a method based on a JSON object in Scala/Spark?

Submitted by 倖福魔咒の on 2019-12-13 09:06:19
Question: I have two functions like the ones below:

def method1(ip: String, r: Double, op: String) = {
  val data = spark.read.option("header", true).csv(ip).toDF()
  val r3 = data.select("c", "S").dropDuplicates("C", "S").withColumn("R", lit(r))
  r3.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").save(op)
}

def method2(ip: String, op: String) = {
  val data = spark.read.option("header", true).csv(ip).toDF()
  val r3 = data.select("c", "S").dropDuplicates("C", "StockCode")
  r3.coalesce(1).write.format("com
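The question title asks how to pick one of these methods based on a JSON object. Below is a minimal sketch of one way to do that with json4s (which ships with Spark); the field names "method", "ip", "r" and "op", and the inline JSON string, are assumptions for illustration.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

implicit val formats: Formats = DefaultFormats

// hypothetical config; in practice this JSON would come from a file or a message
val config = parse("""{"method": "method1", "ip": "input.csv", "r": 0.5, "op": "out"}""")

// dispatch to the matching function based on the "method" field
(config \ "method").extract[String] match {
  case "method1" => method1((config \ "ip").extract[String], (config \ "r").extract[Double], (config \ "op").extract[String])
  case "method2" => method2((config \ "ip").extract[String], (config \ "op").extract[String])
  case other     => sys.error(s"unknown method: $other")
}
```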

Spark: Process multiline input blob

Submitted by 徘徊边缘 on 2019-12-13 08:10:31
Question: I'm new to Hadoop/Spark and I am trying to process a multi-line input blob into a CSV or tab-delimited format for further processing. Example input:

------------------------------------------------------------------------
AAA=someValueAAA1
BBB=someValueBBB1
CCC=someValueCCC1
DDD=someValueDDD1
EEE=someValueEEE1
FFF=someValueFFF1
ENDOFRECORD
------------------------------------------------------------------------
AAA=someValueAAA2
BBB=someValueBBB2
CCC=someValueCCC2
DDD=someValueDDD2
EEE
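One common approach is to make Hadoop's text reader split on the ENDOFRECORD marker instead of on newlines, so each blob arrives as a single RDD element. A minimal sketch under that assumption (the input path and the fixed field order are hypothetical):

```scala
// read whole blocks by overriding the record delimiter used by TextInputFormat
spark.sparkContext.hadoopConfiguration.set("textinputformat.record.delimiter", "ENDOFRECORD")

val records = spark.sparkContext.textFile("hdfs:///data/blobs.txt")   // hypothetical path
  .map { block =>
    block.split("\n")
      .map(_.trim)
      .filter(_.contains("="))                              // drops the dashed separator lines
      .map { kv => val Array(k, v) = kv.split("=", 2); k -> v }
      .toMap
  }

// emit tab-delimited lines in a fixed field order
val fields = Seq("AAA", "BBB", "CCC", "DDD", "EEE", "FFF")
val tsv = records.map(m => fields.map(f => m.getOrElse(f, "")).mkString("\t"))
```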

How to output multiple (key, value) pairs in a Spark map function

Submitted by 佐手、 on 2019-12-13 08:05:02
Question: The format of the input data looks like this:

+--------------------+-------------+--------------------+
| StudentID          | Right       | Wrong              |
+--------------------+-------------+--------------------+
| studentNo01        | a,b,c       | x,y,z              |
+--------------------+-------------+--------------------+
| studentNo02        | c,d         | v,w                |
+--------------------+-------------+--------------------+

And the format of the output looks like this:

+--------------------+---------+
| key                | value   |
+--------------------+---------+
| studentNo01,a      |
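The usual way to emit several (key, value) pairs per input row is flatMap rather than map. The sample output above is truncated, so the value written for each pair ("right"/"wrong") is a guess; column names follow the input table, and df and spark are the question's hypothetical dataframe and session.

```scala
import spark.implicits._

case class KV(key: String, value: String)

// one input row fans out into one pair per comma-separated answer
val pairs = df.flatMap { row =>
  val id     = row.getAs[String]("StudentID")
  val rights = row.getAs[String]("Right").split(",").map(a => KV(s"$id,$a", "right"))
  val wrongs = row.getAs[String]("Wrong").split(",").map(a => KV(s"$id,$a", "wrong"))
  rights ++ wrongs
}
```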

Feasibility of Hive to Netezza data export using spark

Submitted by 北城以北 on 2019-12-13 07:08:20
Question: This is about a use case my team is working on: exporting metadata and data from a Hive server to an RDBMS. Export to MySQL and Oracle works fine, but export to Netezza fails with this error message:

17/02/09 16:03:07 INFO DAGScheduler: Job 1 finished: json at RdbmsSandboxExecution.java:80, took 0.433405 s
17/02/09 16:03:07 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 143 ms on localhost (1/1)
17/02/09 16:03:07 INFO TaskSchedulerImpl
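For reference, a minimal sketch of the JDBC write such an export typically performs, independent of the truncated error above; the Netezza URL, driver class, table and credentials are placeholders, not values from the question:

```scala
// write a Hive table to Netezza over JDBC (connection details are hypothetical)
val hiveDf = spark.table("mydb.source_table")

val props = new java.util.Properties()
props.setProperty("user", "nz_user")
props.setProperty("password", "nz_password")
props.setProperty("driver", "org.netezza.Driver")

hiveDf.write
  .mode("append")
  .jdbc("jdbc:netezza://netezza-host:5480/TARGETDB", "TARGET_TABLE", props)
```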

ExceptionInInitializer Error while Reading Data from teradata table using Spark

Submitted by 守給你的承諾、 on 2019-12-13 03:45:47
Question: I am using the code below to read data from Teradata but I am getting an error:

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", s"jdbc:teradata://${TeradataDBHost}/database=${TeradataDBDatabase}")
  .option("dbtable", TeradataDBDatabase + "." + TeradataDBTable)
  .option("driver", "com.teradata.jdbc.TeraDriver")
  .option("user", TeradataDBUsername)
  .option("password", TeradataDBPassword)
  .load()

Error Stack Trace:

Exception in thread "main" java.lang.ExceptionInInitializerError
  at com.teradata.jdbc.jdbc
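One possible cause, offered here only as an assumption since the stack trace is truncated: older Teradata JDBC drivers ship as two jars (terajdbc4.jar plus tdgssconfig.jar), and the driver can fail to initialize if only one of them is on the classpath. A sketch of shipping both jars when building the session (paths are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

// make both Teradata driver jars available to the driver and the executors
val spark = SparkSession.builder()
  .appName("teradata-read")
  .config("spark.jars", "/path/to/terajdbc4.jar,/path/to/tdgssconfig.jar")
  .getOrCreate()
```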

Spark 2 iterating over a partition to create a new partition

Submitted by 家住魔仙堡 on 2019-12-13 03:42:40
Question: I have been scratching my head trying to come up with a way to reduce a dataframe in Spark to a frame that records gaps in the dataframe, preferably without completely killing parallelism. Here is a much-simplified example (it's a bit lengthy because I wanted it to be able to run):

import org.apache.spark.sql.SparkSession

case class Record(typ: String, start: Int, end: Int);

object Sample {
  def main(argv: Array[String]): Unit = {
    val sparkSession = SparkSession.builder()
      .master("local")
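The example above is cut off, but one way to derive gaps without collecting everything to the driver is a window partitioned by typ: each type's records are ordered and compared with their predecessor independently, so parallelism across types is preserved. Column names follow the Record case class; the dataframe df and the definition of a gap (start strictly after the previous end) are assumptions:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

// a gap exists where a record starts after the previous record of the same typ ended
val w = Window.partitionBy("typ").orderBy("start")

val gaps = df
  .withColumn("prevEnd", lag(col("end"), 1).over(w))
  .where(col("prevEnd").isNotNull && col("start") > col("prevEnd"))
  .select(col("typ"), col("prevEnd").as("gapStart"), col("start").as("gapEnd"))
```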