databricks

How to set jdbc/partitionColumn type to Date in spark 2.4.1

我的梦境 submitted on 2019-12-01 09:37:40
I am trying to retrieve data from Oracle using spark-sql 2.4.1. I tried to set the JDBC options as below:

.option("lowerBound", "31-MAR-02");
.option("upperBound", "01-MAY-19");
.option("partitionColumn", "data_date");
.option("numPartitions", 240);

But it gives the error:

java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
at java.sql.Timestamp.valueOf(Timestamp.java:204)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.toInternalBoundValue(JDBCRelation.scala:179)

Then I tried as below:

.option("lowerBound", "2002-03-31"); //changed the
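The stack trace shows java.sql.Timestamp.valueOf being applied to the bound strings, so for a date/timestamp partition column the bounds have to be in yyyy-MM-dd HH:mm:ss form rather than Oracle's DD-MON-YY. A minimal Scala sketch of that configuration, with a hypothetical Oracle URL, table, and credentials:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("oracle-date-partitions").getOrCreate()

// Hypothetical connection details; only the partitioning options matter here.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/SERVICE")
  .option("dbtable", "MY_SCHEMA.MY_TABLE")
  .option("user", "my_user")
  .option("password", "my_password")
  .option("partitionColumn", "data_date")
  // Oracle DATE arrives as a Spark timestamp by default, so the bounds need a time part.
  .option("lowerBound", "2002-03-31 00:00:00")
  .option("upperBound", "2019-05-01 00:00:00")
  .option("numPartitions", "240")
  .load()
```

If the driver is configured to map Oracle DATE columns to plain dates instead (for example via the Oracle JDBC property oracle.jdbc.mapDateToTimestamp=false), yyyy-MM-dd bounds should work as well.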

Specify multiple columns data type changes to different data types in pyspark

Deadly submitted on 2019-12-01 01:46:26
I have a DataFrame (df) with more than 50 columns of different data types, for example:

df3.printSchema()
|-- CtpJobId: string (nullable = true)
|-- TransformJobStateId: string (nullable = true)
|-- LastError: string (nullable = true)
|-- PriorityDate: string (nullable = true)
|-- QueuedTime: string (nullable = true)
|-- AccurateAsOf: string (nullable = true)
|-- SentToDevice: string (nullable = true)
|-- StartedAtDevice: string (nullable = true)
|-- ProcessStart: string (nullable = true)
|-- LastProgressAt: string (nullable = true)
|-- ProcessEnd: string (nullable = true)
|--
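One way to do this is to keep a column-to-type map and fold it over the DataFrame, casting one column per step. A Scala sketch of that pattern, assuming df3 from the question and an illustrative subset of target types (the PySpark version is the same chain of withColumn/cast calls):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical target types for a few of the ~50 columns; extend the map as needed.
val targetTypes = Map(
  "PriorityDate" -> "timestamp",
  "QueuedTime"   -> "timestamp",
  "ProcessStart" -> "timestamp",
  "ProcessEnd"   -> "timestamp",
  "LastError"    -> "string"
)

// Apply every cast in turn, leaving unlisted columns untouched.
def castColumns(df: DataFrame, types: Map[String, String]): DataFrame =
  types.foldLeft(df) { case (acc, (name, dataType)) =>
    acc.withColumn(name, col(name).cast(dataType))
  }

val df4 = castColumns(df3, targetTypes)   // df3 is the DataFrame from the question
df4.printSchema()
```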

PySpark - String matching to create new column

你离开我真会死。 submitted on 2019-11-30 18:30:38
I have a dataframe like:

ID    Notes
2345  Checked by John
2398  Verified by Stacy
3983  Double Checked on 2/23/17 by Marsha

Let's say, for example, there are only 3 employees who check: John, Stacy, or Marsha. I'd like to make a new column like so:

ID    Notes                                 Employee
2345  Checked by John                       John
2398  Verified by Stacy                     Stacy
3983  Double Checked on 2/23/17 by Marsha   Marsha

Is regex or grep better here? What kind of function should I try? Thanks!

EDIT: I've been trying a bunch of solutions, but nothing seems to work. Should I give up and instead create columns for each employee, with a binary value? IE:

ID
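A regular expression with an alternation over the known names is usually enough here; regexp_extract returns an empty string when no name matches. A Scala sketch of the idea (pyspark.sql.functions.regexp_extract behaves the same way), assuming a SparkSession named spark as in the shell:

```scala
import spark.implicits._                       // spark: the active SparkSession
import org.apache.spark.sql.functions.regexp_extract

val notes = Seq(
  (2345, "Checked by John"),
  (2398, "Verified by Stacy"),
  (3983, "Double Checked on 2/23/17 by Marsha")
).toDF("ID", "Notes")

// Capture whichever of the known employee names appears in the Notes text.
val withEmployee = notes.withColumn(
  "Employee",
  regexp_extract($"Notes", "(John|Stacy|Marsha)", 1)
)
withEmployee.show(false)
```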

Get the size/length of an array column

允我心安 submitted on 2019-11-30 16:49:59
Question: I'm new to Scala programming and this is my question: how do I count the number of strings in each row? My DataFrame is composed of a single column of Array[String] type.

friendsDF: org.apache.spark.sql.DataFrame = [friends: array<string>]

Answer 1: You can use the size function:

val df = Seq((Array("a","b","c"), 2), (Array("a"), 4)).toDF("friends", "id")
// df: org.apache.spark.sql.DataFrame = [friends: array<string>, id: int]

df.select(size($"friends").as("no_of_friends")).show
+-------------+
|no

Spark SQL get max & min dynamically from datasource

我们两清 submitted on 2019-11-30 09:29:55
Question: I am using Spark SQL and I want to fetch the whole data every day from an Oracle table (consisting of more than 1,800k records). The application hangs when I read from Oracle, hence I used the concept of partitionColumn, lowerBound & upperBound. But the problem is: how can I get the lowerBound & upperBound values of the primary key column dynamically? Every day the values of lowerBound & upperBound will change. How can I get the boundary values of the primary key column dynamically? Can anyone guide me

PySpark - String matching to create new column

浪子不回头ぞ submitted on 2019-11-30 02:45:55
Question: I have a dataframe like:

ID    Notes
2345  Checked by John
2398  Verified by Stacy
3983  Double Checked on 2/23/17 by Marsha

Let's say, for example, there are only 3 employees who check: John, Stacy, or Marsha. I'd like to make a new column like so:

ID    Notes                                 Employee
2345  Checked by John                       John
2398  Verified by Stacy                     Stacy
3983  Double Checked on 2/23/17 by Marsha   Marsha

Is regex or grep better here? What kind of function should I try? Thanks!

EDIT: I've been trying a bunch of solutions, but nothing

Spark SQL get max & min dynamically from datasource

回眸只為那壹抹淺笑 submitted on 2019-11-29 15:55:08
I am using Spark SQL and I want to fetch the whole data every day from an Oracle table (consisting of more than 1,800k records). The application hangs when I read from Oracle, hence I used the concept of partitionColumn, lowerBound & upperBound. But the problem is: how can I get the lowerBound & upperBound values of the primary key column dynamically? Every day the values of lowerBound & upperBound will change. How can I get the boundary values of the primary key column dynamically? Can anyone guide me with a sample example for my problem?

Just fetch the required values from the database:

url = ...
properties =
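One way to get the bounds without scanning the table in Spark is to push a MIN/MAX aggregate down to Oracle as a subquery, read that single row, and feed the results into the partitioned read. A hedged Scala sketch with a hypothetical URL, table, and numeric primary key id (Oracle NUMBER arrives as a Spark decimal):

```scala
import java.util.Properties

val url = "jdbc:oracle:thin:@//db-host:1521/SERVICE"   // hypothetical
val props = new Properties()
props.setProperty("user", "my_user")
props.setProperty("password", "my_password")

// A one-row query evaluated on the Oracle side; only the two aggregates come back.
val bounds = spark.read
  .jdbc(url, "(SELECT MIN(id) AS lo, MAX(id) AS hi FROM my_table) b", props)
  .collect()(0)
val (lower, upper) = (bounds.getDecimal(0).longValue, bounds.getDecimal(1).longValue)

// Use today's boundary values for the partitioned read.
val df = spark.read.jdbc(
  url,
  "my_table",
  columnName = "id",
  lowerBound = lower,
  upperBound = upper + 1,
  numPartitions = 20,
  connectionProperties = props
)
```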

Simplest method for text lemmatization in Scala and Spark

有些话、适合烂在心里 submitted on 2019-11-29 15:26:44
Question: I want to use lemmatization on a text file:

surprise heard thump opened door small seedy man clasping package wrapped. upgrading system found review spring 2008 issue moody audio backed. omg left gotta wrap review order asap . understand hand delivered dali lama speak hands wear earplugs lives . listen maintain link long . cables cables finally able hear gem long rumored music . ...

and the expected output is:

surprise heard thump open door small seed man clasp package wrap. upgrade system found
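One commonly used option is Stanford CoreNLP's lemma annotator, wrapped in a small helper and then applied inside the Spark job (typically once per partition, since the pipeline is expensive to build and not serializable). A Scala sketch, assuming the stanford-corenlp jar and its models artifact are on the classpath:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.ling.CoreAnnotations.{LemmaAnnotation, SentencesAnnotation, TokensAnnotation}
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

// Tokenize, split sentences, POS-tag, then look up the lemma of each token.
def lemmatize(text: String, pipeline: StanfordCoreNLP): Seq[String] = {
  val doc = new Annotation(text)
  pipeline.annotate(doc)
  for {
    sentence <- doc.get(classOf[SentencesAnnotation]).asScala
    token    <- sentence.get(classOf[TokensAnnotation]).asScala
  } yield token.get(classOf[LemmaAnnotation])
}

val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, pos, lemma")
val pipeline = new StanfordCoreNLP(props)

lemmatize("surprise heard thump opened door small seedy man clasping package wrapped.", pipeline)
// returns one lemma per token, e.g. "clasping" -> "clasp", "wrapped" -> "wrap"
```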

Spark dataframe save in single file on hdfs location [duplicate]

久未见 submitted on 2019-11-28 07:06:20
This question already has an answer here: How to save RDD data into json files, not folders (2 answers)

I have a dataframe and I want to save it in a single file on an HDFS location. I found the solution here: Write single CSV file using spark-csv

df.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("mydata.csv")

But all the data will be written to mydata.csv/part-00000 and I wanted it to be a mydata.csv file. Is that possible? Any help appreciated.

It's not possible using the standard Spark library, but you can use the Hadoop API for managing the filesystem: save the output in a temporary
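A hedged Scala sketch of that temporary-directory approach, using the Hadoop FileSystem API to move the single part file afterwards; the paths are hypothetical and df/spark are the ones from the question:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val tmpDir = "/tmp/mydata_tmp"        // hypothetical scratch directory
val target = "/user/me/mydata.csv"    // hypothetical final file path

// Write a single partition into the temporary directory first.
df.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save(tmpDir)

// Then rename the lone part file to the desired name and drop the directory.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val partFile = fs.globStatus(new Path(s"$tmpDir/part-*"))(0).getPath
fs.rename(partFile, new Path(target))
fs.delete(new Path(tmpDir), true)
```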

Spark dataframe save in single file on hdfs location [duplicate]

自作多情 submitted on 2019-11-26 14:02:58
Question: This question already has answers here: How to save RDD data into json files, not folders (2 answers). Closed 2 years ago.

I have a dataframe and I want to save it in a single file on an HDFS location. I found the solution here: Write single CSV file using spark-csv

df.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("mydata.csv")

But all the data will be written to mydata.csv/part-00000 and I wanted it to be a mydata.csv file. Is that possible? Any help appreciated.

Answer 1: It's