apache-spark-sql

How to skip the first and last line of a .dat file and make it into a DataFrame using Scala in Databricks

Submitted by 老子叫甜甜 on 2021-02-19 08:59:30
Question:

H|*|D|*|PA|*|BJ|*|S|*|2019.05.27 08:54:24|##|
H|*|AP_ATTR_ID|*|AP_ID|*|OPER_ID|*|ATTR_ID|*|ATTR_GROUP|*|LST_UPD_USR|*|LST_UPD_TSTMP|##|
779045|*|Sar|*|SUPERVISOR HIERARCHY|*|Supervisor|*|2|*|128|*|2019.05.14 16:48:16|##|
779048|*|KK|*|SUPERVISOR HIERARCHY|*|Supervisor|*|2|*|116|*|2019.05.14 16:59:02|##|
779054|*|Nisha - A|*|EXACT|*|CustomColumnRow120|*|2|*|1165|*|2019.05.15 12:11:48|##|
T|*||*|2019.05.27 08:54:28|##|

The file name is PA.dat. I need to skip the first line and also the last line of the …
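A minimal sketch of one way to do this in Scala, assuming the file lives at /mnt/data/PA.dat, fields are delimited by |*|, every record ends with |##|, and the second physical line is the column header (everything here is an assumption based on the sample, not the asker's code):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.getOrCreate()

// Read the file as plain text and index every physical line.
val indexed = spark.sparkContext.textFile("/mnt/data/PA.dat").zipWithIndex()
val lastIndex = indexed.count() - 1

// Drop the first line (the file header) and the last line (the trailer record),
// strip the trailing |##| marker and split on the |*| delimiter.
val tokens = indexed
  .filter { case (_, idx) => idx != 0 && idx != lastIndex }
  .keys
  .map(_.trim.stripSuffix("|##|").split("\\|\\*\\|", -1))

// The first remaining line is the column header row; it starts with an "H" marker.
val headerTokens = tokens.first()
val columnNames = headerTokens.drop(1)
val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

// Everything that is not the header row becomes a data row.
val rows = tokens
  .filter(!_.sameElements(headerTokens))
  .map(fields => Row.fromSeq(fields.padTo(columnNames.length, null).take(columnNames.length)))

val df = spark.createDataFrame(rows, schema)
df.show(false)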

Pyspark Schema for Json file

Submitted by 你。 on 2021-02-19 08:14:06
Question: I am trying to read a complex JSON file into a Spark dataframe. Spark recognizes the schema but mistakes a field for a string when it happens to be an empty array. (Not sure why it is string type when it has to be an array type.) Below is a sample of what I am expecting:

arrayfield:[{"name":"somename"},{"address" : "someadress"}]

Right now the data is as below:

arrayfield:[]

What this does to my code is that whenever I try querying arrayfield.name it fails. I know I can input a schema while reading …
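One usual fix is to declare the schema up front instead of letting Spark infer it. A hedged sketch of that idea in Scala (the question uses PySpark, but the DataFrame API has the same shape); the file path and the struct fields inside arrayfield are assumptions based on the snippet above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val spark = SparkSession.builder.getOrCreate()

// Declaring arrayfield as array<struct<...>> keeps its type stable even when
// every record in the file happens to contain an empty array.
val arrayElement = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("address", StringType, nullable = true)
))

val schema = StructType(Seq(
  StructField("arrayfield", ArrayType(arrayElement), nullable = true)
  // ...declare the remaining top-level fields of the file here as well
))

val df = spark.read.schema(schema).json("/mnt/data/sample.json")
df.printSchema()
df.select("arrayfield.name").show(false)  // no longer fails when the array is empty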

Pyspark: weighted average by a column

Submitted by 孤者浪人 on 2021-02-19 07:39:47
Question: For example, I have a dataset like this:

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
])\
    .toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()

and I can obtain the customer-region order count matrix by

overall_stat = test.groupBy("customerid").agg(count("orderid"))\
    .withColumnRenamed("count …
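The question is cut off before it names the weight column, so the sketch below (in Scala rather than PySpark, to keep the examples here in one language) only illustrates the general pattern, sum(value * weight) / sum(weight), using a hypothetical "weight" column that does not exist in the question's data:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Weighted average of valueCol per groupCol, weighted by weightCol.
def weightedAvg(df: DataFrame, groupCol: String, valueCol: String, weightCol: String): DataFrame =
  df.groupBy(groupCol)
    .agg((sum(col(valueCol) * col(weightCol)) / sum(col(weightCol))).alias(s"weighted_avg_$valueCol"))

// Tiny made-up example data.
val test = Seq(
  (1, "Region A", 5.0, 2.0),
  (1, "Region B", 2.0, 1.0),
  (2, "Region B", 1.0, 4.0)
).toDF("customerid", "location", "price", "weight")

weightedAvg(test, "customerid", "price", "weight").show()
// customerid 1 -> (5.0 * 2.0 + 2.0 * 1.0) / (2.0 + 1.0) = 4.0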

Spark Get only columns that have one or more null values

Submitted by 混江龙づ霸主 on 2021-02-19 04:25:47
Question: From a dataframe I want to get the names of the columns which contain at least one null value. Consider the dataframe below:

val dataset = sparkSession.createDataFrame(Seq(
  (7, null, 18, 1.0),
  (8, "CA", null, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

I want to get the column names 'country' and 'hour'.

id  country  hour  clicked
7   null     18    1
8   "CA"     null  0
9   "NZ"     15    0

Answer 1: This is one solution, but it's a bit awkward; I hope there is an easier way:

val cols = dataset …
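A sketch of one way to get those names, assuming null is the only kind of missing value that matters here (Option is used in the sample data so the null cells type-check in Scala):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, when}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Reconstruction of the sample data from the question.
val dataset = Seq(
  (7, Option.empty[String], Option(18), 1.0),
  (8, Option("CA"), Option.empty[Int], 0.0),
  (9, Option("NZ"), Option(15), 0.0)
).toDF("id", "country", "hour", "clicked")

// Count the nulls per column in a single pass over the data...
val nullCounts = dataset
  .select(dataset.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).alias(c)): _*)
  .first()

// ...then keep the names whose null count is greater than zero.
val columnsWithNulls = dataset.columns.filter(c => nullCounts.getAs[Long](c) > 0)
// columnsWithNulls: Array(country, hour)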

Does Hive preserve file order when selecting data

Submitted by 不打扰是莪最后的温柔 on 2021-02-19 04:05:44
Question: If I do select * from table1; in which order will the data be retrieved, file order or random order?

Answer 1: Without ORDER BY the order is not guaranteed. Data is read in parallel by many processes (mappers); after splits are calculated, each process starts reading some piece of a file or a few files, depending on the splits calculated. All the parallel processes can process different volumes of data and run on different nodes, and the load is not the same each time, so they start returning rows and finishing …
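The practical remedy is to sort explicitly whenever a stable order matters. A one-line illustration, run here through Spark SQL to stay consistent with the other examples (the sort column "id" is hypothetical; only "table1" comes from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Without ORDER BY the row order is undefined, so sort on an explicit key.
val ordered = spark.sql("SELECT * FROM table1 ORDER BY id")
ordered.show()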

Spark: Replace Null value in a Nested column

Submitted by 跟風遠走 on 2021-02-19 03:53:08
Question: I would like to replace all the n/a values in the below dataframe with unknown. The column can be either a scalar or a complex nested column. If it's a StructField column I can loop through the columns and replace n/a using withColumn. But I would like this to be done in a generic way regardless of the type of the column, as I don't want to specify the column names explicitly since there are hundreds of them in my case.

case class Bar(x: Int, y: String, z: String)
case class Foo(id: Int, name: String, status: String, bar: …
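A hedged sketch of one generic approach: walk the schema recursively and rebuild every column, replacing "n/a" with "unknown" in every string leaf, including leaves nested inside structs. Arrays and maps are not handled here, and a null struct comes back as a struct of nulls, so treat it as a starting point rather than a finished solution:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct, when}
import org.apache.spark.sql.types.{DataType, StringType, StructType}

// Rebuild a single column, descending into struct fields.
def replaceNA(dataType: DataType, c: Column): Column = dataType match {
  case st: StructType =>
    struct(st.fields.map(f => replaceNA(f.dataType, c.getField(f.name)).alias(f.name)): _*)
  case StringType =>
    when(c === "n/a", "unknown").otherwise(c)
  case _ => c
}

// Apply it to every top-level column without naming any of them explicitly.
def replaceAllNA(df: DataFrame): DataFrame =
  df.select(df.schema.fields.map(f => replaceNA(f.dataType, col(f.name)).alias(f.name)): _*)

// Usage with the question's Foo/Bar case classes (fooDS is assumed to exist):
// val fixed = replaceAllNA(fooDS.toDF())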

How to insert a custom function within For loop in pyspark?

Submitted by ﹥>﹥吖頭↗ on 2021-02-18 19:41:53
Question: I am facing a challenge in Spark within Azure Databricks. I have a dataset such as:

+------------------+----------+-------------------+---------------+
|     OpptyHeaderID|   OpptyID|               Date|BaseAmountMonth|
+------------------+----------+-------------------+---------------+
|0067000000i6ONPAA2|OP-0164615|2014-07-27 00:00:00|    4375.800000|
|0065w0000215k5kAAA|OP-0218055|2020-12-23 00:00:00|    4975.000000|
+------------------+----------+-------------------+---------------+

Now I need to use a loop function to …
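The question is truncated before it says what the loop should compute, so the sketch below (in Scala rather than PySpark, for consistency with the other examples) only shows the usual shape of the answer: rather than looping over rows, fold a column-level function over whichever columns need it. The cast is just a stand-in for the real custom function, and the column name comes from the sample above:

import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.getOrCreate()

// Apply the same column-level function f to a list of columns, one withColumn at a time.
def applyToColumns(df: DataFrame, columns: Seq[String])(f: Column => Column): DataFrame =
  columns.foldLeft(df)((acc, c) => acc.withColumn(c, f(col(c))))

// Hypothetical usage, assuming df holds the dataset shown above:
// val result = applyToColumns(df, Seq("BaseAmountMonth"))(_.cast("double"))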

Spark Dataframe: Select distinct rows

Submitted by 狂风中的少年 on 2021-02-18 17:00:20
Question: I tried two ways to find distinct rows from parquet but neither seems to work.

Attempt 1:

Dataset<Row> df = sqlContext.read().parquet("location.parquet").distinct();

But this throws:

Cannot have map type columns in DataFrame which calls set operations (intersect, except, etc.), but the type of column canvasHashes is map<string,string>;;

Attempt 2: Tried running SQL queries:

Dataset<Row> df = sqlContext.read().parquet("location.parquet");
rawLandingDS.createOrReplaceTempView("df");
Dataset<Row> …
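A sketch of the usual workaround in Scala (the question uses the Java API, but the idea carries over): distinct() has to compare every column, and Spark refuses set operations over map-typed columns, so deduplicate on every column except the map instead. The assumptions here are that canvasHashes is the only map column and that rows differing only in that map should collapse to one:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

val df = spark.read.parquet("location.parquet")

// dropDuplicates with an explicit subset groups on those columns only, so the
// map column never has to take part in the comparison.
val deduped = df.dropDuplicates(df.columns.filterNot(_ == "canvasHashes"))
deduped.show(false)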
