apache-spark-sql

How to skip the first and last line of a .dat file and make it into a DataFrame using Scala in Databricks

Submitted by 老子叫甜甜 on 2021-02-19 08:59:30
Question:

H|*|D|*|PA|*|BJ|*|S|*|2019.05.27 08:54:24|##|
H|*|AP_ATTR_ID|*|AP_ID|*|OPER_ID|*|ATTR_ID|*|ATTR_GROUP|*|LST_UPD_USR|*|LST_UPD_TSTMP|##|
779045|*|Sar|*|SUPERVISOR HIERARCHY|*|Supervisor|*|2|*|128|*|2019.05.14 16:48:16|##|
779048|*|KK|*|SUPERVISOR HIERARCHY|*|Supervisor|*|2|*|116|*|2019.05.14 16:59:02|##|
779054|*|Nisha - A|*|EXACT|*|CustomColumnRow120|*|2|*|1165|*|2019.05.15 12:11:48|##|
T|*||*|2019.05.27 08:54:28|##|

The file name is PA.dat. I need to skip the first line and also the last line of the …
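A minimal sketch of one way to do this in Scala, assuming the file lives at /mnt/data/PA.dat, fields are delimited by |*|, every record ends with |##|, and the second physical line is the column header (everything here is an assumption based on the sample, not the asker's code):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.getOrCreate()

// Read the file as plain text and index every physical line.
val indexed = spark.sparkContext.textFile("/mnt/data/PA.dat").zipWithIndex()
val lastIndex = indexed.count() - 1

// Drop the first line (the file header) and the last line (the trailer record),
// strip the trailing |##| marker and split on the |*| delimiter.
val tokens = indexed
  .filter { case (_, idx) => idx != 0 && idx != lastIndex }
  .keys
  .map(_.trim.stripSuffix("|##|").split("\\|\\*\\|", -1))

// The first remaining line is the column header row; it starts with an "H" marker.
val headerTokens = tokens.first()
val columnNames = headerTokens.drop(1)
val schema = StructType(columnNames.map(name => StructField(name, StringType, nullable = true)))

// Everything that is not the header row becomes a data row.
val rows = tokens
  .filter(!_.sameElements(headerTokens))
  .map(fields => Row.fromSeq(fields.padTo(columnNames.length, null).take(columnNames.length)))

val df = spark.createDataFrame(rows, schema)
df.show(false)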

Pyspark Schema for Json file

Submitted by 你。 on 2021-02-19 08:14:06
Question: I am trying to read a complex JSON file into a Spark dataframe. Spark recognizes the schema but mistakes a field for a string when it happens to be an empty array. (Not sure why it is string type when it has to be an array type.) Below is a sample of what I am expecting:

arrayfield:[{"name":"somename"},{"address" : "someadress"}]

Right now the data is as below:

arrayfield:[]

What this does to my code is that whenever I try querying arrayfield.name it fails. I know I can input a schema while reading …
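One usual fix is to declare the schema up front instead of letting Spark infer it. A hedged sketch of that idea in Scala (the question uses PySpark, but the DataFrame API has the same shape); the file path and the struct fields inside arrayfield are assumptions based on the snippet above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val spark = SparkSession.builder.getOrCreate()

// Declaring arrayfield as array<struct<...>> keeps its type stable even when
// every record in the file happens to contain an empty array.
val arrayElement = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("address", StringType, nullable = true)
))

val schema = StructType(Seq(
  StructField("arrayfield", ArrayType(arrayElement), nullable = true)
  // ...declare the remaining top-level fields of the file here as well
))

val df = spark.read.schema(schema).json("/mnt/data/sample.json")
df.printSchema()
df.select("arrayfield.name").show(false)  // no longer fails when the array is empty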

Pyspark: weighted average by a column

Submitted by 孤者浪人 on 2021-02-19 07:39:47
Question: For example, I have a dataset like this:

test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
])\
    .toDF("orderid", "customerid", "price", "transactiondate", "location")
test.show()

and I can obtain the customer-region order count matrix by

overall_stat = test.groupBy("customerid").agg(count("orderid"))\
    .withColumnRenamed("count …
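The question is cut off before it names the weight column, so the sketch below (in Scala rather than PySpark, to keep the examples here in one language) only illustrates the general pattern, sum(value * weight) / sum(weight), using a hypothetical "weight" column that does not exist in the question's data:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Weighted average of valueCol per groupCol, weighted by weightCol.
def weightedAvg(df: DataFrame, groupCol: String, valueCol: String, weightCol: String): DataFrame =
  df.groupBy(groupCol)
    .agg((sum(col(valueCol) * col(weightCol)) / sum(col(weightCol))).alias(s"weighted_avg_$valueCol"))

// Tiny made-up example data.
val test = Seq(
  (1, "Region A", 5.0, 2.0),
  (1, "Region B", 2.0, 1.0),
  (2, "Region B", 1.0, 4.0)
).toDF("customerid", "location", "price", "weight")

weightedAvg(test, "customerid", "price", "weight").show()
// customerid 1 -> (5.0 * 2.0 + 2.0 * 1.0) / (2.0 + 1.0) = 4.0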

Spark Get only columns that have one or more null values

Submitted by 混江龙づ霸主 on 2021-02-19 04:25:47
Question: From a dataframe I want to get the names of the columns which contain at least one null value. Consider the dataframe below:

val dataset = sparkSession.createDataFrame(Seq(
  (7, null, 18, 1.0),
  (8, "CA", null, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

I want to get the column names 'country' and 'hour'.

id  country  hour  clicked
7   null     18    1
8   "CA"     null  0
9   "NZ"     15    0

Answer 1: This is one solution, but it's a bit awkward; I hope there is an easier way:

val cols = dataset …
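A sketch of one way to get those names, assuming null is the only kind of missing value that matters here (Option is used in the sample data so the null cells type-check in Scala):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, when}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Reconstruction of the sample data from the question.
val dataset = Seq(
  (7, Option.empty[String], Option(18), 1.0),
  (8, Option("CA"), Option.empty[Int], 0.0),
  (9, Option("NZ"), Option(15), 0.0)
).toDF("id", "country", "hour", "clicked")

// Count the nulls per column in a single pass over the data...
val nullCounts = dataset
  .select(dataset.columns.map(c => sum(when(col(c).isNull, 1).otherwise(0)).alias(c)): _*)
  .first()

// ...then keep the names whose null count is greater than zero.
val columnsWithNulls = dataset.columns.filter(c => nullCounts.getAs[Long](c) > 0)
// columnsWithNulls: Array(country, hour)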

Does Hive preserve file order when selecting data

Submitted by 不打扰是莪最后的温柔 on 2021-02-19 04:05:44
Question: If I do select * from table1; in which order will the data be retrieved, file order or random order?

Answer 1: Without ORDER BY the order is not guaranteed. Data is read in parallel by many processes (mappers); after splits are calculated, each process starts reading some piece of a file or a few files, depending on the splits calculated. All the parallel processes can process different volumes of data and run on different nodes, and the load is not the same each time, so they start returning rows and finishing …
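The practical remedy is to sort explicitly whenever a stable order matters. A one-line illustration, run here through Spark SQL to stay consistent with the other examples (the sort column "id" is hypothetical; only "table1" comes from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

// Without ORDER BY the row order is undefined, so sort on an explicit key.
val ordered = spark.sql("SELECT * FROM table1 ORDER BY id")
ordered.show()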

Spark: Replace Null value in a Nested column

Submitted by 跟風遠走 on 2021-02-19 03:53:08
Question: I would like to replace all the n/a values in the below dataframe with unknown. The column can be either a scalar or a complex nested column. If it's a StructField column I can loop through the columns and replace n/a using withColumn. But I would like this to be done in a generic way regardless of the type of the column, as I don't want to specify the column names explicitly since there are hundreds of them in my case.

case class Bar(x: Int, y: String, z: String)
case class Foo(id: Int, name: String, status: String, bar: …
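A hedged sketch of one generic approach: walk the schema recursively and rebuild every column, replacing "n/a" with "unknown" in every string leaf, including leaves nested inside structs. Arrays and maps are not handled here, and a null struct comes back as a struct of nulls, so treat it as a starting point rather than a finished solution:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct, when}
import org.apache.spark.sql.types.{DataType, StringType, StructType}

// Rebuild a single column, descending into struct fields.
def replaceNA(dataType: DataType, c: Column): Column = dataType match {
  case st: StructType =>
    struct(st.fields.map(f => replaceNA(f.dataType, c.getField(f.name)).alias(f.name)): _*)
  case StringType =>
    when(c === "n/a", "unknown").otherwise(c)
  case _ => c
}

// Apply it to every top-level column without naming any of them explicitly.
def replaceAllNA(df: DataFrame): DataFrame =
  df.select(df.schema.fields.map(f => replaceNA(f.dataType, col(f.name)).alias(f.name)): _*)

// Usage with the question's Foo/Bar case classes (fooDS is assumed to exist):
// val fixed = replaceAllNA(fooDS.toDF())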

How to insert a custom function within For loop in pyspark?

Submitted by ﹥>﹥吖頭↗ on 2021-02-18 19:41:53
Question: I am facing a challenge in Spark within Azure Databricks. I have a dataset such as:

+------------------+----------+-------------------+---------------+
|     OpptyHeaderID|   OpptyID|               Date|BaseAmountMonth|
+------------------+----------+-------------------+---------------+
|0067000000i6ONPAA2|OP-0164615|2014-07-27 00:00:00|    4375.800000|
|0065w0000215k5kAAA|OP-0218055|2020-12-23 00:00:00|    4975.000000|
+------------------+----------+-------------------+---------------+

Now I need to use a loop function to …
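The question is truncated before it says what the loop should compute, so the sketch below (in Scala rather than PySpark, for consistency with the other examples) only shows the usual shape of the answer: rather than looping over rows, fold a column-level function over whichever columns need it. The cast is just a stand-in for the real custom function, and the column name comes from the sample above:

import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.getOrCreate()

// Apply the same column-level function f to a list of columns, one withColumn at a time.
def applyToColumns(df: DataFrame, columns: Seq[String])(f: Column => Column): DataFrame =
  columns.foldLeft(df)((acc, c) => acc.withColumn(c, f(col(c))))

// Hypothetical usage, assuming df holds the dataset shown above:
// val result = applyToColumns(df, Seq("BaseAmountMonth"))(_.cast("double"))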

Spark Dataframe: Select distinct rows

Submitted by 狂风中的少年 on 2021-02-18 17:00:20
Question: I tried two ways to find distinct rows from parquet but neither seems to work.

Attempt 1:

Dataset<Row> df = sqlContext.read().parquet("location.parquet").distinct();

But this throws:

Cannot have map type columns in DataFrame which calls set operations (intersect, except, etc.), but the type of column canvasHashes is map<string,string>;;

Attempt 2: Tried running SQL queries:

Dataset<Row> df = sqlContext.read().parquet("location.parquet");
rawLandingDS.createOrReplaceTempView("df");
Dataset<Row> …
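A sketch of the usual workaround in Scala (the question uses the Java API, but the idea carries over): distinct() has to compare every column, and Spark refuses set operations over map-typed columns, so deduplicate on every column except the map instead. The assumptions here are that canvasHashes is the only map column and that rows differing only in that map should collapse to one:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()

val df = spark.read.parquet("location.parquet")

// dropDuplicates with an explicit subset groups on those columns only, so the
// map column never has to take part in the comparison.
val deduped = df.dropDuplicates(df.columns.filterNot(_ == "canvasHashes"))
deduped.show(false)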
