apache-spark-sql

Count of all elements less than the value in a row

Submitted by 余生颓废 on 2021-01-28 21:14:33
Question: Given a dataframe

value
-----
0.3
0.2
0.7
0.5

is there a way to build a column that contains, for each row, the count of the values in that column that are less than or equal to the row's value? Specifically:

value   count_less_equal
------------------------
0.3     2
0.2     1
0.7     4
0.5     3

I could groupBy the value column, but I don't know how to filter all the values that are less than that value. I was thinking, maybe it's possible to duplicate the first column, then create a filter so that for each …
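The snippet above is cut off, but a minimal PySpark sketch of one way to get this result (assuming a single value column, as in the example) uses a window frame over the ordered column so that each row counts every value less than or equal to its own:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(0.3,), (0.2,), (0.7,), (0.5,)], ["value"])

    # A range frame from the start of the ordering up to the current value counts
    # every row whose value is <= this row's value (ties are included as peers).
    w = Window.orderBy("value").rangeBetween(Window.unboundedPreceding, Window.currentRow)
    df.withColumn("count_less_equal", F.count("value").over(w)).show()

Without a partitionBy this window pulls everything into one partition, which is fine for small data but worth partitioning on larger sets.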

Appending column name to column value using Spark

Submitted by 柔情痞子 on 2021-01-28 20:05:44
Question: I have data in a comma-separated file, which I have loaded into a Spark data frame. The data looks like:

A B C
1 2 3
4 5 6
7 8 9

I want to transform the above data frame in Spark using pyspark into:

A   B   C
A_1 B_2 C_3
A_4 B_5 C_6
--------------

Then convert it to a list of lists using pyspark:

[[A_1, B_2, C_3], [A_4, B_5, C_6]]

And then run the FP-Growth algorithm on the above data set using pyspark. The code that I have tried is below:

from pyspark.sql.functions import col, size
from pyspark.sql …
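The post is truncated, but a small PySpark sketch of the transformation it describes (prefix each value with its column name, then collect a list of lists) could look like this; df is assumed to already hold the columns A, B, C:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2, 3), (4, 5, 6), (7, 8, 9)], ["A", "B", "C"])

    # Prefix every value with its column name, e.g. 1 -> "A_1"
    prefixed = df.select([
        F.concat_ws("_", F.lit(c), F.col(c).cast("string")).alias(c)
        for c in df.columns
    ])

    # List of lists, e.g. [['A_1', 'B_2', 'C_3'], ['A_4', 'B_5', 'C_6'], ...]
    rows = [list(r) for r in prefixed.collect()]

For FP-Growth itself, pyspark.ml.fpm.FPGrowth expects an array column rather than a Python list, so building F.array(*prefixed.columns) on the DataFrame may be the more direct route.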

How to refer to a map column in a spark-sql query?

Submitted by 允我心安 on 2021-01-28 19:11:42
Question:

scala> val map1 = spark.sql("select map('p1', 's1', 'p2', 's2')")
map1: org.apache.spark.sql.DataFrame = [map(p1, s1, p2, s2): map<string,string>]

scala> map1.show()
+--------------------+
| map(p1, s1, p2, s2)|
+--------------------+
|[p1 -> s1, p2 -> s2]|
+--------------------+

scala> spark.sql("select element_at(map1, 'p1')")
org.apache.spark.sql.AnalysisException: cannot resolve ' map1 ' given input columns: []; line 1 pos 18;
'Project [unresolvedalias('element_at('map1, p1), None)]

How …
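The AnalysisException here is because map1 is only a variable in the REPL; the SQL parser has no table or column called map1. The question is in Scala, but a PySpark sketch of the usual fix (give the map column an alias and register the DataFrame as a temp view) looks like this:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Name the map column and expose the DataFrame to SQL as a view
    df = spark.sql("select map('p1', 's1', 'p2', 's2') as map1")
    df.createOrReplaceTempView("map_table")

    # element_at can now resolve map1 as a column of map_table; prints s1
    spark.sql("select element_at(map1, 'p1') from map_table").show()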

Databricks: Equivalent code for SQL query

Submitted by China☆狼群 on 2021-01-28 18:22:16
Question: I'm looking for the equivalent Databricks code for the query. I added some sample code and the expected output as well, but in particular I'm looking for the equivalent code in Databricks for the query. For the moment I'm stuck on the CROSS APPLY STRING_SPLIT part. Sample SQL data:

CREATE TABLE FactTurnover
(
  ID INT,
  SalesPriceExcl NUMERIC (9,4),
  Discount VARCHAR(100)
)

INSERT INTO FactTurnover VALUES
(1, 100, '10'),
(2, 39.5877, '58, 12'),
(3, 100, '50, 10, 15'),
(4, 100, 'B')

Query:

;WITH CTE AS …
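The query itself is cut off, but the CROSS APPLY STRING_SPLIT piece maps to split plus explode in Spark. A minimal PySpark sketch over the same sample rows (table and column names taken from the T-SQL above):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    fact_turnover = spark.createDataFrame(
        [(1, 100.0, "10"), (2, 39.5877, "58, 12"),
         (3, 100.0, "50, 10, 15"), (4, 100.0, "B")],
        ["ID", "SalesPriceExcl", "Discount"],
    )

    # CROSS APPLY STRING_SPLIT(Discount, ',')  ~  explode(split(Discount, ','))
    exploded = fact_turnover.withColumn(
        "DiscountValue", F.explode(F.split(F.col("Discount"), r",\s*"))
    )
    exploded.show()

Each input row is repeated once per discount value, which is what the CROSS APPLY does on the SQL Server side.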

java.lang.NumberFormatException: For input string: “0.000” [duplicate]

Submitted by 主宰稳场 on 2021-01-28 14:38:29
Question: This question already has answers here: What is a NumberFormatException and how can I fix it? (9 answers). Closed 1 year ago. I am trying to create a udf that takes two strings as parameters; one in DD-MM-YYYY format (e.g. "14-10-2019") and the other in float format (e.g. "0.000"). I want to convert the float-like string to an int and add it to the date object to get another date, which I want to return as a string.

def getEndDate = udf{ (startDate: String, no_of_days : String) =>
  val num_days = …
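The code is truncated, but the exception itself comes from parsing "0.000" directly as an integer ("0.000".toInt in Scala), which only accepts integer-formatted strings; the usual fix is to parse as a double first and then truncate. A PySpark sketch of the same UDF idea (the function and column names are mine, not from the post):

    from datetime import datetime, timedelta
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    @F.udf(returnType=StringType())
    def get_end_date(start_date, no_of_days):
        # int("0.000") fails just like "0.000".toInt in Scala,
        # so go through float first and then truncate to whole days
        days = int(float(no_of_days))
        end = datetime.strptime(start_date, "%d-%m-%Y") + timedelta(days=days)
        return end.strftime("%d-%m-%Y")

    # usage: df.withColumn("end_date", get_end_date("start_date", "no_of_days"))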

Rename Column in Athena

Submitted by 时光总嘲笑我的痴心妄想 on 2021-01-28 14:27:05
Question: The Athena table "organization" reads data from Parquet files in S3. I need to change a column name from "cost" to "fee". The data files go back to Jan 2018. If I just rename the column in Athena, the table won't be able to find data for the new column in the Parquet files. Please let me know if there are ways to resolve this.

Answer 1: You have to change the schema and point to the new column "fee". But it depends on your situation. If you have two data sets, in one dataset it is called "cost" and in another dataset it is …
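The answer is cut off above. Athena typically resolves Parquet columns by name, so renaming the column only in the table definition leaves the historical files unreadable under the new name. One heavier-weight workaround, which is an assumption on my part rather than the rest of the quoted answer, is to rewrite the old files with the new column name, for example with Spark; the S3 paths below are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder paths: rewrite the historical files so every file carries "fee"
    old = spark.read.parquet("s3://my-bucket/organization/")
    (old.withColumnRenamed("cost", "fee")
        .write.mode("overwrite")
        .parquet("s3://my-bucket/organization-renamed/"))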

Efficiently batching Spark dataframes to call an API

Submitted by 谁说胖子不能爱 on 2021-01-28 10:52:55
Question: I am fairly new to Spark and I'm trying to call the Spotify API using Spotipy. I have a list of artist ids which can be used to fetch artist info. The Spotify API allows batch calls of up to 50 ids at once. I load the artist ids from a MySQL database and store them in a dataframe. My problem now is that I do not know how to efficiently batch that dataframe into pieces of 50 or fewer rows. In the example below I'm turning the dataframe into a regular Python list from which I can call the API …
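The list-based attempt is cut off, but one way to keep the batching inside Spark is to number the rows and aggregate the ids into arrays of at most 50. A sketch, assuming the id column is called artist_id and the Spotipy client is sp (both names are assumptions):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(str(i),) for i in range(120)], ["artist_id"])

    # Number the rows, then group every block of 50 ids into a single array
    w = Window.orderBy("artist_id")
    batches = (df.withColumn("batch", F.floor((F.row_number().over(w) - 1) / 50))
                 .groupBy("batch")
                 .agg(F.collect_list("artist_id").alias("ids")))

    # Each row now holds up to 50 ids, i.e. one Spotify call per row:
    # for row in batches.toLocalIterator():
    #     artists = sp.artists(row.ids)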

Issue with df.show() in pyspark

Submitted by ﹥>﹥吖頭↗ on 2021-01-28 09:19:37
Question: I have the following code:

import pyspark
import pandas as pd
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType

sc = pyspark.SparkContext()
sqlCtx = SQLContext(sc)

df_pd = pd.DataFrame(
    data={'integers': [1, 2, 3],
          'floats': [-1.0, 0.5, 2.7],
          'integer_arrays': [[1, 2], [3, 4, 5], [6, 7, 8, 9]]}
)

df = sqlCtx.createDataFrame(df_pd)
df.printSchema()

This runs fine until here, but when I run:

df.show()

it gives this error: …
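The error text is truncated, so the real cause cannot be confirmed from this listing; show() is simply the first action that forces the data to be materialized, which is why the earlier lines appear to run fine. One low-cost thing to try (an assumption about the cause, not a confirmed fix) is to pass an explicit schema instead of letting createDataFrame infer types from the pandas frame:

    from pyspark.sql.types import (StructType, StructField, LongType,
                                   DoubleType, ArrayType)

    # Field order matches the pandas columns defined above
    schema = StructType([
        StructField("integers", LongType()),
        StructField("floats", DoubleType()),
        StructField("integer_arrays", ArrayType(LongType())),
    ])

    df = sqlCtx.createDataFrame(df_pd, schema=schema)
    df.show()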

Is there a way to collect the names of all fields in a nested schema in pyspark

Submitted by 孤者浪人 on 2021-01-28 08:40:33
Question: I wish to collect the names of all the fields in a nested schema. The data were imported from a JSON file. The schema looks like:

root
 |-- column_a: string (nullable = true)
 |-- column_b: string (nullable = true)
 |-- column_c: struct (nullable = true)
 |    |-- nested_a: struct (nullable = true)
 |    |    |-- double_nested_a: string (nullable = true)
 |    |    |-- double_nested_b: string (nullable = true)
 |    |    |-- double_nested_c: string (nullable = true)
 |    |-- nested_b: string (nullable = true)
 |-- column_d: …
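The schema print-out is cut off, but collecting every field name from a nested schema is a short recursive walk over df.schema. A sketch (the dotted prefix for nested names is my own convention):

    from pyspark.sql.types import StructType

    def field_names(schema, prefix=""):
        """Return all field names in a schema, descending into nested structs."""
        names = []
        for field in schema.fields:
            full_name = prefix + field.name
            names.append(full_name)
            if isinstance(field.dataType, StructType):
                names.extend(field_names(field.dataType, prefix=full_name + "."))
        return names

    # field_names(df.schema) ->
    # ['column_a', 'column_b', 'column_c', 'column_c.nested_a',
    #  'column_c.nested_a.double_nested_a', ...]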