pyspark

Load json file to spark dataframe

Submitted by 邮差的信 on 2021-01-29 20:16:40
Question: I am trying to load the following data.json file into a Spark dataframe:

    {"positionmessage":{"callsign": "PPH1", "name": 0.0, "mmsi": 100}}
    {"positionmessage":{"callsign": "PPH2", "name": 0.0, "mmsi": 200}}
    {"positionmessage":{"callsign": "PPH3", "name": 0.0, "mmsi": 300}}

with the following code:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType

    appName = "PySpark Example - JSON file to Spark Data Frame"
    master = "local"
    #
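
The snippet is cut off before the actual read call. As a rough sketch of how the nested positionmessage records above are typically loaded with an explicit schema (the file path and the use of newline-delimited JSON are assumptions based on the sample data):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

    spark = SparkSession.builder.appName("JSON to DataFrame").master("local").getOrCreate()

    # One struct field wrapping the nested positionmessage object
    schema = StructType([
        StructField("positionmessage", StructType([
            StructField("callsign", StringType(), True),
            StructField("name", DoubleType(), True),
            StructField("mmsi", IntegerType(), True),
        ]), True)
    ])

    # spark.read.json handles newline-delimited JSON like the sample above
    df = spark.read.json("data.json", schema=schema)
    df.select("positionmessage.*").show()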

PySpark and time series data: how to smartly avoid overlapping dates?

Submitted by 走远了吗. on 2021-01-29 18:40:26
Question: I have the following sample Spark dataframe:

    import datetime as dt
    import pandas as pd
    import pyspark
    import pyspark.sql.functions as fn
    from pyspark.sql.window import Window

    raw_df = pd.DataFrame([
        (1115, dt.datetime(2019,8,5,18,20), dt.datetime(2019,8,5,18,40)),
        (484, dt.datetime(2019,8,5,18,30), dt.datetime(2019,8,9,18,40)),
        (484, dt.datetime(2019,8,4,18,30), dt.datetime(2019,8,6,18,40)),
        (484, dt.datetime(2019,8,2,18,30), dt.datetime(2019,8,3,18,40)),
        (484, dt.datetime(2019,8,7,18,50), dt.datetime(2019,8,9,18
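
The question body is truncated before the actual ask, but going by the title, the usual approach to overlapping date ranges per key is to merge them with a window function. A rough sketch under that assumption (column names id, start, end are illustrative, not from the original post):

    from pyspark.sql import SparkSession, functions as fn
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(484, "2019-08-04 18:30", "2019-08-06 18:40"),
         (484, "2019-08-05 18:30", "2019-08-09 18:40"),
         (484, "2019-08-02 18:30", "2019-08-03 18:40")],
        ["id", "start", "end"],
    ).withColumn("start", fn.to_timestamp("start")).withColumn("end", fn.to_timestamp("end"))

    w = Window.partitionBy("id").orderBy("start")
    merged = (
        df
        # an interval starts a new group when it begins after every earlier interval has ended
        .withColumn("prev_max_end", fn.max("end").over(w.rowsBetween(Window.unboundedPreceding, -1)))
        .withColumn("new_group", fn.when(fn.col("start") > fn.col("prev_max_end"), 1).otherwise(0))
        .withColumn("grp", fn.sum("new_group").over(w))
        .groupBy("id", "grp")
        .agg(fn.min("start").alias("start"), fn.max("end").alias("end"))
    )
    merged.show()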

Perform a user defined function on a column of a large pyspark dataframe based on some columns of another pyspark dataframe on databricks

Submitted by 三世轮回 on 2021-01-29 18:10:15
Question: My question is related to my previous one, How to efficiently join large pyspark dataframes and small python list for some NLP results on databricks. I have worked out part of it and am now stuck on another problem. I have a small PySpark dataframe like:

    df1:
    +-----+------------+------------+------+
    |topic| termIndices| termWeights| terms|
    +-----+------------+------------+------+
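
The entry is cut off before the actual operation, but the common pattern for this kind of task is to collect the small dataframe to the driver, broadcast it, and use it inside a UDF applied to the large dataframe. A minimal sketch under that assumption (the column names text and terms and the overlap-scoring logic are purely illustrative, not from the original post):

    from pyspark.sql import SparkSession, functions as fn
    from pyspark.sql.types import IntegerType

    spark = SparkSession.builder.getOrCreate()

    # Small lookup dataframe: topic -> list of terms (illustrative data)
    small_df = spark.createDataFrame(
        [(0, ["price", "cost"]), (1, ["service", "support"])],
        ["topic", "terms"],
    )
    # Large dataframe of documents (illustrative data)
    large_df = spark.createDataFrame(
        [("the price was too high",), ("great support team",)],
        ["text"],
    )

    # Collect the small dataframe on the driver and broadcast it to the executors
    topic_terms = {row["topic"]: set(row["terms"]) for row in small_df.collect()}
    bc_terms = spark.sparkContext.broadcast(topic_terms)

    @fn.udf(IntegerType())
    def best_topic(text):
        # Pick the topic whose terms overlap the document the most
        words = set(text.split())
        scores = {t: len(words & terms) for t, terms in bc_terms.value.items()}
        return max(scores, key=scores.get)

    large_df.withColumn("topic", best_topic("text")).show(truncate=False)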

How to create BinaryType Column using multiple columns of a pySpark Dataframe?

Submitted by 血红的双手。 on 2021-01-29 17:53:58
Question: I have recently started working with PySpark, so I don't know many of the details yet. I am trying to create a BinaryType column in a dataframe, but I'm struggling to do it. For example, let's take a simple df:

    df.show(2)
    +----+----+
    |col1|col2|
    +----+----+
    | "1"|null|
    | "2"|"20"|
    +----+----+

Now I want a third column "col3" with BinaryType, like:

    +----+----+--------+
    |col1|col2|    col3|
    +----+----+--------+
    | "1"|null|[1 null]|
    | "2"|"20"| [2 20] |
    +----+----+--------+

How should I do it? Answer 1:
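
The answer is cut off. As a rough sketch of one way to get a genuine BinaryType column, the two columns can be serialized to bytes with a UDF (the encoding of the two values into a single payload is an assumption for illustration; depending on the truncated answer, an array or struct column may be what was actually wanted):

    from pyspark.sql import SparkSession, functions as fn
    from pyspark.sql.types import BinaryType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("1", None), ("2", "20")], ["col1", "col2"])

    @fn.udf(BinaryType())
    def to_bytes(c1, c2):
        # Encode both values into one bytes payload; None becomes the literal string "null"
        return f"{c1} {c2 if c2 is not None else 'null'}".encode("utf-8")

    result = df.withColumn("col3", to_bytes("col1", "col2"))
    result.printSchema()   # col3 is reported as binary
    result.show(truncate=False)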

Combine pivoted and aggregated column in PySpark Dataframe

Submitted by 有些话、适合烂在心里 on 2021-01-29 17:48:26
Question: My question is related to this one. I have a PySpark DataFrame, named df, as shown below.

    date       | recipe | percent | volume
    ----------------------------------------
    2019-01-01 | A      | 0.03    | 53
    2019-01-01 | A      | 0.02    | 55
    2019-01-01 | B      | 0.05    | 60
    2019-01-02 | A      | 0.11    | 75
    2019-01-02 | B      | 0.06    | 64
    2019-01-02 | B      | 0.08    | 66

If I pivot it on recipe and aggregate both percent and volume, I get column names that concatenate recipe and the aggregated variable. I can use alias to clean things up
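
The entry is truncated, but the aliasing it mentions typically looks like the sketch below: aliasing each aggregation inside agg() turns the generated names like "A_avg(percent)" into "A_percent". The choice of avg for percent and sum for volume is an assumption, since the original aggregations are not shown:

    from pyspark.sql import SparkSession, functions as fn

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("2019-01-01", "A", 0.03, 53), ("2019-01-01", "A", 0.02, 55),
         ("2019-01-01", "B", 0.05, 60), ("2019-01-02", "A", 0.11, 75),
         ("2019-01-02", "B", 0.06, 64), ("2019-01-02", "B", 0.08, 66)],
        ["date", "recipe", "percent", "volume"],
    )

    # Aliased aggregations give columns A_percent, A_volume, B_percent, B_volume
    pivoted = (
        df.groupBy("date")
          .pivot("recipe")
          .agg(fn.avg("percent").alias("percent"), fn.sum("volume").alias("volume"))
    )
    pivoted.show()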

Getting error on connecting to a local SQL Server database to databricks via JDBC connection

Submitted by 房东的猫 on 2021-01-29 17:40:34
Question: Basically, I'm trying to connect to a SQL Server database on my local machine from Databricks using a JDBC connection. I'm following the procedure shown in the documentation on the Databricks website. I used the following code as given there:

    jdbcHostname = "localhost"
    jdbcDatabase = "TestDB"
    jdbcPort = "3306"
    jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
    connectionProperties = {
        "jdbcUsername" : "user1",
        "jdbcPassword" :
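
Note that the snippet above uses a MySQL URL and MySQL's port 3306 against a SQL Server database, which is a likely source of the error. A rough sketch of the usual SQL Server form is shown below; host, table, and credentials are placeholders, spark is the session predefined in a Databricks notebook, and "localhost" would have to be replaced by a host actually reachable from the Databricks cluster:

    # Typical SQL Server JDBC settings (placeholders; adjust to your environment)
    jdbcHostname = "<reachable-sql-server-host>"
    jdbcDatabase = "TestDB"
    jdbcPort = 1433  # SQL Server default, not MySQL's 3306

    jdbcUrl = "jdbc:sqlserver://{0}:{1};databaseName={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
    connectionProperties = {
        "user": "user1",
        "password": "<password>",
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

    df = spark.read.jdbc(url=jdbcUrl, table="dbo.SomeTable", properties=connectionProperties)
    df.show()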

Start index with certain value zipWithIndex in pyspark

Submitted by 六月ゝ 毕业季﹏ on 2021-01-29 17:17:04
Question: I want the index values in a dataframe to start from a certain value instead of the default zero. Is there a parameter we can use for zipWithIndex() in PySpark?

Answer 1: The following solution will start zipWithIndex from a custom value:

    df = df_child.rdd.zipWithIndex().map(lambda x: (x[0], x[1] + index)).toDF()

where index is the number you want zipWithIndex to start from.

Source: https://stackoverflow.com/questions/60124599/start-index-with-certain-value-zipwithindex-in-pyspark
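
The one-liner above leaves the shifted index nested next to the original Row. A fuller sketch that flattens the result back into ordinary columns (the starting value 100 and the column name idx are assumptions for illustration):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df_child = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])

    index = 100  # starting value for the index

    # zipWithIndex yields (Row, position); shift the position and flatten into plain tuples
    df = (
        df_child.rdd
        .zipWithIndex()
        .map(lambda x: (x[1] + index,) + tuple(x[0]))
        .toDF(["idx"] + df_child.columns)
    )
    df.show()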

How to run spark-submit programs on a different cluster (1**.1*.0.21) from Airflow (1**.1*.0.35)? How to connect remotely to the other cluster from Airflow

Submitted by 不羁的心 on 2021-01-29 16:35:03
Question: I have been trying to spark-submit programs from Airflow, but the Spark files are on a different cluster (1**.1*.0.21) and Airflow is on (1**.1*.0.35). I am looking for a detailed explanation of this topic with examples. I can't copy or download any XML files or other files to my Airflow cluster. I also have doubts about using the SSH Operator and BashOperator; when I try the SSH hook it says:

    Broken DAG: [/opt/airflow/dags/s.py] No module named paramiko

Answer 1: You can try using Livy. In the following
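
The answer is cut off after mentioning Livy. As a rough sketch of that approach (assuming a Livy server is running on the Spark cluster on its default port 8998; the host, application path, and arguments are placeholders), a plain Python function can submit the job over Livy's REST batches API and be wrapped in an ordinary Airflow PythonOperator, so nothing needs to be copied to the Airflow host:

    import requests

    LIVY_URL = "http://<spark-cluster-host>:8998"  # placeholder Livy endpoint on the remote cluster

    def submit_spark_job():
        # Equivalent of spark-submit, done over HTTP from the Airflow worker
        payload = {
            "file": "hdfs:///path/to/your_app.py",   # placeholder application path
            "args": ["--date", "2021-01-29"],
            "conf": {"spark.submit.deployMode": "cluster"},
        }
        resp = requests.post(f"{LIVY_URL}/batches", json=payload)
        resp.raise_for_status()
        # The returned batch id can then be polled at /batches/<id>/state until it finishes
        return resp.json()["id"]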

SAS Proc Freq with PySpark (Frequency, percent, cumulative frequency, and cumulative percent)

Submitted by 纵然是瞬间 on 2021-01-29 15:29:09
Question: I'm looking for a way to reproduce the SAS Proc Freq output in PySpark. I found code that does exactly what I need; however, it is written in Pandas. I want to make sure the solution uses the best of what Spark can offer, as the code will run against massive datasets. In another post (which was also adapted for a StackOverflow answer), I found instructions for computing distributed group-wise cumulative sums in PySpark, but I'm not sure how to adapt them to my case. Here's an input and output example
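
The input/output example is truncated, but the standard one-way Proc Freq table (frequency, percent, cumulative frequency, cumulative percent) can be built from a groupBy plus a window ordered by the variable's values. A minimal sketch, with the column name category and the sample data assumed for illustration:

    from pyspark.sql import SparkSession, functions as fn
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("a",), ("a",), ("b",), ("c",), ("c",), ("c",)], ["category"]
    )

    total = df.count()
    # After the groupBy there is only one row per category, so a single-partition window is cheap
    w = Window.orderBy("category").rowsBetween(Window.unboundedPreceding, Window.currentRow)

    freq = (
        df.groupBy("category").count()
          .withColumnRenamed("count", "frequency")
          .withColumn("percent", fn.col("frequency") / fn.lit(total) * 100)
          # cumulative columns follow the sort order of the grouping variable, as Proc Freq does
          .withColumn("cum_frequency", fn.sum("frequency").over(w))
          .withColumn("cum_percent", fn.sum("percent").over(w))
    )
    freq.orderBy("category").show()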

Rowwise sum per group and add total as a new row in dataframe in Pyspark

Submitted by 自古美人都是妖i on 2021-01-29 14:50:09
Question: I have a dataframe like this sample:

    df = spark.createDataFrame(
        [(2, "A", "A2", 2500),
         (2, "A", "A11", 3500),
         (2, "A", "A12", 5500),
         (4, "B", "B25", 7600),
         (4, "B", "B26", 5600),
         (5, "C", "c25", 2658),
         (5, "C", "c27", 1100),
         (5, "C", "c28", 1200)],
        ['parent', 'group', "brand", "usage"])

    output:
    +------+-----+-----+-----+
    |parent|group|brand|usage|
    +------+-----+-----+-----+
    |     2|    A|   A2| 2500|
    |     2|    A|  A11| 3500|
    |     4|    B|  B25| 7600|
    |     4|    B|  B26| 5600|
    |     5|    C|  c25| 2658|
    |     5|    C|
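
The entry is cut off before the expected output, but the title describes per-group sums appended as extra rows. A rough sketch of one way to do that, using the sample data above (the label "total" used for the summary rows is an assumption):

    from pyspark.sql import SparkSession, functions as fn

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(2, "A", "A2", 2500), (2, "A", "A11", 3500), (2, "A", "A12", 5500),
         (4, "B", "B25", 7600), (4, "B", "B26", 5600),
         (5, "C", "c25", 2658), (5, "C", "c27", 1100), (5, "C", "c28", 1200)],
        ["parent", "group", "brand", "usage"])

    # Sum usage per (parent, group), label the brand column, union with the original rows,
    # then sort so each group's total row follows its detail rows
    totals = (
        df.groupBy("parent", "group")
          .agg(fn.sum("usage").alias("usage"))
          .withColumn("brand", fn.lit("total"))
          .select("parent", "group", "brand", "usage")
    )
    result = df.unionByName(totals).orderBy("parent", "group", "brand")
    result.show()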