pyspark

Rolling average without timestamp in pyspark

Submitted by 冷暖自知 on 2021-01-28 11:25:33
Question: We can find the rolling/moving average of time-series data using a window function in PySpark. The data I am dealing with doesn't have a timestamp column, but it does have a strictly increasing column frame_number. The data looks like this:

    d = [{'session_id': 1, 'frame_number': 1, 'rtd': 11.0, 'rtd2': 11.0,},
         {'session_id': 1, 'frame_number': 2, 'rtd': 12.0, 'rtd2': 6.0},
         {'session_id': 1, 'frame_number': 3, 'rtd': 4.0, 'rtd2': 233.0},
         {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2'
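
A minimal sketch of one way to do this, assuming a SparkSession named spark and a row-based window of the two preceding frames plus the current one; the window width and the fourth rtd2 value (truncated above) are assumptions, not taken from the question:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    d = [
        {'session_id': 1, 'frame_number': 1, 'rtd': 11.0, 'rtd2': 11.0},
        {'session_id': 1, 'frame_number': 2, 'rtd': 12.0, 'rtd2': 6.0},
        {'session_id': 1, 'frame_number': 3, 'rtd': 4.0, 'rtd2': 233.0},
        {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2': 1.0},  # rtd2 assumed; truncated in the question
    ]
    df = spark.createDataFrame(d)

    # Order by the strictly increasing frame_number and average over a fixed number of rows,
    # which plays the role a time-based range would play if a timestamp existed.
    w = Window.partitionBy('session_id').orderBy('frame_number').rowsBetween(-2, 0)
    df.withColumn('rtd_rolling_avg', F.avg('rtd').over(w)).show()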

How to add columns to a pyspark dataframe dynamically

Submitted by 倾然丶 夕夏残阳落幕 on 2021-01-28 10:57:15
Question: I am trying to add a few columns based on the input variable vIssueCols:

    from pyspark.sql import HiveContext
    from pyspark.sql import functions as F
    from pyspark.sql.window import Window
    vIssueCols=['jobid','locid']
    vQuery1 = 'vSrcData2= vSrcData'
    vWindow1 = Window.partitionBy("vKey").orderBy("vOrderBy")
    for x in vIssueCols: vQuery1=vQuery1+'.withColumn("'+x+'_prev",F.lag(vSrcData.'+x+').over(vWindow1))'
    exec(vQuery1)

Now the above will generate vQuery1 as below, and it is working, but vSrcData2=
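
A minimal sketch of the same idea without string building and exec(): since withColumn returns a new DataFrame, the columns can be added in an ordinary Python loop. The sample input frame below is hypothetical; the key, ordering, and issue column names follow the question.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical input with the columns referenced in the question.
    vSrcData = spark.createDataFrame(
        [(1, 10, 101, 201), (1, 20, 102, 202), (2, 10, 103, 203)],
        ['vKey', 'vOrderBy', 'jobid', 'locid'],
    )

    vIssueCols = ['jobid', 'locid']
    vWindow1 = Window.partitionBy('vKey').orderBy('vOrderBy')

    # Each iteration adds one <col>_prev column holding the previous row's value.
    vSrcData2 = vSrcData
    for x in vIssueCols:
        vSrcData2 = vSrcData2.withColumn(x + '_prev', F.lag(F.col(x)).over(vWindow1))

    vSrcData2.show()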

Efficiently batching Spark dataframes to call an API

Submitted by 谁说胖子不能爱 on 2021-01-28 10:52:55
Question: I am fairly new to Spark and I'm trying to call the Spotify API using Spotipy. I have a list of artist ids which can be used to fetch artist info. The Spotify API allows batch calls of up to 50 ids at once. I load the artist ids from a MySQL database and store them in a dataframe. My problem now is that I do not know how to efficiently batch that dataframe into pieces of 50 or fewer rows. In the example below I'm turning the dataframe into a regular Python list from which I can call the API
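
A minimal sketch of one simple batching approach, assuming the id list fits on the driver: collect the ids, slice them into chunks of at most 50, and call the API once per chunk. fetch_artists is a hypothetical stand-in for the Spotipy call, and the sample frame is made up.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical frame standing in for the artist ids loaded from MySQL.
    artist_df = spark.createDataFrame([(f'artist_{i}',) for i in range(120)], ['artist_id'])

    def fetch_artists(id_batch):
        # Placeholder for the real call, e.g. spotipy.Spotify(...).artists(id_batch).
        return [{'id': artist_id} for artist_id in id_batch]

    ids = [row.artist_id for row in artist_df.select('artist_id').collect()]

    results = []
    for start in range(0, len(ids), 50):  # 50 is the Spotify batch limit from the question
        results.extend(fetch_artists(ids[start:start + 50]))

    print(len(results))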

Jupyter + EMR + Spark - Connect to EMR cluster from Jupyter notebook on local machine

Submitted by 微笑、不失礼 on 2021-01-28 10:13:13
Question: I am new to PySpark and EMR. I am trying to access Spark running on an EMR cluster from a Jupyter notebook, but I am running into errors. I am generating the SparkSession using the following code:

    spark = SparkSession.builder \
        .master("local[*]")\
        .appName("Carbon - SingleWell parallelization on Spark")\
        .getOrCreate()

I tried the following to access the remote cluster, but it errored out:

    spark = SparkSession.builder \
        .master("spark://<remote-emr-ec2-hostname>:7077")\
        .appName("Carbon - SingleWell parallelization
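
A minimal sketch, assuming the notebook runs somewhere that can see the cluster's Hadoop/Spark configuration (for example on the EMR master node): Spark on EMR is scheduled through YARN rather than a standalone spark:// master, so port 7077 is not what the cluster exposes. Connecting from a truly local machine is usually done through Livy and sparkmagic instead of a direct master URL.

    from pyspark.sql import SparkSession

    # Submit to YARN instead of a standalone spark:// master.
    spark = (
        SparkSession.builder
        .master('yarn')
        .appName('Carbon - SingleWell parallelization on Spark')
        .getOrCreate()
    )
    print(spark.version)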

Issue with df.show() in pyspark

Submitted by ﹥>﹥吖頭↗ on 2021-01-28 09:19:37
Question: I have the following code:

    import pyspark
    import pandas as pd
    from pyspark.sql import SQLContext
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, StringType

    sc = pyspark.SparkContext()
    sqlCtx = SQLContext(sc)

    df_pd = pd.DataFrame(
        data={'integers': [1, 2, 3],
              'floats': [-1.0, 0.5, 2.7],
              'integer_arrays': [[1, 2], [3, 4, 5], [6, 7, 8, 9]]}
    )

    df = sqlCtx.createDataFrame(df_pd)
    df.printSchema()

This runs fine until here, but when I run df.show() it gives this error:
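
The traceback in the question is cut off, so the root cause is an assumption here. A minimal sketch that builds the same frame with an explicit schema, skipping pandas-based inference, can at least narrow down whether the failure comes from schema inference of the array column or from the Python worker environment:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType, DoubleType, ArrayType

    spark = SparkSession.builder.getOrCreate()

    # Explicit schema for the three columns used in the question.
    schema = StructType([
        StructField('integers', LongType(), True),
        StructField('floats', DoubleType(), True),
        StructField('integer_arrays', ArrayType(LongType()), True),
    ])

    rows = [(1, -1.0, [1, 2]), (2, 0.5, [3, 4, 5]), (3, 2.7, [6, 7, 8, 9])]
    df = spark.createDataFrame(rows, schema)
    df.printSchema()
    df.show(truncate=False)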

Input data to AWS Elastic Search using Glue

Submitted by 被刻印的时光 ゝ on 2021-01-28 09:03:01
Question: I'm looking for a solution to insert data into AWS Elasticsearch using AWS Glue (Python or PySpark). I have seen the Boto3 SDK for Elasticsearch but could not find any function to insert data into Elasticsearch. Can anyone help me find a solution? Any useful links or code?

Answer 1: For AWS Glue you need to add an additional jar to the job. Download the jar from https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-hadoop/7.8.0/elasticsearch-hadoop-7.8.0.jar Save the jar on S3 and pass it
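
Once that jar is on the job's classpath, a write through the elasticsearch-hadoop Spark SQL connector looks roughly like the sketch below; the domain endpoint, index name, and option values are placeholders and will depend on how the AWS Elasticsearch domain is secured.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy frame standing in for whatever Glue extracted.
    df = spark.createDataFrame([(1, 'alice'), (2, 'bob')], ['id', 'name'])

    (
        df.write
        .format('org.elasticsearch.spark.sql')
        .option('es.nodes', 'https://<your-domain>.es.amazonaws.com')  # placeholder endpoint
        .option('es.port', '443')
        .option('es.nodes.wan.only', 'true')  # typically needed for a managed endpoint
        .option('es.resource', 'my-index')    # placeholder target index
        .mode('append')
        .save()
    )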

Checking whether a column has a proper decimal number

Submitted by ♀尐吖头ヾ on 2021-01-28 08:55:14
Question: I have a dataframe (input_dataframe) that looks like this:

    id  test_column
    1   0.25
    2   1.1
    3   12
    4   test
    5   1.3334
    6   .11

I want to add a column result that holds 1 if test_column has a decimal value and 0 for any other value. The data type of test_column is string. Below is the expected output:

    id  test_column  result
    1   0.25         1
    2   1.1          1
    3   12           0
    4   test         0
    5   1.3334       1
    6   .11          1

Can we achieve this using PySpark code?

Answer 1: You can parse the decimal token with decimal.Decimal(). Here we are
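
The answer above goes through decimal.Decimal() inside a UDF; for comparison, here is a minimal sketch of a regex-based alternative that reproduces the expected output shown in the question (1 only for values with a fractional part, so 12 maps to 0):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    input_dataframe = spark.createDataFrame(
        [(1, '0.25'), (2, '1.1'), (3, '12'), (4, 'test'), (5, '1.3334'), (6, '.11')],
        ['id', 'test_column'],
    )

    # Flag strings of the form "<digits>.<digits>" (leading digits optional, as in ".11").
    result_df = input_dataframe.withColumn(
        'result',
        F.when(F.col('test_column').rlike(r'^\d*\.\d+$'), 1).otherwise(0),
    )
    result_df.show()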

Is there a way to collect the names of all fields in a nested schema in pyspark

Submitted by 孤者浪人 on 2021-01-28 08:40:33
Question: I wish to collect the names of all the fields in a nested schema. The data were imported from a JSON file. The schema looks like:

    root
     |-- column_a: string (nullable = true)
     |-- column_b: string (nullable = true)
     |-- column_c: struct (nullable = true)
     |    |-- nested_a: struct (nullable = true)
     |    |    |-- double_nested_a: string (nullable = true)
     |    |    |-- double_nested_b: string (nullable = true)
     |    |    |-- double_nested_c: string (nullable = true)
     |    |-- nested_b: string (nullable = true)
     |-- column_d:
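
A minimal sketch of a recursive walk over df.schema that collects every field name, using dotted prefixes for fields nested inside structs (assuming df is the DataFrame read from the JSON file; arrays of structs would need an extra branch):

    from pyspark.sql.types import StructType

    def collect_field_names(schema, prefix=''):
        # Walk the StructType depth-first and record each field's full dotted name.
        names = []
        for field in schema.fields:
            full_name = prefix + field.name
            names.append(full_name)
            if isinstance(field.dataType, StructType):
                names.extend(collect_field_names(field.dataType, full_name + '.'))
        return names

    # Usage, assuming `df` is the DataFrame loaded from the json file:
    # print(collect_field_names(df.schema))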

pyspark dataframe with json column: aggregate the json elements into a new column and remove duplicates

Submitted by 血红的双手。 on 2021-01-28 08:02:36
Question: I am trying to read a PySpark dataframe with a JSON column on Databricks. The dataframe:

    year  month  json_col
    2010  09     [{"p_id":"vfdvtbe"}, {"p_id":"cdscs"}, {"p_id":"usdvwq"}]
    2010  09     [{"p_id":"ujhbe"}, {"p_id":"cdscs"}, {"p_id":"yjev"}]
    2007  10     [{"p_id":"ukerge"}, {"p_id":"ikrtw"}, {"p_id":"ikwca"}]
    2007  10     [{"p_id":"unvwq"}, {"p_id":"cqwcq"}, {"p_id":"ikwca"}]

I need a new dataframe in which all duplicated "p_id" values are removed and aggregated by year and month:

    year  month  p_id (string)
    2010  09     [
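
A minimal sketch, assuming json_col is stored as a JSON string: parse it with an explicit array-of-struct schema, explode the p_id values, and collect a deduplicated set per year and month.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import ArrayType, StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    data = [
        ('2010', '09', '[{"p_id":"vfdvtbe"}, {"p_id":"cdscs"}, {"p_id":"usdvwq"}]'),
        ('2010', '09', '[{"p_id":"ujhbe"}, {"p_id":"cdscs"}, {"p_id":"yjev"}]'),
        ('2007', '10', '[{"p_id":"ukerge"}, {"p_id":"ikrtw"}, {"p_id":"ikwca"}]'),
        ('2007', '10', '[{"p_id":"unvwq"}, {"p_id":"cqwcq"}, {"p_id":"ikwca"}]'),
    ]
    df = spark.createDataFrame(data, ['year', 'month', 'json_col'])

    # The JSON column is an array of objects, each holding a single p_id string.
    json_schema = ArrayType(StructType([StructField('p_id', StringType())]))

    result = (
        df.withColumn('parsed', F.from_json('json_col', json_schema))
          .withColumn('p_id', F.explode(F.col('parsed.p_id')))
          .groupBy('year', 'month')
          .agg(F.collect_set('p_id').alias('p_id'))   # collect_set drops duplicates
    )
    result.show(truncate=False)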

Apache Spark 2.0 (PySpark) - DataFrame Error Multiple sources found for csv

Submitted by ↘锁芯ラ on 2021-01-28 08:01:10
Question: I am trying to create a dataframe using the following code in Spark 2.0. While executing the code in Jupyter/the console, I am facing the error below. Can someone help me get rid of this error?

Error:

    Py4JJavaError: An error occurred while calling o34.csv.
    : java.lang.RuntimeException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, com.databricks.spark.csv.DefaultSource15), please specify the fully qualified class name.
        at scala.sys.package$
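
A minimal sketch of the workaround the error message itself suggests: name the CSV source by its fully qualified class so Spark does not have to choose between the built-in Spark 2 reader and the old com.databricks:spark-csv package on the classpath (alternatively, drop that package from the job's dependencies). The file path is a placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (
        spark.read
        .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')
        .option('header', 'true')
        .load('/path/to/file.csv')  # placeholder path
    )
    df.show()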