pyspark

Timezone conversion with pyspark from timestamp and country

爱⌒轻易说出口 submitted on 2021-01-28 18:44:31
Question: I'm trying to convert a UTC date to a date in the local timezone (derived from the country) with PySpark. I have the country as a string and the date as a timestamp. So the input is:

date = Timestamp('2016-11-18 01:45:55')  # type is pandas._libs.tslibs.timestamps.Timestamp
country = "FR"  # type is string

import pytz
import pandas as pd

def convert_date_spark(date, country):
    timezone = pytz.country_timezones(country)[0]
    local_time = date.replace(tzinfo = pytz.utc).astimezone(timezone)
    date, time = local_time
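
A minimal sketch of a Spark-native alternative (not the asker's final code): map the country code to a timezone name with pytz inside a UDF, then let from_utc_timestamp shift the UTC timestamp. The column names date and country are taken from the question; passing a column as the timezone argument needs Spark 2.4+.

import pytz
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

@F.udf(returnType=StringType())
def country_to_tz(country):
    # pytz may list several zones per country; take the first, as in the question
    return pytz.country_timezones(country)[0]

df = spark.createDataFrame(
    [("2016-11-18 01:45:55", "FR")], ["date", "country"]
).withColumn("date", F.to_timestamp("date"))

local_df = df.withColumn(
    "local_date", F.from_utc_timestamp(F.col("date"), country_to_tz(F.col("country")))
)
local_df.show(truncate=False)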

Use external library in pandas_udf in pyspark

依然范特西╮ submitted on 2021-01-28 18:39:09
Question: Is it possible to use an external library like textdistance inside a pandas_udf? I have tried and I get this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I have tried with Spark version 2.3.1. Answer 1: You can package textdistance together with your own code (use setup.py and bdist_egg to build an egg file) and pass the final package with the --py-files option when you submit the Spark job. By the way, the error message doesn't seem to relate
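
A minimal sketch of how such a pandas_udf could look on Spark 2.3, assuming textdistance is importable on the executors (shipped via --py-files or installed cluster-wide); the column names in the commented-out usage line are hypothetical. A pandas_udf receives whole pandas Series, so the library has to be applied element-wise; treating a Series as a single string is a common cause of the "truth value of a Series is ambiguous" error.

import pandas as pd
import textdistance
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def levenshtein_sim(s1, s2):
    # s1 and s2 are pandas Series; compute the similarity row by row
    return pd.Series(
        [textdistance.levenshtein.normalized_similarity(a, b) for a, b in zip(s1, s2)]
    )

# df = df.withColumn("sim", levenshtein_sim("name_a", "name_b"))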

Rename Column in Athena

时光总嘲笑我的痴心妄想 submitted on 2021-01-28 14:27:05
Question: An Athena table "organization" reads data from Parquet files in S3. I need to change a column name from "cost" to "fee". The data files go back to Jan 2018. If I just rename the column in Athena, the table won't be able to find the data for the new column in the Parquet files. Please let me know if there are ways to resolve this. Answer 1: You have to change the schema and point to the new column "fee", but it depends on your situation. If you have two data sets, in one dataset it is called "cost" and in another dataset it is
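
One possible direction, sketched here as an assumption rather than a continuation of the truncated answer: recreate the table so that Parquet columns are resolved by position instead of by name, so the existing files' "cost" column lines up with the new name "fee". The column list, bucket paths, and database name below are placeholders, and the 'parquet.column.index.access' serde property should be checked against the Athena schema-update documentation for your setup.

import boto3

# Placeholder DDL: only "fee" comes from the question; everything else is illustrative.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS organization_v2 (
  id string,
  fee double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('parquet.column.index.access'='true')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://my-bucket/organization/'
"""

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-query-results/"},
)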

How to relationalize JSON containing arrays

假如想象 submitted on 2021-01-28 14:09:14
Question: I am using AWS Glue to read a data file containing JSON (on S3). This one is JSON with the data contained in an array. I have tried using the relationalize() function but it doesn't work on the array. It does work on nested JSON, but that is not the format of the input. Is there a way to relationalize JSON with arrays in it? Input data:

{
    "ID": "1234",
    "territory": "US",
    "imgList": [
        { "type": "box", "locale": "en-US", "url": "boxart/url.jpg" },
        { "type": "square", "locale": "en-US", "url": "square/url.jpg" }
    ]
}

Code:
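
A hedged sketch of one way to flatten the imgList array with plain PySpark (an alternative to Glue's Relationalize, since the asker's own code is truncated above): explode the array, then promote the struct fields to columns. The S3 path is a placeholder.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# multiLine is needed because each JSON record spans several lines
df = spark.read.option("multiLine", True).json("s3://my-bucket/input/data.json")

flat = (
    df.withColumn("img", F.explode("imgList"))
      .select("ID", "territory", "img.type", "img.locale", "img.url")
)
flat.show(truncate=False)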

Convert Nested dictionary to Pyspark Dataframe

匆匆过客 submitted on 2021-01-28 13:34:46
Question: Greetings, fellow programmers. I have recently started with pyspark and come from a pandas background. I need to compute the similarity of each user in a dataset against every other user. As I couldn't find a way to do this in pyspark, I resorted to using a Python dictionary to build the similarities. However, I have run out of ideas for converting a nested dictionary into a pyspark DataFrame. Could you please point me in a direction to achieve the desired result?

import pyspark
from pyspark.context import SparkContext
from
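
A minimal sketch of one way to do the conversion, assuming the nested dictionary has the shape {user: {other_user: similarity}} (the actual structure in the question is truncated): flatten it into (user, other_user, similarity) tuples and build a DataFrame from those.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical similarity dictionary
sim = {"u1": {"u2": 0.8, "u3": 0.1}, "u2": {"u1": 0.8, "u3": 0.5}}

rows = [(u, v, s) for u, inner in sim.items() for v, s in inner.items()]
df = spark.createDataFrame(rows, ["user", "other_user", "similarity"])
df.show()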

Issue with creating a global list from map using PySpark

一笑奈何 submitted on 2021-01-28 12:22:48
Question: I have this code where I am reading a file in IPython using pyspark. What I am trying to do is add a piece to it that builds a list from a particular column read from the file, but when I execute it the list comes out empty and nothing gets appended to it. My code is:

list1 = []

def file_read(line):
    list1.append(line[10])
    # bunch of other code which processes other column indexes on `line`

inputData = sc.textFile(fileName).zipWithIndex().filter(lambda (line,rownum):
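
A hedged sketch of why the list stays empty and one way around it: the mapped function runs on the executors with their own serialized copies of list1, so the driver's list never sees the appends. Instead, keep the work inside RDD transformations and bring the values back explicitly with collect(). The delimiter and file name are assumptions; the column index 10 is taken from the truncated question.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
fileName = "data.txt"  # placeholder path

values = (
    sc.textFile(fileName)
      .map(lambda line: line.split(","))   # assumed delimiter
      .map(lambda cols: cols[10])
      .collect()                           # returns a plain Python list on the driver
)
print(values[:5])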

Rolling average without timestamp in pyspark

我怕爱的太早我们不能终老 submitted on 2021-01-28 11:42:08
Question: We can find the rolling/moving average of time series data using a window function in pyspark. The data I am dealing with doesn't have a timestamp column, but it does have a strictly increasing column, frame_number. The data looks like this:

d = [
    {'session_id': 1, 'frame_number': 1, 'rtd': 11.0, 'rtd2': 11.0,},
    {'session_id': 1, 'frame_number': 2, 'rtd': 12.0, 'rtd2': 6.0},
    {'session_id': 1, 'frame_number': 3, 'rtd': 4.0, 'rtd2': 233.0},
    {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2'
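
A hedged sketch of a rolling average keyed on the strictly increasing frame_number instead of a timestamp: order the window by frame_number and use a row-based frame. The window size (the current row plus the two preceding rows) is an arbitrary choice, and the last rtd2 value below is a placeholder because the question's data is truncated.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

d = [
    {'session_id': 1, 'frame_number': 1, 'rtd': 11.0, 'rtd2': 11.0},
    {'session_id': 1, 'frame_number': 2, 'rtd': 12.0, 'rtd2': 6.0},
    {'session_id': 1, 'frame_number': 3, 'rtd': 4.0, 'rtd2': 233.0},
    {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2': 0.0},  # rtd2 is a placeholder
]
df = spark.createDataFrame(d)

w = Window.partitionBy('session_id').orderBy('frame_number').rowsBetween(-2, 0)
df = df.withColumn('rtd_rolling_avg', F.avg('rtd').over(w))
df.orderBy('session_id', 'frame_number').show()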