pyspark

Timezone conversion with pyspark from timestamp and country

爱⌒轻易说出口 submitted on 2021-01-28 18:44:31
Question: I'm trying to convert a UTC date to a date in the local timezone (derived from the country) with PySpark. I have the country as a string and the date as a timestamp. So the input is:

date = Timestamp('2016-11-18 01:45:55')  # type is pandas._libs.tslibs.timestamps.Timestamp
country = "FR"  # type is string

import pytz
import pandas as pd

def convert_date_spark(date, country):
    timezone = pytz.country_timezones(country)[0]
    local_time = date.replace(tzinfo = pytz.utc).astimezone(timezone)
    date, time = local_time
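
A minimal sketch of a Spark-native alternative (not the asker's final code): map the country code to a timezone name with pytz inside a UDF, then let from_utc_timestamp shift the UTC timestamp. The column names date and country are taken from the question; passing a column as the timezone argument needs Spark 2.4+.

import pytz
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

@F.udf(returnType=StringType())
def country_to_tz(country):
    # pytz may list several zones per country; take the first, as in the question
    return pytz.country_timezones(country)[0]

df = spark.createDataFrame(
    [("2016-11-18 01:45:55", "FR")], ["date", "country"]
).withColumn("date", F.to_timestamp("date"))

local_df = df.withColumn(
    "local_date", F.from_utc_timestamp(F.col("date"), country_to_tz(F.col("country")))
)
local_df.show(truncate=False)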

Use external library in pandas_udf in pyspark

依然范特西╮ submitted on 2021-01-28 18:39:09
Question: Is it possible to use an external library like textdistance inside a pandas_udf? I have tried and I get this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). I have tried with Spark version 2.3.1. Answer 1: You can package textdistance together with your own code (use setup.py and bdist_egg to build an egg file) and pass the final package with the --py-files option when you submit the Spark job. By the way, the error message doesn't seem to relate
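
A minimal sketch of how such a pandas_udf could look on Spark 2.3, assuming textdistance is importable on the executors (shipped via --py-files or installed cluster-wide); the column names in the commented-out usage line are hypothetical. A pandas_udf receives whole pandas Series, so the library has to be applied element-wise; treating a Series as a single string is a common cause of the "truth value of a Series is ambiguous" error.

import pandas as pd
import textdistance
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def levenshtein_sim(s1, s2):
    # s1 and s2 are pandas Series; compute the similarity row by row
    return pd.Series(
        [textdistance.levenshtein.normalized_similarity(a, b) for a, b in zip(s1, s2)]
    )

# df = df.withColumn("sim", levenshtein_sim("name_a", "name_b"))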

Rename Column in Athena

时光总嘲笑我的痴心妄想 submitted on 2021-01-28 14:27:05
Question: An Athena table "organization" reads data from Parquet files in S3. I need to change a column name from "cost" to "fee". The data files go back to Jan 2018. If I just rename the column in Athena, the table won't be able to find the data for the new column in the Parquet files. Please let me know if there are ways to resolve this. Answer 1: You have to change the schema and point to the new column "fee", but it depends on your situation. If you have two data sets, in one dataset it is called "cost" and in another dataset it is
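
One possible direction, sketched here as an assumption rather than a continuation of the truncated answer: recreate the table so that Parquet columns are resolved by position instead of by name, so the existing files' "cost" column lines up with the new name "fee". The column list, bucket paths, and database name below are placeholders, and the 'parquet.column.index.access' serde property should be checked against the Athena schema-update documentation for your setup.

import boto3

# Placeholder DDL: only "fee" comes from the question; everything else is illustrative.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS organization_v2 (
  id string,
  fee double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('parquet.column.index.access'='true')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://my-bucket/organization/'
"""

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-query-results/"},
)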

How to relationalize JSON containing arrays

假如想象 submitted on 2021-01-28 14:09:14
Question: I am using AWS Glue to read a data file containing JSON (on S3). This one is JSON with the data contained in an array. I have tried using the relationalize() function but it doesn't work on the array. It does work on nested JSON, but that is not the format of the input. Is there a way to relationalize JSON with arrays in it? Input data:

{
    "ID": "1234",
    "territory": "US",
    "imgList": [
        { "type": "box", "locale": "en-US", "url": "boxart/url.jpg" },
        { "type": "square", "locale": "en-US", "url": "square/url.jpg" }
    ]
}

Code:
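
A hedged sketch of one way to flatten the imgList array with plain PySpark (an alternative to Glue's Relationalize, since the asker's own code is truncated above): explode the array, then promote the struct fields to columns. The S3 path is a placeholder.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# multiLine is needed because each JSON record spans several lines
df = spark.read.option("multiLine", True).json("s3://my-bucket/input/data.json")

flat = (
    df.withColumn("img", F.explode("imgList"))
      .select("ID", "territory", "img.type", "img.locale", "img.url")
)
flat.show(truncate=False)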

Convert Nested dictionary to Pyspark Dataframe

匆匆过客 submitted on 2021-01-28 13:34:46
Question: Greetings, fellow programmers. I have recently started with pyspark and come from a pandas background. I need to compute the similarity of each user in a dataset against every other user. As I couldn't find a way to do this in pyspark, I resorted to using a Python dictionary to build the similarities. However, I have run out of ideas for converting a nested dictionary into a pyspark DataFrame. Could you please point me in a direction to achieve the desired result?

import pyspark
from pyspark.context import SparkContext
from
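
A minimal sketch of one way to do the conversion, assuming the nested dictionary has the shape {user: {other_user: similarity}} (the actual structure in the question is truncated): flatten it into (user, other_user, similarity) tuples and build a DataFrame from those.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical similarity dictionary
sim = {"u1": {"u2": 0.8, "u3": 0.1}, "u2": {"u1": 0.8, "u3": 0.5}}

rows = [(u, v, s) for u, inner in sim.items() for v, s in inner.items()]
df = spark.createDataFrame(rows, ["user", "other_user", "similarity"])
df.show()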

Issue with creating a global list from map using PySpark

一笑奈何 submitted on 2021-01-28 12:22:48
Question: I have this code where I am reading a file in IPython using pyspark. What I am trying to do is add a piece to it that builds a list from a particular column read from the file, but when I execute it the list comes out empty and nothing gets appended to it. My code is:

list1 = []

def file_read(line):
    list1.append(line[10])
    # bunch of other code which processes other column indexes on `line`

inputData = sc.textFile(fileName).zipWithIndex().filter(lambda (line,rownum):
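
A hedged sketch of why the list stays empty and one way around it: the mapped function runs on the executors with their own serialized copies of list1, so the driver's list never sees the appends. Instead, keep the work inside RDD transformations and bring the values back explicitly with collect(). The delimiter and file name are assumptions; the column index 10 is taken from the truncated question.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
fileName = "data.txt"  # placeholder path

values = (
    sc.textFile(fileName)
      .map(lambda line: line.split(","))   # assumed delimiter
      .map(lambda cols: cols[10])
      .collect()                           # returns a plain Python list on the driver
)
print(values[:5])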

Rolling average without timestamp in pyspark

我怕爱的太早我们不能终老 submitted on 2021-01-28 11:42:08
Question: We can find the rolling/moving average of time series data using a window function in pyspark. The data I am dealing with doesn't have a timestamp column, but it does have a strictly increasing column, frame_number. The data looks like this:

d = [
    {'session_id': 1, 'frame_number': 1, 'rtd': 11.0, 'rtd2': 11.0,},
    {'session_id': 1, 'frame_number': 2, 'rtd': 12.0, 'rtd2': 6.0},
    {'session_id': 1, 'frame_number': 3, 'rtd': 4.0, 'rtd2': 233.0},
    {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2'
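
A hedged sketch of a rolling average keyed on the strictly increasing frame_number instead of a timestamp: order the window by frame_number and use a row-based frame. The window size (the current row plus the two preceding rows) is an arbitrary choice, and the last rtd2 value below is a placeholder because the question's data is truncated.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

d = [
    {'session_id': 1, 'frame_number': 1, 'rtd': 11.0, 'rtd2': 11.0},
    {'session_id': 1, 'frame_number': 2, 'rtd': 12.0, 'rtd2': 6.0},
    {'session_id': 1, 'frame_number': 3, 'rtd': 4.0, 'rtd2': 233.0},
    {'session_id': 1, 'frame_number': 4, 'rtd': 110.0, 'rtd2': 0.0},  # rtd2 is a placeholder
]
df = spark.createDataFrame(d)

w = Window.partitionBy('session_id').orderBy('frame_number').rowsBetween(-2, 0)
df = df.withColumn('rtd_rolling_avg', F.avg('rtd').over(w))
df.orderBy('session_id', 'frame_number').show()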