pyspark

Removing rows in a nested struct in a spark dataframe using PySpark (details in text)

不羁的心 submitted on 2020-05-15 21:21:07
Question: I am using pyspark and I have a dataframe object df. This is what the output of df.printSchema() looks like:

    root
     |-- M_MRN: string (nullable = true)
     |-- measurements: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- Observation_ID: string (nullable = true)
     |    |    |-- Observation_Name: string (nullable = true)
     |    |    |-- Observation_Result: string (nullable = true)

I would like to filter out all the array elements in 'measurements' where the Observation_ID is not '5' or '10'.
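
A minimal sketch of one way to do this, assuming Spark 2.4 or later (which provides the filter higher-order function) and the schema shown above; on older versions you would explode and re-aggregate instead:

    from pyspark.sql import functions as F

    # Keep only the struct elements whose Observation_ID is '5' or '10';
    # all other elements of the 'measurements' array are dropped.
    filtered = df.withColumn(
        "measurements",
        F.expr("filter(measurements, m -> m.Observation_ID IN ('5', '10'))")
    )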

If dataframes in Spark are immutable, why are we able to modify it with operations such as withColumn()?

陌路散爱 submitted on 2020-05-15 10:23:47
Question: This is probably a stupid question originating from my ignorance. I have been working with PySpark for a few weeks now and do not have much programming experience to start with. My understanding is that in Spark, RDDs, DataFrames, and Datasets are all immutable, which, as I understand it, means you cannot change the data. If so, why are we able to edit a DataFrame's existing column using withColumn()?

Answer 1: As per the Spark architecture, a DataFrame is built on top of RDDs, which are immutable in …
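
A short sketch of what actually happens, independent of any particular data: withColumn does not modify the DataFrame it is called on; it returns a new DataFrame whose plan adds or replaces the column, and the original is left untouched.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["x"])

    # withColumn returns a brand-new DataFrame; df itself never changes.
    df2 = df.withColumn("x", F.col("x") * 10)

    df.show()   # still shows 1 and 2
    df2.show()  # shows 10 and 20

So "editing" a column really means building a new, derived DataFrame and, usually, rebinding the same variable name to it.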

How do I handle errors in mapped functions in AWS Glue?

此生再无相见时 submitted on 2020-05-15 08:47:07
Question: I'm using the map method of DynamicFrame (or, equivalently, the Map.apply method). I've noticed that any errors in the function that I pass to these methods are silently ignored and cause the returned DynamicFrame to be empty. Say I have a job script like this:

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.transforms import *

    glueContext = GlueContext(SparkContext.getOrCreate())
    dyF = glueContext.create_dynamic_frame.from_catalog …
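
One common workaround, sketched here under assumptions (the dyF frame from the excerpt, a made-up input field "name", and a made-up error column "_error"), is to catch exceptions inside the mapped function and carry the error along with the record, so failures become visible instead of silently yielding an empty DynamicFrame:

    import traceback

    def transform(record):
        try:
            # Hypothetical transformation; replace with the real logic.
            record["upper_name"] = record["name"].upper()
            record["_error"] = None
        except Exception:
            # Keep the record and attach the traceback instead of letting
            # the error disappear.
            record["_error"] = traceback.format_exc()
        return record

    mapped = dyF.map(f=transform)

    # Records that failed can then be inspected separately.
    failed = mapped.filter(f=lambda r: r["_error"] is not None)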

Split Spark DataFrame into two DataFrames (70% and 30% ) based on id column by preserving order

房东的猫 submitted on 2020-05-15 08:45:11
Question: I have a spark dataframe which looks like this:

    id  start_time  feature
    1   01-01-2018  3.567
    1   01-02-2018  4.454
    1   01-03-2018  6.455
    2   01-02-2018  343.4
    2   01-08-2018  45.4
    3   02-04-2018  43.56
    3   02-07-2018  34.56
    3   03-07-2018  23.6

I want to be able to split this into two dataframes based on the id column. So I should group by the id column, sort by start_time, and take 70% of the rows into one dataframe and 30% of the rows into another dataframe while preserving the order. The result should look like:

    Dataframe1:
    id …
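
One possible approach, sketched with the column names shown above: number the rows within each id ordered by start_time, then compare that row number against 70% of the id's row count. This assumes start_time sorts chronologically as stored; if it is a dd-mm-yyyy string, cast it to a date first.

    from pyspark.sql import Window, functions as F

    order_w = Window.partitionBy("id").orderBy("start_time")
    count_w = Window.partitionBy("id")

    ranked = (
        df.withColumn("rn", F.row_number().over(order_w))
          .withColumn("cnt", F.count("*").over(count_w))
    )

    # First ~70% of each id's rows (in start_time order) vs. the remaining ~30%.
    cutoff = F.ceil(F.col("cnt") * 0.7)
    df70 = ranked.filter(F.col("rn") <= cutoff).drop("rn", "cnt")
    df30 = ranked.filter(F.col("rn") > cutoff).drop("rn", "cnt")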

How to Compare Strings without case sensitive in Spark RDD?

生来就可爱ヽ(ⅴ<●) submitted on 2020-05-15 05:08:11
Question: I have the following dataset:

    drug_name,num_prescriber,total_cost
    AMBIEN,2,300
    BENZTROPINE MESYLATE,1,1500
    CHLORPROMAZINE,2,3000

I want to find the number of A's and B's from the above dataset, along with the header. I am using the following code to find the number of A's and the number of B's:

    from pyspark import SparkContext
    from pyspark.sql import SparkSession

    logFile = 'Sample.txt'
    spark = SparkSession.builder.appName('GD App').getOrCreate()
    logData = spark.read.text(logFile).cache()
    numAs = logData …
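
A sketch of one way to make the comparison case-insensitive. Instead of the raw-text read in the excerpt, this reads the sample as CSV with a header (an assumption about the file layout) and normalizes the column with upper() before comparing:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName('GD App').getOrCreate()
    df = spark.read.csv('Sample.txt', header=True)

    # Upper-casing the column first makes the startswith check case-insensitive.
    numAs = df.filter(F.upper(F.col('drug_name')).startswith('A')).count()
    numBs = df.filter(F.upper(F.col('drug_name')).startswith('B')).count()
    print(numAs, numBs)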

Calling another custom Python function from Pyspark UDF

点点圈 submitted on 2020-05-15 02:51:04
Question: Suppose you have a file, let's call it udfs.py, and in it:

    def nested_f(x):
        return x + 1

    def main_f(x):
        return nested_f(x) + 1

You then want to make a UDF out of the main_f function and run it on a dataframe:

    import pyspark.sql.functions as fn
    import pandas as pd

    pdf = pd.DataFrame([[1], [2], [3]], columns=['x'])
    df = spark.createDataFrame(pdf)

    _udf = fn.udf(main_f, 'int')
    df.withColumn('x1', _udf(df['x'])).show()

This works OK if we do this from within the same file as where the two functions …
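
When the UDF is used from a different file or notebook, the executors also need udfs.py on their Python path, not just the driver. A common fix, sketched here assuming the file layout above, is to ship the module with addPyFile and import from it:

    import pyspark.sql.functions as fn

    # Make udfs.py available on every executor.
    spark.sparkContext.addPyFile('udfs.py')

    from udfs import main_f

    _udf = fn.udf(main_f, 'int')
    df.withColumn('x1', _udf(df['x'])).show()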

Spark: converting GMT time stamps to Eastern taking daylight savings into account

给你一囗甜甜゛ submitted on 2020-05-14 18:13:52
Question: I'm trying to convert a column of GMT timestamp strings into a column of timestamps in the Eastern time zone. I want to take daylight saving into account. My column of timestamp strings looks like this: '2017-02-01T10:15:21+00:00'. I figured out how to convert the string column into a timestamp in EST:

    from pyspark.sql import functions as F

    df2 = df1.withColumn('datetimeGMT', df1.myTimeColumnInGMT.cast('timestamp'))
    df3 = df2.withColumn('datetimeEST', F.from_utc_timestamp(df2.datetimeGMT, "EST"))
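
A sketch of the usual fix: "EST" is treated as a fixed UTC-5 offset, so daylight saving is ignored; passing a region-based zone ID such as "America/New_York" to from_utc_timestamp lets Spark switch between EST and EDT automatically:

    from pyspark.sql import functions as F

    df2 = df1.withColumn('datetimeGMT', df1.myTimeColumnInGMT.cast('timestamp'))
    # A region-based zone honours daylight saving transitions.
    df3 = df2.withColumn('datetimeEastern',
                         F.from_utc_timestamp(df2.datetimeGMT, 'America/New_York'))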

ModuleNotFoundError: No module named 'py4j'

早过忘川 submitted on 2020-05-14 08:43:09
Question: I installed Spark and I am running into problems loading the pyspark module into ipython. I'm getting the following error:

    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input-2-49d7c4e178f8> in <module>
    ----> 1 import pyspark

    /opt/spark/python/pyspark/__init__.py in <module>
         44
         45 from pyspark.conf import SparkConf
    ---> 46 from pyspark.context import SparkContext
         47 from pyspark.rdd import RDD
         48 from pyspark.files import SparkFiles

    /opt/spark/python/pyspark/context.py in …
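
One common workaround, a sketch assuming the /opt/spark install path visible in the traceback: let findspark put Spark's bundled pyspark and py4j onto sys.path before importing (pip install findspark first):

    import findspark

    # Adds $SPARK_HOME/python and the bundled py4j-*-src.zip to sys.path.
    findspark.init('/opt/spark')

    import pyspark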

Unable to import SparkContext

我与影子孤独终老i submitted on 2020-05-14 02:25:51
Question: I'm working on CentOS. I've set up $SPARK_HOME and also added the path to bin in $PATH. I can run pyspark from anywhere. But when I try to create a Python file and use this statement:

    from pyspark import SparkConf, SparkContext

it throws the following error:

    python pysparktask.py
    Traceback (most recent call last):
      File "pysparktask.py", line 1, in <module>
        from pyspark import SparkConf, SparkContext
    ModuleNotFoundError: No module named 'pyspark'

I tried to install it again using pip: pip install …
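
A sketch of the usual fix when the pyspark shell works but plain python cannot import the package: put $SPARK_HOME/python and the bundled py4j zip on sys.path (or export them via PYTHONPATH), or simply pip-install pyspark into that interpreter. The zip name below is globbed because its version differs between Spark releases:

    import glob
    import os
    import sys

    spark_home = os.environ.get('SPARK_HOME', '/opt/spark')  # adjust to your install
    sys.path.insert(0, os.path.join(spark_home, 'python'))
    sys.path.insert(0, glob.glob(os.path.join(spark_home, 'python', 'lib', 'py4j-*-src.zip'))[0])

    from pyspark import SparkConf, SparkContext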