pyspark

Removing rows in a nested struct in a spark dataframe using PySpark (details in text)

不羁的心 submitted on 2020-05-15 21:21:07
Question: I am using pyspark and I have a dataframe object df. This is what the output of df.printSchema() looks like:

    root
     |-- M_MRN: string (nullable = true)
     |-- measurements: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- Observation_ID: string (nullable = true)
     |    |    |-- Observation_Name: string (nullable = true)
     |    |    |-- Observation_Result: string (nullable = true)

I would like to filter out all the array elements in 'measurements' where the Observation_ID is not '5' or '10'.
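
A minimal sketch of one way to do this, assuming Spark 2.4 or later (which provides the filter higher-order function) and the schema shown above; on older versions you would explode and re-aggregate instead:

    from pyspark.sql import functions as F

    # Keep only the struct elements whose Observation_ID is '5' or '10';
    # all other elements of the 'measurements' array are dropped.
    filtered = df.withColumn(
        "measurements",
        F.expr("filter(measurements, m -> m.Observation_ID IN ('5', '10'))")
    )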

If dataframes in Spark are immutable, why are we able to modify it with operations such as withColumn()?

陌路散爱 submitted on 2020-05-15 10:23:47
Question: This is probably a stupid question originating from my ignorance. I have been working with PySpark for a few weeks now and do not have much programming experience to start with. My understanding is that in Spark, RDDs, DataFrames, and Datasets are all immutable, which, as I understand it, means you cannot change the data. If so, why are we able to edit a DataFrame's existing column using withColumn()?

Answer 1: As per the Spark architecture, a DataFrame is built on top of RDDs, which are immutable in …
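
A short sketch of what actually happens, independent of any particular data: withColumn does not modify the DataFrame it is called on; it returns a new DataFrame whose plan adds or replaces the column, and the original is left untouched.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["x"])

    # withColumn returns a brand-new DataFrame; df itself never changes.
    df2 = df.withColumn("x", F.col("x") * 10)

    df.show()   # still shows 1 and 2
    df2.show()  # shows 10 and 20

So "editing" a column really means building a new, derived DataFrame and, usually, rebinding the same variable name to it.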

How do I handle errors in mapped functions in AWS Glue?

此生再无相见时 submitted on 2020-05-15 08:47:07
Question: I'm using the map method of DynamicFrame (or, equivalently, the Map.apply method). I've noticed that any errors in the function that I pass to these methods are silently ignored and cause the returned DynamicFrame to be empty. Say I have a job script like this:

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.transforms import *

    glueContext = GlueContext(SparkContext.getOrCreate())
    dyF = glueContext.create_dynamic_frame.from_catalog …
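
One common workaround, sketched here under assumptions (the dyF frame from the excerpt, a made-up input field "name", and a made-up error column "_error"), is to catch exceptions inside the mapped function and carry the error along with the record, so failures become visible instead of silently yielding an empty DynamicFrame:

    import traceback

    def transform(record):
        try:
            # Hypothetical transformation; replace with the real logic.
            record["upper_name"] = record["name"].upper()
            record["_error"] = None
        except Exception:
            # Keep the record and attach the traceback instead of letting
            # the error disappear.
            record["_error"] = traceback.format_exc()
        return record

    mapped = dyF.map(f=transform)

    # Records that failed can then be inspected separately.
    failed = mapped.filter(f=lambda r: r["_error"] is not None)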

Split Spark DataFrame into two DataFrames (70% and 30% ) based on id column by preserving order

房东的猫 submitted on 2020-05-15 08:45:11
Question: I have a spark dataframe which looks like this:

    id  start_time  feature
    1   01-01-2018  3.567
    1   01-02-2018  4.454
    1   01-03-2018  6.455
    2   01-02-2018  343.4
    2   01-08-2018  45.4
    3   02-04-2018  43.56
    3   02-07-2018  34.56
    3   03-07-2018  23.6

I want to be able to split this into two dataframes based on the id column. So I should group by the id column, sort by start_time, and take 70% of the rows into one dataframe and 30% of the rows into another dataframe while preserving the order. The result should look like:

    Dataframe1:
    id …
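
One possible approach, sketched with the column names shown above: number the rows within each id ordered by start_time, then compare that row number against 70% of the id's row count. This assumes start_time sorts chronologically as stored; if it is a dd-mm-yyyy string, cast it to a date first.

    from pyspark.sql import Window, functions as F

    order_w = Window.partitionBy("id").orderBy("start_time")
    count_w = Window.partitionBy("id")

    ranked = (
        df.withColumn("rn", F.row_number().over(order_w))
          .withColumn("cnt", F.count("*").over(count_w))
    )

    # First ~70% of each id's rows (in start_time order) vs. the remaining ~30%.
    cutoff = F.ceil(F.col("cnt") * 0.7)
    df70 = ranked.filter(F.col("rn") <= cutoff).drop("rn", "cnt")
    df30 = ranked.filter(F.col("rn") > cutoff).drop("rn", "cnt")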

How to Compare Strings without case sensitive in Spark RDD?

生来就可爱ヽ(ⅴ<●) submitted on 2020-05-15 05:08:11
Question: I have the following dataset:

    drug_name,num_prescriber,total_cost
    AMBIEN,2,300
    BENZTROPINE MESYLATE,1,1500
    CHLORPROMAZINE,2,3000

I want to find the number of A's and B's from the above dataset, along with the header. I am using the following code to find the number of A's and the number of B's:

    from pyspark import SparkContext
    from pyspark.sql import SparkSession

    logFile = 'Sample.txt'
    spark = SparkSession.builder.appName('GD App').getOrCreate()
    logData = spark.read.text(logFile).cache()
    numAs = logData …
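
A sketch of one way to make the comparison case-insensitive. Instead of the raw-text read in the excerpt, this reads the sample as CSV with a header (an assumption about the file layout) and normalizes the column with upper() before comparing:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName('GD App').getOrCreate()
    df = spark.read.csv('Sample.txt', header=True)

    # Upper-casing the column first makes the startswith check case-insensitive.
    numAs = df.filter(F.upper(F.col('drug_name')).startswith('A')).count()
    numBs = df.filter(F.upper(F.col('drug_name')).startswith('B')).count()
    print(numAs, numBs)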

Calling another custom Python function from Pyspark UDF

点点圈 submitted on 2020-05-15 02:51:04
Question: Suppose you have a file, let's call it udfs.py, and in it:

    def nested_f(x):
        return x + 1

    def main_f(x):
        return nested_f(x) + 1

You then want to make a UDF out of the main_f function and run it on a dataframe:

    import pyspark.sql.functions as fn
    import pandas as pd

    pdf = pd.DataFrame([[1], [2], [3]], columns=['x'])
    df = spark.createDataFrame(pdf)

    _udf = fn.udf(main_f, 'int')
    df.withColumn('x1', _udf(df['x'])).show()

This works OK if we do this from within the same file as where the two functions …
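
When the UDF is used from a different file or notebook, the executors also need udfs.py on their Python path, not just the driver. A common fix, sketched here assuming the file layout above, is to ship the module with addPyFile and import from it:

    import pyspark.sql.functions as fn

    # Make udfs.py available on every executor.
    spark.sparkContext.addPyFile('udfs.py')

    from udfs import main_f

    _udf = fn.udf(main_f, 'int')
    df.withColumn('x1', _udf(df['x'])).show()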

Spark: converting GMT time stamps to Eastern taking daylight savings into account

给你一囗甜甜゛ submitted on 2020-05-14 18:13:52
Question: I'm trying to convert a column of GMT timestamp strings into a column of timestamps in the Eastern time zone. I want to take daylight saving into account. My column of timestamp strings looks like this: '2017-02-01T10:15:21+00:00'. I figured out how to convert the string column into a timestamp in EST:

    from pyspark.sql import functions as F

    df2 = df1.withColumn('datetimeGMT', df1.myTimeColumnInGMT.cast('timestamp'))
    df3 = df2.withColumn('datetimeEST', F.from_utc_timestamp(df2.datetimeGMT, "EST"))
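
A sketch of the usual fix: "EST" is treated as a fixed UTC-5 offset, so daylight saving is ignored; passing a region-based zone ID such as "America/New_York" to from_utc_timestamp lets Spark switch between EST and EDT automatically:

    from pyspark.sql import functions as F

    df2 = df1.withColumn('datetimeGMT', df1.myTimeColumnInGMT.cast('timestamp'))
    # A region-based zone honours daylight saving transitions.
    df3 = df2.withColumn('datetimeEastern',
                         F.from_utc_timestamp(df2.datetimeGMT, 'America/New_York'))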

ModuleNotFoundError: No module named 'py4j'

早过忘川 submitted on 2020-05-14 08:43:09
Question: I installed Spark and I am running into problems loading the pyspark module into ipython. I'm getting the following error:

    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input-2-49d7c4e178f8> in <module>
    ----> 1 import pyspark

    /opt/spark/python/pyspark/__init__.py in <module>
         44
         45 from pyspark.conf import SparkConf
    ---> 46 from pyspark.context import SparkContext
         47 from pyspark.rdd import RDD
         48 from pyspark.files import SparkFiles

    /opt/spark/python/pyspark/context.py in …
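
One common workaround, a sketch assuming the /opt/spark install path visible in the traceback: let findspark put Spark's bundled pyspark and py4j onto sys.path before importing (pip install findspark first):

    import findspark

    # Adds $SPARK_HOME/python and the bundled py4j-*-src.zip to sys.path.
    findspark.init('/opt/spark')

    import pyspark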

Unable to import SparkContext

我与影子孤独终老i submitted on 2020-05-14 02:25:51
Question: I'm working on CentOS. I've set up $SPARK_HOME and also added the path to bin in $PATH. I can run pyspark from anywhere. But when I try to create a Python file and use this statement:

    from pyspark import SparkConf, SparkContext

it throws the following error:

    python pysparktask.py
    Traceback (most recent call last):
      File "pysparktask.py", line 1, in <module>
        from pyspark import SparkConf, SparkContext
    ModuleNotFoundError: No module named 'pyspark'

I tried to install it again using pip: pip install …
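
A sketch of the usual fix when the pyspark shell works but plain python cannot import the package: put $SPARK_HOME/python and the bundled py4j zip on sys.path (or export them via PYTHONPATH), or simply pip-install pyspark into that interpreter. The zip name below is globbed because its version differs between Spark releases:

    import glob
    import os
    import sys

    spark_home = os.environ.get('SPARK_HOME', '/opt/spark')  # adjust to your install
    sys.path.insert(0, os.path.join(spark_home, 'python'))
    sys.path.insert(0, glob.glob(os.path.join(spark_home, 'python', 'lib', 'py4j-*-src.zip'))[0])

    from pyspark import SparkConf, SparkContext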