pyspark

Converting epoch to datetime in PySpark data frame using udf

若如初见. Submitted on 2020-06-25 04:03:11
Question: I have a PySpark dataframe with this schema:

root
 |-- epoch: double (nullable = true)
 |-- var1: double (nullable = true)
 |-- var2: double (nullable = true)

where epoch is in seconds and should be converted to a datetime. To do so, I define a user-defined function (udf) as follows:

from pyspark.sql.functions import udf
import time

def epoch_to_datetime(x):
    return time.localtime(x)
    # return time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x))
    # return x * 0 + 1

epoch_to_datetime_udf =
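One common way to avoid the Python udf entirely is the built-in from_unixtime function, which runs on the executors. A minimal sketch, assuming the column names from the schema above:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_unixtime

spark = SparkSession.builder.appName("epoch-demo").getOrCreate()

# Toy data matching the schema in the question (epoch in seconds).
df = spark.createDataFrame(
    [(1500000000.0, 1.0, 2.0), (1600000000.0, 3.0, 4.0)],
    ["epoch", "var1", "var2"],
)

# from_unixtime expects seconds; cast the double epoch to long first.
df = df.withColumn(
    "datetime", from_unixtime(col("epoch").cast("long"), "yyyy-MM-dd HH:mm:ss")
)
df.show(truncate=False)
```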

Pyspark filter using startswith from list

有些话、适合烂在心里 Submitted on 2020-06-24 07:44:33
Question: I have a list of elements that may appear at the start of strings recorded in an RDD. If I have an element list of yes and no, they should match yes23 and no3 but not 35yes or 41no. Using PySpark, how can I use startswith with any element in a list or tuple? An example DF would be:

+-----+------+
|index| label|
+-----+------+
|    1|yes342|
|    2| 45yes|
|    3| no123|
|    4|  75no|
+-----+------+

When I try:

Element_List = ['yes', 'no']
filter_DF = DF.where(DF.label.startswith(tuple(Element_List)))

The
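Column.startswith takes a single string, not a tuple, so one common fix is to build one condition per prefix and OR them together. A minimal sketch, assuming the DF and column names from the example above:

```python
from functools import reduce
from operator import or_

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("startswith-demo").getOrCreate()

DF = spark.createDataFrame(
    [(1, "yes342"), (2, "45yes"), (3, "no123"), (4, "75no")],
    ["index", "label"],
)

Element_List = ["yes", "no"]

# One startswith condition per prefix, combined with a logical OR into a
# single filter expression.
condition = reduce(or_, [DF.label.startswith(prefix) for prefix in Element_List])
filter_DF = DF.where(condition)
filter_DF.show()
```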

How do you automate pyspark jobs on emr using boto3 (or otherwise)?

吃可爱长大的小学妹 Submitted on 2020-06-24 04:51:08
Question: I am creating a job to parse massive amounts of server data and then upload it into a Redshift database. My job flow is as follows:

1. Grab the log data from S3.
2. Use either Spark dataframes or Spark SQL to parse the data and write it back out to S3.
3. Upload the data from S3 to Redshift.

I'm getting hung up on how to automate this, though, so that my process spins up an EMR cluster, bootstraps the correct programs for installation, and runs my Python script that will contain the code for parsing and
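One way to drive this from boto3 is the EMR run_job_flow API, which can create a cluster, run a spark-submit step, and terminate when the step finishes. A minimal sketch; the region, bucket, script path, instance types, and IAM role names are placeholders:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

response = emr.run_job_flow(
    Name="parse-server-logs",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Terminate the cluster automatically once all steps finish.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "Parse logs with PySpark",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/scripts/parse_logs.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```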

PySpark 2.0 The size or shape of a DataFrame

会有一股神秘感。 Submitted on 2020-06-24 03:03:30
Question: I am trying to find out the size/shape of a DataFrame in PySpark. I do not see a single function that can do this. In Python I can do data.shape(). Is there a similar function in PySpark? This is my current solution, but I am looking for a more elegant one:

row_number = data.count()
column_number = len(data.dtypes)

The computation of the number of columns is not ideal...

Answer 1: print((df.count(), len(df.columns)))

Answer 2: Use df.count() to get the number of rows.

Answer 3: Add this to your code: def
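One common pattern, not necessarily the helper that Answer 3 truncates, is a small function that mirrors pandas' DataFrame.shape. A minimal sketch:

```python
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.appName("shape-demo").getOrCreate()

def spark_shape(df: DataFrame) -> tuple:
    """Return (row_count, column_count), mirroring pandas' DataFrame.shape."""
    return (df.count(), len(df.columns))

data = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
print(spark_shape(data))  # (2, 2)
```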

DF.toPandas() throwing error in pyspark

假装没事ソ Submitted on 2020-06-23 15:59:54
Question: I am running a huge text file using PyCharm and PySpark. This is what I am trying to do:

spark_home = os.environ.get('SPARK_HOME', None)
os.environ["SPARK_HOME"] = "C:\spark-2.3.0-bin-hadoop2.7"
import pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
sc = SparkContext(conf=conf)
spark = SparkSession.builder.config(conf=conf).getOrCreate()

import pandas as pd
ip = spark.read.format("csv").option("inferSchema","true").option("header",
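toPandas() collects the entire DataFrame onto the driver, so with a huge file it usually fails with driver memory errors. A minimal sketch of common mitigations for Spark 2.3; the memory size and file path are placeholders:

```python
from pyspark.sql import SparkSession

# Give the driver more memory and enable Arrow (Spark 2.3+), which makes the
# toPandas() conversion faster and lighter on driver memory.
spark = (
    SparkSession.builder
    .appName("topandas-demo")
    .config("spark.driver.memory", "4g")                    # placeholder size
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)

ip = (
    spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .load("path/to/huge_file.csv")                          # placeholder path
)

# Only convert what actually fits on the driver, e.g. a limited slice.
preview = ip.limit(1000).toPandas()
print(preview.head())
```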

How to see the dataframe in the console (equivalent of .show() for structured streaming)?

白昼怎懂夜的黑 Submitted on 2020-06-17 13:35:08
Question: I'm trying to see what's coming in as my DataFrame. Here is the Spark code:

from pyspark.sql import SparkSession
import pyspark.sql.functions as psf
import logging
import time

spark = SparkSession \
    .builder \
    .appName("Console Example") \
    .getOrCreate()

logging.info("started to listen to the host..")

lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "127.0.0.1") \
    .option("port", 9999) \
    .load()

data = lines.selectExpr("CAST(value AS STRING)")
query1 = data.writeStream.format
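The streaming equivalent of .show() is the console sink, which prints each micro-batch to stdout. A minimal sketch picking up from the snippet above (same host and port):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Console Example").getOrCreate()

lines = (
    spark.readStream.format("socket")
    .option("host", "127.0.0.1")
    .option("port", 9999)
    .load()
)
data = lines.selectExpr("CAST(value AS STRING)")

# The console sink prints every micro-batch, which is the structured-streaming
# equivalent of calling .show() on a static DataFrame.
query = (
    data.writeStream
    .outputMode("append")
    .format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination()
```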

Pyspark to Netezza write issue: failed to create external table for bulk load

此生再无相见时 Submitted on 2020-06-17 13:26:13
Question: While writing from PySpark to Netezza I'm constantly getting the following error (the issue appears as the size of the dataframe increases; there are no issues appending small dataframes with 10 to 60 rows):

org.netezza.error.NzSQLException: failed to create external table for bulk load
    at org.netezza.sql.NzPreparedStatament.executeBatch(NzPreparedStatament.java:1140)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:667)
    at org.apache.spark.sql.execution
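The exception is raised inside the Netezza JDBC driver's bulk-load path, but the write that triggers it is a standard Spark JDBC append. A minimal sketch of that write with the generic Spark JDBC knobs (batchsize, partition count) that are commonly tuned when large JDBC writes start failing; the URL, table, credentials, and driver class are placeholders for illustration, not taken from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("netezza-write-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Placeholder connection details; substitute the real host, database, and table.
netezza_url = "jdbc:netezza://netezza-host:5480/MYDB"

(
    df.repartition(4)                # fewer concurrent writers against Netezza
    .write.format("jdbc")
    .option("url", netezza_url)
    .option("dbtable", "MY_SCHEMA.MY_TABLE")
    .option("user", "my_user")
    .option("password", "my_password")
    .option("driver", "org.netezza.Driver")
    .option("batchsize", 10000)      # standard Spark JDBC option: rows per batch
    .mode("append")
    .save()
)
```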

How to create a Spark data frame from a Pandas data frame using Snowflake and Python?

为君一笑 Submitted on 2020-06-17 13:17:07
Question: I have a SQL query stored in a variable in Python, and we use a Snowflake database. First I converted it to a Pandas data frame using the SQL, but I need to convert it to a Spark data frame and then store it with createOrReplaceTempView. I tried:

import pandas as pd
import sf_connectivity  # (we have code for establishing a connection with the Snowflake database)

emp = 'Select * From Employee'
snowflake_connection = sf_connectivity.collector()  # (a method to establish the Snowflake connection)
pd_df = pd.read
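A minimal sketch of the conversion step, assuming the pandas frame has already been loaded; sf_connectivity is the asker's own helper, so a hard-coded pandas frame stands in for the Snowflake query result here:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("snowflake-to-spark-demo").getOrCreate()

# Stand-in for the frame returned by something like pd.read_sql(emp, snowflake_connection).
pd_df = pd.DataFrame({"EMP_ID": [1, 2], "EMP_NAME": ["Alice", "Bob"]})

# spark.createDataFrame accepts a pandas DataFrame directly.
spark_df = spark.createDataFrame(pd_df)
spark_df.createOrReplaceTempView("employee")

spark.sql("SELECT * FROM employee").show()
```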

Create PySpark dataframe: sequence of months with year

☆樱花仙子☆ Submitted on 2020-06-17 13:02:06
Question: Complete newbie here. I would like to create a dataframe using PySpark that lists month and year, taking the current date and listing x number of lines. If I decide x=5, the dataframe should look as below:

Calendar_Entry
August 2019
September 2019
October 2019
November 2019
December 2019

Answer 1: Spark is not a tool for generating rows in a distributed way but rather for processing them once distributed. Since your data is small anyway, the best solution is probably to create the data
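A minimal sketch of one way to build such a frame, generating x rows and formatting the shifted dates with Spark's date functions (x is a placeholder; the output depends on the current date):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format, expr

spark = SparkSession.builder.appName("calendar-demo").getOrCreate()

x = 5  # number of rows to generate

# One row per offset 0..x-1; shift the current date by that many months and
# render it as "MonthName Year".
df = (
    spark.range(x)
    .withColumn("month_start", expr("add_months(current_date(), CAST(id AS int))"))
    .select(date_format("month_start", "MMMM yyyy").alias("Calendar_Entry"))
)
df.show(truncate=False)
```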