pyspark

How to load big double numbers in a PySpark DataFrame and persist it back without changing the numeric format to scientific notation or precision?

五迷三道 submitted on 2020-12-15 07:18:10
Question: I have a CSV like this:

```
COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123
```

I want to load it with the column VAL as a numeric type (due to other requirements of the project) and then persist it back to another CSV with the structure below:

```
+-----+------------------+
|  COL|               VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2|    200000000.1234|
|TEST3|   9999.1234679123|
+-----+------------------+
```

The problem I'm facing is that whenever I load it, the numbers …
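A minimal sketch of one approach, assuming the file is named input.csv and that reading VAL as DecimalType (instead of DoubleType) is acceptable: a decimal column preserves the written digits and is rendered in plain positional notation when persisted, whereas a double of this magnitude is typically printed in scientific notation. The precision and scale below are assumptions sized for the sample values; note that a fixed scale pads shorter values with trailing zeros.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

spark = SparkSession.builder.getOrCreate()

# DecimalType keeps the literal digits; DoubleType would round them in binary
# and print large values as e.g. 1.0000000012345679E8.
schema = StructType([
    StructField("COL", StringType()),
    StructField("VAL", DecimalType(precision=38, scale=10)),
])

df = spark.read.csv("input.csv", header=True, schema=schema)
df.show()

df.write.csv("output_dir", header=True, mode="overwrite")
```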

Spark org.apache.http.ConnectionClosedException when calling .show() and .toPandas() with an S3 dataframe

不打扰是莪最后的温柔 submitted on 2020-12-15 06:39:45
Question: I created a PySpark DataFrame df from Parquet data on AWS S3. Calling df.count() works, but df.show() or df.toPandas() fails with the following error:

```
Py4JJavaError: An error occurred while calling o41.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 14, 10.20.202.97, executor driver): org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited …
```
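The excerpt ends before any answer. A premature connection close like this from the S3A connector is often attacked by tuning the S3A HTTP settings; the sketch below uses standard Hadoop S3A options, but treating them as the fix for this particular failure is an assumption, and the bucket path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Larger HTTP connection pool for the S3A client.
    .config("spark.hadoop.fs.s3a.connection.maximum", "100")
    # 'sequential' optimizes for whole-file streaming reads rather than ranged reads.
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "sequential")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/path/to/data/")
df.show(5)
```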

Load dataframe from pyspark

倾然丶 夕夏残阳落幕 submitted on 2020-12-15 05:23:56
Question: I am trying to connect to an MS SQL database from PySpark using spark.read.jdbc:

```python
import os
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

df = spark.read \
    .format('jdbc') \
    .option('url', 'jdbc:sqlserver://local:1433') \
    .option('user', 'sa') \
    .option('password', '12345') \
    .option('dbtable', '(select COL1, COL2 from tbl1 WHERE COL1 = 2)')
```

then I do df …
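The excerpt stops before .load() is called, so the chain above still returns a DataFrameReader rather than a DataFrame. A sketch of a read that does materialize a DataFrame, keeping the same connection details; the notable additions are the explicit JDBC driver class and the alias on the subquery, since SQL Server requires derived tables to be named:

```python
df = (
    spark.read.format('jdbc')
    .option('url', 'jdbc:sqlserver://local:1433')
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver')
    .option('user', 'sa')
    .option('password', '12345')
    # A subquery passed as dbtable must carry an alias.
    .option('dbtable', '(select COL1, COL2 from tbl1 WHERE COL1 = 2) as t')
    .load()
)
df.show()
```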

Writing custom condition inside .withColumn in Pyspark

巧了我就是萌 submitted on 2020-12-15 03:39:51
Question: I have to add a customized condition, involving many columns, inside .withColumn. My scenario is roughly this: I have to check many columns row-wise for null values and add the names of the null columns to a new column. My code looks somewhat like this:

```python
df = df.withColumn("MissingColumns",
                   array(
                       when(col("firstName").isNull(), lit("firstName")),
                       when(col("salary").isNull(), lit("salary"))))
```

The problem is that I have many columns to include in the condition, so I tried to customize it using …
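One way to avoid spelling out every column by hand is to build the list of when expressions programmatically and unpack it into array. A sketch, where columns_to_check is a hypothetical list standing in for the real set of column names:

```python
from pyspark.sql import functions as F

columns_to_check = ["firstName", "salary"]  # hypothetical; use the real column names

# One when(...) expression per column; columns that are not null contribute a null entry.
missing_exprs = [F.when(F.col(c).isNull(), F.lit(c)) for c in columns_to_check]

df = df.withColumn("MissingColumns", F.array(*missing_exprs))

# Optionally strip the null entries so only the missing column names remain
# (the `filter` higher-order function needs Spark 2.4+).
df = df.withColumn("MissingColumns",
                   F.expr("filter(MissingColumns, x -> x IS NOT NULL)"))
```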

Error when installing python-snappy in PyCharm

柔情痞子 submitted on 2020-12-13 21:04:45
Question: I have a '.snappy.parquet' file and I want to view its contents. I know I can use pandas or PySpark, but this is beyond my knowledge and I'm not sure what to do; can someone help me please? I've been struggling for almost a day now. Many thanks. (And if I can't fix this issue, do I have other options to convert this file to a readable format?)

Answer 1: This issue has been solved by using the approach here: Can't install python-snappy wheel in Pycharm

Answer 2: You need snappy library …
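If installing python-snappy keeps failing, one alternative worth noting is that pandas with the pyarrow engine can usually read Snappy-compressed Parquet on its own, since Arrow ships its own Snappy codec. A sketch, with a hypothetical file name:

```python
import pandas as pd

# Requires pyarrow (pip install pyarrow); python-snappy is not needed on this path.
df = pd.read_parquet("data.snappy.parquet", engine="pyarrow")
print(df.head())
```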

pySpark withColumn with a function

社会主义新天地 submitted on 2020-12-13 18:49:53
Question: I have a dataframe with two columns, account_id and email_address. Now I want to add one more column, updated_email_address, by calling a function on email_address to compute the updated value. Here is my code:

```python
def update_email(email):
    print("== email to be updated: " + email)
    today = datetime.date.today()
    updated = substring(email, -8, 8) + str(today.strftime('%m')) + str(today.strftime('%d')) + "_updated"
    return updated

df.withColumn('updated_email_address', update_email(df …
```
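As written, update_email is a plain Python function being handed a Column, and substring (from pyspark.sql.functions) builds a Column expression rather than operating on a Python string, so mixing the two will not behave as intended. A sketch of one way to run this per-row logic, by registering the function as a UDF; the slicing and date-suffix logic simply mirrors the intent of the original and is an assumption:

```python
import datetime

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def update_email(email):
    if email is None:
        return None
    today = datetime.date.today()
    # Plain Python slicing stands in for substring(email, -8, 8): the UDF
    # receives an ordinary string, not a Column.
    return email[-8:] + today.strftime('%m') + today.strftime('%d') + "_updated"

df = df.withColumn('updated_email_address', update_email(F.col('email_address')))
```

The same transformation could also be expressed with built-in column functions to avoid UDF serialization overhead, but the UDF form stays closest to the original function.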