pyspark

How to load big double numbers in a PySpark DataFrame and persist it back without changing the numeric format to scientific notation or precision?

五迷三道 submitted on 2020-12-15 07:18:10
Question: I have a CSV like this:

```
COL,VAL
TEST,100000000.12345679
TEST2,200000000.1234
TEST3,9999.1234679123
```

I want to load it with the column VAL as a numeric type (due to other requirements of the project) and then persist it back to another CSV with the structure below:

```
+-----+------------------+
|  COL|               VAL|
+-----+------------------+
| TEST|100000000.12345679|
|TEST2|    200000000.1234|
|TEST3|   9999.1234679123|
+-----+------------------+
```

The problem I'm facing is that whenever I load it, the numbers …
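A minimal sketch of one approach, assuming the file is named input.csv and that reading VAL as DecimalType (instead of DoubleType) is acceptable: a decimal column preserves the written digits and is rendered in plain positional notation when persisted, whereas a double of this magnitude is typically printed in scientific notation. The precision and scale below are assumptions sized for the sample values; note that a fixed scale pads shorter values with trailing zeros.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DecimalType

spark = SparkSession.builder.getOrCreate()

# DecimalType keeps the literal digits; DoubleType would round them in binary
# and print large values as e.g. 1.0000000012345679E8.
schema = StructType([
    StructField("COL", StringType()),
    StructField("VAL", DecimalType(precision=38, scale=10)),
])

df = spark.read.csv("input.csv", header=True, schema=schema)
df.show()

df.write.csv("output_dir", header=True, mode="overwrite")
```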

Spark org.apache.http.ConnectionClosedException when calling .show() and .toPandas() with an S3 dataframe

不打扰是莪最后的温柔 submitted on 2020-12-15 06:39:45
Question: I created a PySpark DataFrame df from Parquet data on AWS S3. Calling df.count() works, but df.show() or df.toPandas() fails with the following error:

```
Py4JJavaError: An error occurred while calling o41.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8.0 (TID 14, 10.20.202.97, executor driver): org.apache.http.ConnectionClosedException: Premature end of Content-Length delimited …
```
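The excerpt ends before any answer. A premature connection close like this from the S3A connector is often attacked by tuning the S3A HTTP settings; the sketch below uses standard Hadoop S3A options, but treating them as the fix for this particular failure is an assumption, and the bucket path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Larger HTTP connection pool for the S3A client.
    .config("spark.hadoop.fs.s3a.connection.maximum", "100")
    # 'sequential' optimizes for whole-file streaming reads rather than ranged reads.
    .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "sequential")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/path/to/data/")
df.show(5)
```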

Load dataframe from pyspark

倾然丶 夕夏残阳落幕 submitted on 2020-12-15 05:23:56
Question: I am trying to connect to an MS SQL database from PySpark using spark.read.jdbc:

```python
import os
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

df = spark.read \
    .format('jdbc') \
    .option('url', 'jdbc:sqlserver://local:1433') \
    .option('user', 'sa') \
    .option('password', '12345') \
    .option('dbtable', '(select COL1, COL2 from tbl1 WHERE COL1 = 2)')
```

then I do df …
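The excerpt stops before .load() is called, so the chain above still returns a DataFrameReader rather than a DataFrame. A sketch of a read that does materialize a DataFrame, keeping the same connection details; the notable additions are the explicit JDBC driver class and the alias on the subquery, since SQL Server requires derived tables to be named:

```python
df = (
    spark.read.format('jdbc')
    .option('url', 'jdbc:sqlserver://local:1433')
    .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver')
    .option('user', 'sa')
    .option('password', '12345')
    # A subquery passed as dbtable must carry an alias.
    .option('dbtable', '(select COL1, COL2 from tbl1 WHERE COL1 = 2) as t')
    .load()
)
df.show()
```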

Writing custom condition inside .withColumn in Pyspark

巧了我就是萌 submitted on 2020-12-15 03:39:51
Question: I have to add a customized condition, involving many columns, inside .withColumn. My scenario is roughly this: I have to check many columns row-wise for null values and add the names of the null columns to a new column. My code looks somewhat like this:

```python
df = df.withColumn("MissingColumns",
                   array(
                       when(col("firstName").isNull(), lit("firstName")),
                       when(col("salary").isNull(), lit("salary"))))
```

The problem is that I have many columns to include in the condition, so I tried to customize it using …
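One way to avoid spelling out every column by hand is to build the list of when expressions programmatically and unpack it into array. A sketch, where columns_to_check is a hypothetical list standing in for the real set of column names:

```python
from pyspark.sql import functions as F

columns_to_check = ["firstName", "salary"]  # hypothetical; use the real column names

# One when(...) expression per column; columns that are not null contribute a null entry.
missing_exprs = [F.when(F.col(c).isNull(), F.lit(c)) for c in columns_to_check]

df = df.withColumn("MissingColumns", F.array(*missing_exprs))

# Optionally strip the null entries so only the missing column names remain
# (the `filter` higher-order function needs Spark 2.4+).
df = df.withColumn("MissingColumns",
                   F.expr("filter(MissingColumns, x -> x IS NOT NULL)"))
```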

Error when installing python-snappy in PyCharm

柔情痞子 submitted on 2020-12-13 21:04:45
Question: I have a '.snappy.parquet' file and I want to view its contents. I know I can use pandas or PySpark, but this is beyond my knowledge and I'm not sure what to do; can someone help me please? I've been struggling for almost a day now. Many thanks. (And if I can't fix this issue, do I have other options to convert this file to a readable format?)

Answer 1: This issue has been solved by using the approach here: Can't install python-snappy wheel in Pycharm

Answer 2: You need snappy library …
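If installing python-snappy keeps failing, one alternative worth noting is that pandas with the pyarrow engine can usually read Snappy-compressed Parquet on its own, since Arrow ships its own Snappy codec. A sketch, with a hypothetical file name:

```python
import pandas as pd

# Requires pyarrow (pip install pyarrow); python-snappy is not needed on this path.
df = pd.read_parquet("data.snappy.parquet", engine="pyarrow")
print(df.head())
```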

pySpark withColumn with a function

社会主义新天地 submitted on 2020-12-13 18:49:53
Question: I have a dataframe with two columns, account_id and email_address. Now I want to add one more column, updated_email_address, by calling a function on email_address to compute the updated value. Here is my code:

```python
def update_email(email):
    print("== email to be updated: " + email)
    today = datetime.date.today()
    updated = substring(email, -8, 8) + str(today.strftime('%m')) + str(today.strftime('%d')) + "_updated"
    return updated

df.withColumn('updated_email_address', update_email(df …
```
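As written, update_email is a plain Python function being handed a Column, and substring (from pyspark.sql.functions) builds a Column expression rather than operating on a Python string, so mixing the two will not behave as intended. A sketch of one way to run this per-row logic, by registering the function as a UDF; the slicing and date-suffix logic simply mirrors the intent of the original and is an assumption:

```python
import datetime

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def update_email(email):
    if email is None:
        return None
    today = datetime.date.today()
    # Plain Python slicing stands in for substring(email, -8, 8): the UDF
    # receives an ordinary string, not a Column.
    return email[-8:] + today.strftime('%m') + today.strftime('%d') + "_updated"

df = df.withColumn('updated_email_address', update_email(F.col('email_address')))
```

The same transformation could also be expressed with built-in column functions to avoid UDF serialization overhead, but the UDF form stays closest to the original function.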