pyspark

Implicit schema for pandas_udf in PySpark?

◇◆丶佛笑我妖孽 submitted on 2021-01-27 17:31:09
Question: This answer nicely explains how to use pyspark's groupby and pandas_udf to do custom aggregations. However, I cannot possibly declare my schema manually as shown in this part of the example:

    from pyspark.sql.types import *

    schema = StructType([
        StructField("key", StringType()),
        StructField("avg_min", DoubleType())
    ])

since I will be returning 100+ columns with names that are automatically generated. Is there any way to tell PySpark to just implicitly use the schema returned by my function and …
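
The grouped-map pandas_udf API does require an explicit return schema, but the schema can be built programmatically rather than written out by hand. A minimal sketch, assuming the 100+ generated columns are all doubles and their names can be computed up front; the column names and the placeholder aggregation are illustrative, not from the question:

    import pandas as pd
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # hypothetical auto-generated column names
    generated_cols = ["stat_{}".format(i) for i in range(100)]

    # build the StructType from the name list instead of typing it manually
    schema = StructType(
        [StructField("key", StringType())]
        + [StructField(c, DoubleType()) for c in generated_cols]
    )

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def my_agg(pdf):
        # pdf is one group as a pandas DataFrame; the returned frame's
        # columns must match `schema` by name and type
        out = {"key": [pdf["key"].iloc[0]]}
        out.update({c: [0.0] for c in generated_cols})  # placeholder values
        return pd.DataFrame(out)

    # usage: df.groupby("key").apply(my_agg)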

PySpark UDF optimization challenge

◇◆丶佛笑我妖孽 submitted on 2021-01-27 15:01:14
Question: I am trying to optimize the code below. When run with 1000 lines of data it takes about 12 minutes to complete. Our use case would require data sizes of around 25K to 50K rows, which would make this implementation completely infeasible.

    import pyspark.sql.types as Types
    import numpy
    import spacy
    from pyspark.sql.functions import udf

    inputPath = "s3://myData/part-*.parquet"
    df = spark.read.parquet(inputPath)
    test_df = df.select('uid', 'content').limit(1000).repartition(10)
    # print(df.rdd …
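
The excerpt cuts off before the UDF itself, but a common way to speed up per-row NLP UDFs like this is to load the spaCy model once per partition instead of once per row. A sketch of that pattern, reusing test_df from the excerpt (this is not the asker's code; the model name and the lemma extraction are placeholders):

    import spacy

    def process_partition(rows):
        # load the model once for the whole partition, not once per row
        nlp = spacy.load("en_core_web_sm")
        for row in rows:
            doc = nlp(row["content"])
            yield (row["uid"], [tok.lemma_ for tok in doc])

    result_df = (
        test_df
        .rdd
        .mapPartitions(process_partition)
        .toDF(["uid", "lemmas"])
    )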

PySpark: cx_Oracle.InterfaceError: not a query

混江龙づ霸主 submitted on 2021-01-27 13:57:24
Question: I need to perform an update query in a Spark job. I am trying the code below, but facing issues.

    import cx_Oracle

    def query(sql):
        connection = cx_Oracle.connect("username/password@s<url>/db")
        cursor = connection.cursor()
        cursor.execute(sql)
        result = cursor.fetchall()
        return result

    v = [10]
    rdd = sc.parallelize(v).coalesce(1)
    rdd.foreachPartition(lambda x: [query("UPDATE db.table SET MAPPERS = " + str(i) + " WHERE TABLE_NAME = 'table_name'") for i in x])

When I execute the above process I am getting the below …
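
For what it's worth, cx_Oracle raises "InterfaceError: not a query" when fetchall() is called after a statement that produces no result set, which is exactly what an UPDATE does. A minimal sketch of a DML-oriented variant of the helper that commits instead of fetching (the connection string is kept as a placeholder):

    import cx_Oracle

    def execute_dml(sql):
        # UPDATE/INSERT/DELETE return no rows, so there is nothing to fetch;
        # commit the transaction instead of calling fetchall()
        connection = cx_Oracle.connect("username/password@<host>/db")
        try:
            cursor = connection.cursor()
            cursor.execute(sql)
            connection.commit()
            return cursor.rowcount
        finally:
            connection.close()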

Matplotlib in Jupyter results in variable “is not defined”

[亡魂溺海] submitted on 2021-01-27 08:31:59
Question: I'm having a strange issue using Jupyter to plot some simple data. There is a lot of nuance to my specific use case, not the least of which is a Jupyter notebook connected to our cloud-based Spark cluster with a PySpark kernel. I can't, for the life of me, figure out why this simple code will not run without error. In reality I have to have the code set up like this, because instead of "x" and "y" I'm dealing with a data frame sourced from a Hive query, using the %sql magic and manipulating …
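
One plausible explanation for this setup: with a Sparkmagic PySpark kernel, ordinary cells run on the remote cluster via Livy, not in the local notebook process, so a variable that exists in one context is "not defined" in the other. A hedged sketch of the usual split into two cells, assuming Sparkmagic's %%sql -o and %%local magics are available (the table and column names are placeholders):

    Cell 1 (runs on the cluster; -o copies the result to the local notebook as a pandas DataFrame named pdf):

        %%sql -o pdf
        SELECT x, y FROM my_table

    Cell 2 (runs in the local notebook process, where matplotlib and pdf both exist):

        %%local
        import matplotlib.pyplot as plt
        plt.plot(pdf["x"], pdf["y"])
        plt.show()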

Azure Databricks to Azure SQL DW: Long text columns

﹥>﹥吖頭↗ submitted on 2021-01-27 08:21:53
Question: I would like to populate an Azure SQL DW from an Azure Databricks notebook environment. I am using the built-in connector with pyspark:

    sdf.write \
        .format("com.databricks.spark.sqldw") \
        .option("forwardSparkAzureStorageCredentials", "true") \
        .option("dbTable", "test_table") \
        .option("url", url) \
        .option("tempDir", temp_dir) \
        .save()

This works fine, but I get an error when I include a string column with sufficiently long content. I get the following error: Py4JJavaError: An error …
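
One knob worth checking here: the com.databricks.spark.sqldw connector maps Spark StringType to a fairly short NVARCHAR by default and, to the best of my recollection of its options, accepts a maxStrLength setting to widen the generated column. A hedged sketch, otherwise identical to the write above (sdf, url, and temp_dir are reused from the excerpt):

    sdf.write \
        .format("com.databricks.spark.sqldw") \
        .option("forwardSparkAzureStorageCredentials", "true") \
        .option("dbTable", "test_table") \
        .option("url", url) \
        .option("tempDir", temp_dir) \
        .option("maxStrLength", "4000") \
        .save()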

How can I see the SQL statements that Spark sends to my database?

末鹿安然 submitted on 2021-01-27 06:16:46
Question: I have a Spark cluster and a Vertica database. I use spark.read.jdbc( # etc. ) to load Spark dataframes into the cluster. When I do a certain groupby function

    df2 = df.groupby('factor').agg(F.stddev('sum(PnL)'))
    df2.show()

I then get a Vertica syntax exception:

    Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler …
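
Two hedged ways to see what actually reaches the database: ask Spark what it plans to push down, and ask the database what it received. A sketch (the Vertica system table name is from memory and worth double-checking):

    # 1) inspect the physical plan; the JDBC relation and any pushed-down
    #    filters or aggregates show up here
    df2 = df.groupby('factor').agg(F.stddev('sum(PnL)'))
    df2.explain(True)

    # 2) check the database's own query log, e.g. directly in Vertica:
    #    SELECT request FROM v_monitor.query_requests
    #    ORDER BY start_timestamp DESC LIMIT 20;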

GCP Dataproc custom image Python environment

你离开我真会死。 submitted on 2021-01-27 05:40:23
Question: I have an issue with a Dataproc custom image and PySpark. My custom image is based on Dataproc 1.4.1-debian9, and in my initialisation script I install python3 and some packages from a requirements.txt file, then set the python3 env variable to force pyspark to use python3. But when I submit a job on a cluster created with this image (with the single-node flag for simplicity), the job can't find the installed packages. If I log on to the cluster machine and run the pyspark command, it starts …
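
A quick way to check whether the submitted job is even using the same interpreter as the interactive pyspark shell (which would explain the missing packages) is to print the executable on the driver and on the executors. A small diagnostic sketch to run as a Dataproc job:

    import sys
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # interpreter used by the driver
    print("driver python:", sys.executable)

    # interpreter(s) used by the executors
    print("executor python:",
          sc.parallelize(range(sc.defaultParallelism))
            .map(lambda _: __import__("sys").executable)
            .distinct()
            .collect())

If they differ, the usual suspects are PYSPARK_PYTHON / spark.pyspark.python being set in the interactive shell's environment but not in the environment the Dataproc job agent uses when it launches submitted jobs.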

Pyspark simple re-partition and toPandas() fails to finish on just 600,000+ rows

痴心易碎 submitted on 2021-01-27 04:08:01
Question: I have JSON data that I am reading into a data frame with several fields, repartitioning it based on two columns, and converting to Pandas. This job keeps failing on EMR on just 600,000 rows of data with some obscure errors. I have also increased the memory settings of the Spark driver and still don't see any resolution. Here is my pyspark code:

    enhDataDf = (
        sqlContext
        .read.json(sys.argv[1])
    )

    enhDataDf = (
        enhDataDf
        .repartition('column1', 'column2')
        .toPandas()
    )

    enhDataDf = sqlContext …
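
Worth noting regardless of the exact error: toPandas() collects the entire dataset into the driver's memory, so the repartition() in front of it does not reduce what the driver has to hold. A hedged sketch of two mitigations, Arrow-based conversion (Spark 2.3+) and shrinking the frame before converting; it reuses the excerpt's sqlContext, and the column names are from the excerpt / illustrative:

    import sys

    # cheaper serialization for toPandas() on Spark 2.3/2.4
    sqlContext.setConf("spark.sql.execution.arrow.enabled", "true")

    enhDataDf = sqlContext.read.json(sys.argv[1])

    # keep only what pandas actually needs before collecting to the driver
    pdf = enhDataDf.select('column1', 'column2').toPandas()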

Calling scala code in pyspark for XSLT transformations

ぃ、小莉子 submitted on 2021-01-27 02:48:17
Question: This might be a long shot, but I figured it couldn't hurt to ask. I'm attempting to use Elsevier's open-sourced spark-xml-utils package in pyspark to transform some XML records with XSLT. I've had a bit of success with some exploratory code getting a transformation to work:

    # open XSLT processor from spark's jvm context
    with open('/tmp/foo.xsl', 'r') as f:
        proc = sc._jvm.com.elsevier.spark_xml_utils.xslt.XSLTProcessor.getInstance(f.read())

    # transform XML record with 'proc'
    with open('/tmp/bar …
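
The excerpt stops mid-line, but on the driver the JVM-side processor can be exercised directly. In the sketch below, the transform method taking the XML as a string is my recollection of the spark-xml-utils API and should be checked against its README, and the input filename is assumed since the excerpt is cut off. Note also that a py4j object like proc generally cannot be captured inside a Python UDF running on executors.

    # driver-side only: obtain the processor from the JVM and transform one record
    with open('/tmp/foo.xsl', 'r') as f:
        proc = sc._jvm.com.elsevier.spark_xml_utils.xslt.XSLTProcessor.getInstance(f.read())

    with open('/tmp/bar.xml', 'r') as f:   # filename assumed
        transformed = proc.transform(f.read())

    print(transformed)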