apache-spark

Working with a StructType column in PySpark UDF

Submitted by 老子叫甜甜 on 2021-02-11 15:02:32
Question: I have the following schema for one of the columns that I'm processing:

|-- time_to_resolution_remainingTime: struct (nullable = true)
|    |-- _links: struct (nullable = true)
|    |    |-- self: string (nullable = true)
|    |-- completedCycles: array (nullable = true)
|    |    |-- element: struct (containsNull = true)
|    |    |    |-- breached: boolean (nullable = true)
|    |    |    |-- elapsedTime: struct (nullable = true)
|    |    |    |    |-- friendly: string (nullable = true)
|    |    |    |    |-- millis: long (nullable = true)
|    |    |    |--
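The excerpt cuts off before any code, but a minimal sketch of passing such a struct column into a Python UDF could look like the following. The column and field names come from the schema above; structs arrive in a Python UDF as Row objects and arrays as lists, everything else here is an assumption:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Struct columns arrive in a Python UDF as Row objects; arrays arrive as lists.
@F.udf(returnType=StringType())
def first_cycle_friendly(remaining_time):
    if remaining_time is None or not remaining_time["completedCycles"]:
        return None
    return remaining_time["completedCycles"][0]["elapsedTime"]["friendly"]

df = df.withColumn(
    "first_cycle_friendly",
    first_cycle_friendly(F.col("time_to_resolution_remainingTime")),
)
```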

Pre-partition data in Spark such that each partition has non-overlapping values in the column we are partitioning on

Submitted by 99封情书 on 2021-02-11 15:01:23
Question: I'm trying to pre-partition the data before doing an aggregation operation across a certain column of my data. I have 3 worker nodes and I would like each partition to have non-overlapping values in the column I am partitioning on. I don't want situations where two partitions might have the same values in that column. For example, if I have the following data

ss_item_sk | ss_quantity
1          | 10.0
1          | 4.0
2          | 3.0
3          | 5.0
4          | 8.0
5          | 13.0
5          | 10.0

then the following partitions are satisfactory:
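The excerpt is cut off, but a minimal PySpark sketch of one way to get this property: hash-partitioning on the key column guarantees that every row with the same ss_item_sk lands in the same partition (several keys may share a partition, but no key is split across two). The column name comes from the example above; the rest is an assumption:

```python
# Repartition on the key column before aggregating: rows that share the same
# ss_item_sk are hashed to the same partition, so no value overlaps partitions.
partitioned = df.repartition(3, "ss_item_sk")   # 3 partitions, one per worker

# A subsequent groupBy on the same key then aggregates each key inside a
# single partition instead of shuffling it again.
result = partitioned.groupBy("ss_item_sk").sum("ss_quantity")
```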

PySpark EMR Conda issue

Submitted by 梦想与她 on 2021-02-11 14:46:28
Question: I am trying to run a Spark script on EMR with a custom conda env. I created a bootstrap for the conda setup and supplied it to EMR; I don't see any issues with the bootstrap, but when I do spark-submit it gives me the same error. Not sure what I am missing.

Traceback (most recent call last):
  File "/mnt/tmp/spark-b334133c-d22d-42d4-beba-b85fffbbc9c7/iris_cube_analysis.py", line 3, in <module>
    import iris
ImportError: No module named iris

spark-submit:

spark-submit --deploy-mode client --master yarn --conf
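The spark-submit command is truncated above, so the following is only a sketch of the usual culprit: the executors (and, in client mode, the driver) must be pointed at the conda environment's interpreter, otherwise import iris resolves against the system Python that lacks the package. The environment path is hypothetical:

```python
from pyspark.sql import SparkSession

# Hypothetical path created by the bootstrap action; adjust to wherever the
# conda env actually lives on the cluster nodes.
CONDA_PYTHON = "/home/hadoop/miniconda3/envs/iris_env/bin/python"

spark = (
    SparkSession.builder
    .appName("iris_cube_analysis")
    # Tells YARN which interpreter to launch for the executors. In client
    # mode the driver interpreter is whatever started this script, so also
    # export PYSPARK_PYTHON to the same path before calling spark-submit.
    .config("spark.pyspark.python", CONDA_PYTHON)
    .getOrCreate()
)
```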

Spark, how to print the query?

Submitted by 纵然是瞬间 on 2021-02-11 14:36:02
Question: I'm using pyspark:

df = self.sqlContext.read.option(
    "es.resource", indexes
).format("org.elasticsearch.spark.sql").load()

df = df.filter(
    df.data.timestamp >= self.period_start
)

I'd like to see the SQL-query version of df if possible, something like print(df.query), to see something like

select * from my-indexes where data.timestamp > self.period_start

Answer 1: You can check out this piece of documentation for pyspark.sql.DataFrame.explain. explain prints the (logical and physical) plan to the
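Following on from the answer, a minimal sketch of how explain is typically used here. Spark does not reconstruct a SQL string for a DataFrame; the closest you get is the plan, where the pushed-down Elasticsearch filter shows up in the data-source scan node:

```python
# extended=True prints the parsed, analyzed and optimized logical plans as
# well as the physical plan, rather than just the physical plan.
df.explain(True)
```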

Writing to a PostgreSQL [CIDR] column with Spark JDBC

Submitted by 不羁岁月 on 2021-02-11 14:34:14
Question: I'm trying to write a Spark 2.4.4 dataframe to PostgreSQL via JDBC. I'm using Scala.

batchDF.
  write.
  format("jdbc").
  option("url", "jdbc:postgresql://...").
  option("driver", "org.postgresql.Driver").
  option("dbtable", "traffic_info").
  option("user", "xxxx").
  option("password", "xxxx").
  mode(SaveMode.Append).
  save()

One of the fields (remote_prefix) is of CIDR type in my table but is StringType in my dataframe, so I cannot write it as-is:

ERROR: column "remote_prefix" is of type cidr but
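The error message is truncated above. A common workaround is to append stringtype=unspecified to the JDBC URL so the PostgreSQL driver sends strings as untyped literals and the server casts them into the cidr column itself. The original snippet is Scala, so the following is only a PySpark sketch of the same idea, with placeholder connection details:

```python
(batch_df.write
    .format("jdbc")
    # stringtype=unspecified makes the PostgreSQL JDBC driver send strings as
    # untyped literals, letting the server cast them to cidr on insert.
    .option("url", "jdbc:postgresql://host:5432/db?stringtype=unspecified")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "traffic_info")
    .option("user", "xxxx")
    .option("password", "xxxx")
    .mode("append")
    .save())
```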

Spark SQL lag: result gets different rows when I change the column

Submitted by 允我心安 on 2021-02-11 14:32:41
Question: I'm trying to lag a field when it matches certain conditions, and because I need to use filters, I'm using the MAX function to lag it, as the LAG function itself doesn't work the way I need it to. I have been able to do it with the code below for ID_EVENT_LOG, but when I change the ID_EVENT_LOG inside the MAX to the column ENSAIO, so that I would lag the column ENSAIO, it doesn't work properly. Example below.

Dataset:

+------------+---------+------+
|ID_EVENT_LOG|ID_PAINEL|ENSAIO|
+------------+
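The dataset and the MAX-based workaround are cut off above; for reference, a minimal PySpark sketch of a plain lag over a window, reusing the column names from the excerpt. The partitioning and ordering columns are assumptions, since the full window spec isn't shown:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Assumed window: one frame per panel, ordered by the event-log id.
w = Window.partitionBy("ID_PAINEL").orderBy("ID_EVENT_LOG")

df = df.withColumn("ENSAIO_LAGGED", F.lag("ENSAIO", 1).over(w))
```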

Spark Scala checkpointing: Dataset shows .isCheckpointed = false after an action, but checkpoint directories are written

Submitted by 元气小坏坏 on 2021-02-11 14:21:49
Question: There seem to be a few postings on this, but none seem to answer what I understand. The following code, run on Databricks:

spark.sparkContext.setCheckpointDir("/dbfs/FileStore/checkpoint/cp1/loc7")
val checkpointDir = spark.sparkContext.getCheckpointDir.get
val ds = spark.range(10).repartition(2)
ds.cache()
ds.checkpoint()
ds.count()
ds.rdd.isCheckpointed

Added an improvement of sorts:

...
val ds2 = ds.checkpoint(eager=true)
println(ds2.queryExecution.toRdd.toDebugString)
...

returns:

(2)
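The key point behind this behaviour is that Dataset.checkpoint() does not mark the original Dataset as checkpointed; it returns a new Dataset backed by the checkpointed data, which is why ds.rdd.isCheckpointed stays false even though the checkpoint directory gets written. The code above is Scala; here is only a PySpark sketch of the same pattern:

```python
spark.sparkContext.setCheckpointDir("/dbfs/FileStore/checkpoint/cp1/loc7")

ds = spark.range(10).repartition(2)

# checkpoint() returns a *new* DataFrame; keep using the returned one.
ds2 = ds.checkpoint(eager=True)

# The lineage of the returned DataFrame now starts from the checkpoint files.
print(ds2.rdd.toDebugString().decode("utf-8"))
```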

Spark streaming - consuming message from socket and processing: Null Pointer Exception

Submitted by 大城市里の小女人 on 2021-02-11 14:21:48
Question: I need to consume the message from the socket using Spark Streaming, read the file from the file path specified in the message, and write it to the destination.

Message from socket:

{"fileName" : "sampleFile.dat","filePath":"/Users/Desktop/test/abc1.dat","fileDst":"/Users/Desktop/git/spark-streaming-poc/src/main/resourcs/samplefile2"}

Error:

java.lang.NullPointerException
  at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:56)
  at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.metrics
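The stack trace points at Spark's planner being used where no active SparkSession is available, which typically happens when it is touched from executor-side code. A Structured Streaming sketch of one way to avoid that, doing the file read/write on the driver via foreachBatch; the host, port and the JSON field names follow the message shown above, everything else is an assumption:

```python
import json

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("socket-file-mover").getOrCreate()

lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")   # assumed host/port
         .option("port", 9999)
         .load())

def handle_batch(batch_df, batch_id):
    # foreachBatch runs on the driver, so the SparkSession can safely be used
    # here; touching it from map/foreach on executors is what triggers the NPE.
    for row in batch_df.collect():
        msg = json.loads(row["value"])
        (spark.read.text(msg["filePath"])
              .write.mode("overwrite")
              .text(msg["fileDst"]))

query = lines.writeStream.foreachBatch(handle_batch).start()
query.awaitTermination()
```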

How to subtract a vector from a scalar in Scala?

Submitted by 与世无争的帅哥 on 2021-02-11 14:20:00
Question: I have a parquet file which contains two columns (id, features). I want to subtract a scalar from features and divide the output by another scalar.

df.withColumn("features", ((df("features") - constant1) / constant2))

but it gives me the error:

requirement failed: The number of columns doesn't match.
Old column names (2): id, features
New column names (1): features

How do I solve it?

Answer 1: My Scala Spark code for this is below. The only way to do any operation on the vector (Spark ML) datatype is casting it to string.
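The Scala answer is cut off above (it goes on to cast the vector to a string). An alternative, shown here only as a PySpark sketch rather than the answer's string-casting approach, is a small UDF that rebuilds the ML vector element-wise; constant1 and constant2 are placeholders:

```python
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

constant1 = 2.0   # placeholder scalars
constant2 = 3.0

@F.udf(returnType=VectorUDT())
def shift_and_scale(v):
    if v is None:
        return None
    # Element-wise (x - constant1) / constant2 over the ML vector.
    return Vectors.dense([(x - constant1) / constant2 for x in v.toArray()])

df = df.withColumn("features", shift_and_scale("features"))
```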

How to read a Parquet file, change datatypes and write to another Parquet file in Hadoop using PySpark

Submitted by 爱⌒轻易说出口 on 2021-02-11 14:10:27
Question: My source parquet file has everything as string. My destination parquet file needs these converted to different datatypes like int, string, date, etc. How do I do this?

Answer 1: You may want to apply a user-defined schema to speed up data loading. There are two ways to apply it:

Using an input DDL-formatted string:

spark.read.schema("a INT, b STRING, c DOUBLE").parquet("test.parquet")

Using a StructType schema:

customSchema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", StringType(
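The StructType example is truncated above. Since the question is about converting columns that were all written as strings, here is a minimal sketch of the cast-after-read route; the column names a, b, c are reused from the answer's example, and which target type each column gets is an assumption:

```python
from pyspark.sql import functions as F

df = spark.read.parquet("test.parquet")            # every column is a string here

converted = df.select(
    F.col("a").cast("int"),
    F.col("b"),                                    # stays a string
    F.col("c").cast("date"),                       # expects e.g. 'yyyy-MM-dd' strings
)

converted.write.mode("overwrite").parquet("converted.parquet")
```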