spark-dataframe

Extract value from structure within an array of arrays in spark using scala

Submitted on 2019-12-10 22:49:58
Question: I am reading JSON data into a Spark data frame using Scala. The schema is as follows:

    root
     |-- metadata: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- playerId: string (nullable = true)
     |    |    |-- sources: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- matchId: long (nullable = true)

The data looks as follows:

    { "metadata" : [ { "playerId" : "1234", "sources" : [ { "matchId": 1 } ] }, { "playerId": "1235", "sources": [ { "matchId":
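
One way to get at matchId is to explode the outer metadata array and then the inner sources array. The question asks for Scala, but the DataFrame calls translate one-to-one, so the sketch below uses PySpark (the language used for the other sketches in this listing); the file path and variable names are illustrative, not taken from the question.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("nested-extract").getOrCreate()

    # Assumed input: a DataFrame with the schema shown above, read from JSON.
    df = spark.read.json("metadata.json")  # illustrative path

    # One row per metadata element, then one row per (playerId, matchId) pair.
    result = (df
        .select(F.explode("metadata").alias("m"))
        .select(F.col("m.playerId").alias("playerId"),
                F.explode("m.sources").alias("s"))
        .select("playerId", F.col("s.matchId").alias("matchId")))

    result.show()

The Scala version is the same chain of select/explode calls on the same columns.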

How to create a dataframe from a dictionary where each item is a column in PySpark

Submitted on 2019-12-10 22:34:11
Question: I want to make a new dataframe from a dictionary. The dictionary contains column names as keys and lists of columnar data as values. For example:

    col_dict = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}

I need this as a dataframe that looks like this:

    +------+------+
    | col1 | col2 |
    +------+------+
    |    1 |    4 |
    |    2 |    5 |
    |    3 |    6 |
    +------+------+

It doesn't seem like there's an easy way to do this.

Answer 1: The easiest way is to create a pandas DataFrame and convert it to a Spark DataFrame. With pandas:

    col_dict = {
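
The answer is cut off right at the code; here is a hedged sketch of the pandas route it starts to describe, plus a pandas-free variant that zips the columns into rows. Only col_dict comes from the question; the session setup is illustrative.

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dict-to-df").getOrCreate()

    col_dict = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}

    # Route 1: go through pandas, as the truncated answer suggests.
    sdf = spark.createDataFrame(pd.DataFrame(col_dict))

    # Route 2: zip the columns into rows and pass the keys as column names,
    # avoiding the pandas dependency.
    rows = list(zip(*col_dict.values()))
    sdf2 = spark.createDataFrame(rows, schema=list(col_dict.keys()))

    sdf.show()
    sdf2.show()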

spark Dataframe execute UPDATE statement

Submitted on 2019-12-10 22:10:45
Question: Hi guys, I need to perform a JDBC operation using an Apache Spark DataFrame. Basically I have a historical JDBC table called Measures where I have to do two operations:

1. Set the endTime validity attribute of the old measure record to the current time
2. Insert a new measure record, setting endTime to 9999-12-31

Can someone tell me how to perform (if we can) an UPDATE statement for the first operation and an INSERT for the second operation? I tried to use this statement for the first operation:

    val
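
Spark's JDBC DataFrame writer only offers insert-style save modes (append, overwrite, ignore, error), so it cannot express the UPDATE in step 1. A common split is to run the UPDATE over an ordinary database connection and append the new row through Spark. A rough PySpark sketch of that split, with a made-up PostgreSQL target; the connection details, measureId value, and column names beyond Measures/endTime are invented for illustration:

    import psycopg2  # assumed DB-API driver for the illustrative PostgreSQL target
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("measures-history").getOrCreate()

    url = "jdbc:postgresql://host:5432/db"  # illustrative connection details
    props = {"user": "user", "password": "secret", "driver": "org.postgresql.Driver"}

    # Operation 1: the DataFrame writer cannot express UPDATE, so close the old
    # record's validity through an ordinary database connection.
    conn = psycopg2.connect(host="host", dbname="db", user="user", password="secret")
    with conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE Measures SET endTime = now() "
            "WHERE measureId = %s AND endTime = '9999-12-31'",
            (42,))
    conn.close()

    # Operation 2: insert the new measure record via the DataFrame writer.
    new_measures = spark.createDataFrame(
        [(42, "2019-12-10 22:10:45", "9999-12-31")],
        ["measureId", "startTime", "endTime"])
    new_measures.write.jdbc(url=url, table="Measures", mode="append", properties=props)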

spark inconsistency when running count command

Submitted on 2019-12-10 21:27:29
Question: A question about inconsistency of Spark calculations. Does this exist? For example, I am running EXACTLY the same command twice, e.g.:

    imp_sample.where(col("location").isNotNull()).count()

and I am getting slightly different results every time I run it (141,830, then 142,314)! Or this:

    imp_sample.where(col("location").isNull()).count()

and getting 2,587,013, and then 2,586,943. How is it even possible? Thank you!

Answer 1: As per your comment, you are using sampleBy in your pipeline. sampleBy
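
The answer breaks off at sampleBy. A hedged sketch of the usual remedy: because every action re-evaluates the lineage, a sample taken earlier in the pipeline can be re-drawn between two count() calls, so fixing the seed and materialising the sampled DataFrame keeps repeated counts consistent. raw_df, the "label" column, and its fractions are invented for illustration; imp_sample mirrors the question.

    from pyspark.sql import functions as F

    # Pin the seed so the stratified sample is reproducible ...
    sampled = raw_df.sampleBy("label", fractions={0: 0.1, 1: 0.5}, seed=42)

    # ... and materialise it once, so later actions reuse the cached rows
    # instead of re-running the sampling step.
    imp_sample = sampled.persist()
    imp_sample.count()

    print(imp_sample.where(F.col("location").isNotNull()).count())
    print(imp_sample.where(F.col("location").isNull()).count())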

Spark giving Null Pointer Exception while performing jdbc save

Submitted on 2019-12-10 20:59:41
Question: Hi, I am getting the following stack trace when I execute the following lines of code:

    transactionDF.write.format("jdbc")
      .option("url", SqlServerUri)
      .option("driver", driver)
      .option("dbtable", fullQualifiedName)
      .option("user", SqlServerUser)
      .option("password", SqlServerPassword)
      .mode(SaveMode.Append).save()

The following is the stack trace:

    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_3$(Unknown Source)
    at org.apache.spark.sql.catalyst

Python Spark DataFrame: replace null with SparseVector

Submitted on 2019-12-10 18:19:48
Question: In Spark, I have the following data frame called "df" with some null entries:

    +-------+--------------------+--------------------+
    |     id|           features1|           features2|
    +-------+--------------------+--------------------+
    |    185|(5,[0,1,4],[0.1,0...|                null|
    |    220|(5,[0,2,3],[0.1,0...|(10,[1,2,6],[0.1,...|
    |    225|                null|(10,[1,3,5],[0.1,...|
    +-------+--------------------+--------------------+

df.features1 and df.features2 are of type vector (nullable). Then I tried to use the following code to fill null entries with
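
The question is cut off before the attempted code, but fillna() cannot supply vector values, so the usual workaround is a small UDF that swaps nulls for an empty SparseVector of the right dimensionality. A sketch, assuming the columns hold pyspark.ml vectors (for the older mllib vectors the imports change accordingly); the sizes 5 and 10 are read off the sample rows above:

    from pyspark.ml.linalg import SparseVector, VectorUDT
    from pyspark.sql import functions as F

    def fill_with_empty_vector(size):
        # Return the vector unchanged, or an all-zero SparseVector when null.
        return F.udf(lambda v: v if v is not None else SparseVector(size, {}), VectorUDT())

    df_filled = (df
        .withColumn("features1", fill_with_empty_vector(5)(F.col("features1")))
        .withColumn("features2", fill_with_empty_vector(10)(F.col("features2"))))

    df_filled.show(truncate=False)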

How to do a self join in Spark 2.3.0? What is the correct syntax?

Submitted on 2019-12-10 17:38:23
Question: I have the following code:

    import org.apache.spark.sql.streaming.Trigger

    val jdf = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "join_test")
      .option("startingOffsets", "earliest")
      .load()

    jdf.createOrReplaceTempView("table")

    val resultdf = spark.sql("select * from table as x inner join table as y on x.offset=y.offset")

    resultdf.writeStream.outputMode("append").format("console").option("truncate", false).trigger(Trigger.ProcessingTime
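
The code is cut off mid-trigger, but self-joining a single streaming view in Spark 2.3 is where people typically hit trouble; one commonly suggested workaround is to load the topic twice and join the two independently loaded streams. A hedged PySpark sketch of that shape (the Scala version is the same chain of calls); whether it is needed depends on the exact Spark version:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-self-join").getOrCreate()

    def read_topic():
        # Load the same Kafka topic as a fresh streaming DataFrame each time.
        return (spark.readStream.format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "join_test")
                .option("startingOffsets", "earliest")
                .load())

    left = read_topic()
    right = read_topic()

    # Join two independently loaded streams instead of self-joining one temp view.
    result = left.join(right, left["offset"] == right["offset"], "inner")

    query = (result.writeStream
             .outputMode("append")
             .format("console")
             .option("truncate", "false")
             .start())
    query.awaitTermination()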

Spark 2.2 cannot write df to parquet

Submitted on 2019-12-10 17:32:54
Question: I'm building a clustering algorithm and I need to store the model for future loading. I have a dataframe with this schema:

    val schema = new StructType()
      .add(StructField("uniqueId", LongType))
      .add(StructField("timestamp", LongType))
      .add(StructField("pt", ArrayType(DoubleType)))
      .add(StructField("norm", DoubleType))
      .add(StructField("kNN", ArrayType(LongType)))
      .add(StructField("kDist", DoubleType))
      .add(StructField("lrd", DoubleType))
      .add(StructField("lof", DoubleType))
      .add(StructField("
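
The question breaks off before the actual error, so the sketch below is only a sanity check: an equivalent schema (arrays of doubles and longs are representable in Parquet) written out and read back in PySpark. Everything here, including the path and the sample row, is illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, LongType,
                                   DoubleType, ArrayType)

    spark = SparkSession.builder.appName("parquet-roundtrip").getOrCreate()

    # PySpark equivalent of the (truncated) Scala schema above.
    schema = StructType([
        StructField("uniqueId", LongType()),
        StructField("timestamp", LongType()),
        StructField("pt", ArrayType(DoubleType())),
        StructField("norm", DoubleType()),
        StructField("kNN", ArrayType(LongType())),
        StructField("kDist", DoubleType()),
        StructField("lrd", DoubleType()),
        StructField("lof", DoubleType()),
    ])

    rows = [(1, 1575936000, [0.1, 0.2], 1.0, [2, 3], 0.5, 0.9, 1.1)]
    df = spark.createDataFrame(rows, schema)

    df.write.mode("overwrite").parquet("/tmp/model_parquet")  # illustrative path
    spark.read.parquet("/tmp/model_parquet").show()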

Different floating point precision from RDD and DataFrame

Submitted on 2019-12-10 17:28:02
Question: I changed an RDD to a DataFrame and compared the results with another DataFrame which I imported using read.csv, but the floating point precision is not the same between the two approaches. I appreciate your help. The data I am using is from here.

    from pyspark.sql import Row
    from pyspark.sql.types import *

RDD way:

    orders = sc.textFile("retail_db/orders")
    order_items = sc.textFile('retail_db/order_items')
    orders_comp = orders.filter(lambda line: ((line.split(',')[-1] == 'CLOSED') or (line.split(',')
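
The question is cut off, but a common source of this kind of mismatch is the two pipelines ending up with different column types (string, float, double) and then being compared for exact equality. A hedged sketch: pin an explicit schema on the CSV read and round before comparing. The order_items column names below are the usual retail_db layout, assumed rather than copied from the truncated code.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

    spark = SparkSession.builder.appName("precision-check").getOrCreate()

    # Explicit schema so the column types match whatever the RDD pipeline casts to.
    order_items_schema = StructType([
        StructField("order_item_id", IntegerType()),
        StructField("order_item_order_id", IntegerType()),
        StructField("order_item_product_id", IntegerType()),
        StructField("order_item_quantity", IntegerType()),
        StructField("order_item_subtotal", DoubleType()),
        StructField("order_item_product_price", DoubleType()),
    ])

    order_items_df = spark.read.csv("retail_db/order_items", schema=order_items_schema)

    # Round (or compare with a tolerance) when checking the two pipelines against
    # each other, rather than expecting bit-identical floating point sums.
    (order_items_df.groupBy("order_item_order_id")
        .agg(F.round(F.sum("order_item_subtotal"), 2).alias("revenue"))
        .show())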

Spark Dataframe - Best way to cogroup dataframes

Submitted on 2019-12-10 16:53:50
Question: I currently load CSV files into DataFrames using the databricks library. I'm looking for the best generic approach to cogroup my loaded DataFrames using a specific key, since the cogroup operation is only available for PairRDDs. I found this post, which implements a cogroup feature for DataFrames, but I guess there are some different approaches: https://gist.github.com/ahoy-jon/b65754cde98cc48b9b38 Have you ever faced this situation? Thanks.

Source: https://stackoverflow.com/questions
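
Two generic ways to get cogroup-like behaviour, sketched in PySpark with made-up toy DataFrames: drop to the underlying PairRDDs, where cogroup exists natively, or stay in the DataFrame API with collect_list plus a full outer join.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cogroup-dataframes").getOrCreate()

    # Two illustrative DataFrames sharing a join key.
    left = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["k", "l_val"])
    right = spark.createDataFrame([(1, "x"), (3, "y")], ["k", "r_val"])

    # Option 1: fall back to PairRDDs, where cogroup exists natively.
    cogrouped_rdd = (left.rdd.keyBy(lambda r: r["k"])
                     .cogroup(right.rdd.keyBy(lambda r: r["k"])))
    # Each element is (key, (iterable of left Rows, iterable of right Rows)).

    # Option 2: stay in the DataFrame API -- collect each side's rows per key
    # and stitch them together with a full outer join.
    left_g = left.groupBy("k").agg(F.collect_list("l_val").alias("left_vals"))
    right_g = right.groupBy("k").agg(F.collect_list("r_val").alias("right_vals"))
    cogrouped_df = left_g.join(right_g, on="k", how="full_outer")
    cogrouped_df.show()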