spark-dataframe

Extract value from structure within an array of arrays in spark using scala

Submitted on 2019-12-10 22:49:58
Question: I am reading JSON data into a Spark data frame using Scala. The schema is as follows:

    root
     |-- metadata: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- playerId: string (nullable = true)
     |    |    |-- sources: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- matchId: long (nullable = true)

The data looks as follows:

    { "metadata" : [ { "playerId" : "1234", "sources" : [ { "matchId": 1 } ] }, { "playerId": "1235", "sources": [ { "matchId":
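
One way to get at matchId is to explode the outer metadata array and then the inner sources array. The question asks for Scala, but the DataFrame calls translate one-to-one, so the sketch below uses PySpark (the language used for the other sketches in this listing); the file path and variable names are illustrative, not taken from the question.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("nested-extract").getOrCreate()

    # Assumed input: a DataFrame with the schema shown above, read from JSON.
    df = spark.read.json("metadata.json")  # illustrative path

    # One row per metadata element, then one row per (playerId, matchId) pair.
    result = (df
        .select(F.explode("metadata").alias("m"))
        .select(F.col("m.playerId").alias("playerId"),
                F.explode("m.sources").alias("s"))
        .select("playerId", F.col("s.matchId").alias("matchId")))

    result.show()

The Scala version is the same chain of select/explode calls on the same columns.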

How to create a dataframe from a dictionary where each item is a column in PySpark

Submitted on 2019-12-10 22:34:11
Question: I want to make a new dataframe from a dictionary. The dictionary contains column names as keys and lists of columnar data as values. For example:

    col_dict = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}

I need this as a dataframe that looks like this:

    +------+------+
    | col1 | col2 |
    +------+------+
    |    1 |    4 |
    |    2 |    5 |
    |    3 |    6 |
    +------+------+

It doesn't seem like there's an easy way to do this.

Answer 1: The easiest way is to create a pandas DataFrame and convert it to a Spark DataFrame. With pandas:

    col_dict = {
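
The answer is cut off right at the code; here is a hedged sketch of the pandas route it starts to describe, plus a pandas-free variant that zips the columns into rows. Only col_dict comes from the question; the session setup is illustrative.

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("dict-to-df").getOrCreate()

    col_dict = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}

    # Route 1: go through pandas, as the truncated answer suggests.
    sdf = spark.createDataFrame(pd.DataFrame(col_dict))

    # Route 2: zip the columns into rows and pass the keys as column names,
    # avoiding the pandas dependency.
    rows = list(zip(*col_dict.values()))
    sdf2 = spark.createDataFrame(rows, schema=list(col_dict.keys()))

    sdf.show()
    sdf2.show()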

spark Dataframe execute UPDATE statement

Submitted on 2019-12-10 22:10:45
Question: Hi guys, I need to perform a JDBC operation using an Apache Spark DataFrame. Basically I have a historical JDBC table called Measures where I have to do two operations:

1. Set the endTime validity attribute of the old measure record to the current time
2. Insert a new measure record, setting endTime to 9999-12-31

Can someone tell me how to perform (if we can) an UPDATE statement for the first operation and an INSERT for the second operation? I tried to use this statement for the first operation:

    val
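
Spark's JDBC DataFrame writer only offers insert-style save modes (append, overwrite, ignore, error), so it cannot express the UPDATE in step 1. A common split is to run the UPDATE over an ordinary database connection and append the new row through Spark. A rough PySpark sketch of that split, with a made-up PostgreSQL target; the connection details, measureId value, and column names beyond Measures/endTime are invented for illustration:

    import psycopg2  # assumed DB-API driver for the illustrative PostgreSQL target
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("measures-history").getOrCreate()

    url = "jdbc:postgresql://host:5432/db"  # illustrative connection details
    props = {"user": "user", "password": "secret", "driver": "org.postgresql.Driver"}

    # Operation 1: the DataFrame writer cannot express UPDATE, so close the old
    # record's validity through an ordinary database connection.
    conn = psycopg2.connect(host="host", dbname="db", user="user", password="secret")
    with conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE Measures SET endTime = now() "
            "WHERE measureId = %s AND endTime = '9999-12-31'",
            (42,))
    conn.close()

    # Operation 2: insert the new measure record via the DataFrame writer.
    new_measures = spark.createDataFrame(
        [(42, "2019-12-10 22:10:45", "9999-12-31")],
        ["measureId", "startTime", "endTime"])
    new_measures.write.jdbc(url=url, table="Measures", mode="append", properties=props)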

spark inconsistency when running count command

Submitted on 2019-12-10 21:27:29
Question: A question about inconsistency of Spark calculations. Does this exist? For example, I am running EXACTLY the same command twice, e.g.:

    imp_sample.where(col("location").isNotNull()).count()

and I am getting slightly different results every time I run it (141,830, then 142,314)! Or this:

    imp_sample.where(col("location").isNull()).count()

and getting 2,587,013, and then 2,586,943. How is it even possible? Thank you!

Answer 1: As per your comment, you are using sampleBy in your pipeline. sampleBy
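
The answer breaks off at sampleBy. A hedged sketch of the usual remedy: because every action re-evaluates the lineage, a sample taken earlier in the pipeline can be re-drawn between two count() calls, so fixing the seed and materialising the sampled DataFrame keeps repeated counts consistent. raw_df, the "label" column, and its fractions are invented for illustration; imp_sample mirrors the question.

    from pyspark.sql import functions as F

    # Pin the seed so the stratified sample is reproducible ...
    sampled = raw_df.sampleBy("label", fractions={0: 0.1, 1: 0.5}, seed=42)

    # ... and materialise it once, so later actions reuse the cached rows
    # instead of re-running the sampling step.
    imp_sample = sampled.persist()
    imp_sample.count()

    print(imp_sample.where(F.col("location").isNotNull()).count())
    print(imp_sample.where(F.col("location").isNull()).count())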

Spark giving Null Pointer Exception while performing jdbc save

Submitted on 2019-12-10 20:59:41
Question: Hi, I am getting the following stack trace when I execute the following lines of code:

    transactionDF.write.format("jdbc")
      .option("url", SqlServerUri)
      .option("driver", driver)
      .option("dbtable", fullQualifiedName)
      .option("user", SqlServerUser)
      .option("password", SqlServerPassword)
      .mode(SaveMode.Append).save()

The following is the stack trace:

    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_3$(Unknown Source)
    at org.apache.spark.sql.catalyst

Python Spark DataFrame: replace null with SparseVector

Submitted on 2019-12-10 18:19:48
Question: In Spark, I have the following data frame called "df" with some null entries:

    +-------+--------------------+--------------------+
    |     id|           features1|           features2|
    +-------+--------------------+--------------------+
    |    185|(5,[0,1,4],[0.1,0...|                null|
    |    220|(5,[0,2,3],[0.1,0...|(10,[1,2,6],[0.1,...|
    |    225|                null|(10,[1,3,5],[0.1,...|
    +-------+--------------------+--------------------+

df.features1 and df.features2 are of type vector (nullable). Then I tried to use the following code to fill null entries with
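
The question is cut off before the attempted code, but fillna() cannot supply vector values, so the usual workaround is a small UDF that swaps nulls for an empty SparseVector of the right dimensionality. A sketch, assuming the columns hold pyspark.ml vectors (for the older mllib vectors the imports change accordingly); the sizes 5 and 10 are read off the sample rows above:

    from pyspark.ml.linalg import SparseVector, VectorUDT
    from pyspark.sql import functions as F

    def fill_with_empty_vector(size):
        # Return the vector unchanged, or an all-zero SparseVector when null.
        return F.udf(lambda v: v if v is not None else SparseVector(size, {}), VectorUDT())

    df_filled = (df
        .withColumn("features1", fill_with_empty_vector(5)(F.col("features1")))
        .withColumn("features2", fill_with_empty_vector(10)(F.col("features2"))))

    df_filled.show(truncate=False)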

How to do a self join in Spark 2.3.0? What is the correct syntax?

Submitted on 2019-12-10 17:38:23
Question: I have the following code:

    import org.apache.spark.sql.streaming.Trigger

    val jdf = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "join_test")
      .option("startingOffsets", "earliest")
      .load()

    jdf.createOrReplaceTempView("table")

    val resultdf = spark.sql("select * from table as x inner join table as y on x.offset=y.offset")

    resultdf.writeStream.outputMode("append").format("console").option("truncate", false).trigger(Trigger.ProcessingTime
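
The code is cut off mid-trigger, but self-joining a single streaming view in Spark 2.3 is where people typically hit trouble; one commonly suggested workaround is to load the topic twice and join the two independently loaded streams. A hedged PySpark sketch of that shape (the Scala version is the same chain of calls); whether it is needed depends on the exact Spark version:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-self-join").getOrCreate()

    def read_topic():
        # Load the same Kafka topic as a fresh streaming DataFrame each time.
        return (spark.readStream.format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "join_test")
                .option("startingOffsets", "earliest")
                .load())

    left = read_topic()
    right = read_topic()

    # Join two independently loaded streams instead of self-joining one temp view.
    result = left.join(right, left["offset"] == right["offset"], "inner")

    query = (result.writeStream
             .outputMode("append")
             .format("console")
             .option("truncate", "false")
             .start())
    query.awaitTermination()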

Spark 2.2 cannot write df to parquet

Submitted on 2019-12-10 17:32:54
Question: I'm building a clustering algorithm and I need to store the model for future loading. I have a dataframe with this schema:

    val schema = new StructType()
      .add(StructField("uniqueId", LongType))
      .add(StructField("timestamp", LongType))
      .add(StructField("pt", ArrayType(DoubleType)))
      .add(StructField("norm", DoubleType))
      .add(StructField("kNN", ArrayType(LongType)))
      .add(StructField("kDist", DoubleType))
      .add(StructField("lrd", DoubleType))
      .add(StructField("lof", DoubleType))
      .add(StructField("
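
The question breaks off before the actual error, so the sketch below is only a sanity check: an equivalent schema (arrays of doubles and longs are representable in Parquet) written out and read back in PySpark. Everything here, including the path and the sample row, is illustrative.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, LongType,
                                   DoubleType, ArrayType)

    spark = SparkSession.builder.appName("parquet-roundtrip").getOrCreate()

    # PySpark equivalent of the (truncated) Scala schema above.
    schema = StructType([
        StructField("uniqueId", LongType()),
        StructField("timestamp", LongType()),
        StructField("pt", ArrayType(DoubleType())),
        StructField("norm", DoubleType()),
        StructField("kNN", ArrayType(LongType())),
        StructField("kDist", DoubleType()),
        StructField("lrd", DoubleType()),
        StructField("lof", DoubleType()),
    ])

    rows = [(1, 1575936000, [0.1, 0.2], 1.0, [2, 3], 0.5, 0.9, 1.1)]
    df = spark.createDataFrame(rows, schema)

    df.write.mode("overwrite").parquet("/tmp/model_parquet")  # illustrative path
    spark.read.parquet("/tmp/model_parquet").show()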

Different floating point precision from RDD and DataFrame

Submitted on 2019-12-10 17:28:02
Question: I changed an RDD to a DataFrame and compared the results with another DataFrame which I imported using read.csv, but the floating point precision is not the same between the two approaches. I appreciate your help. The data I am using is from here.

    from pyspark.sql import Row
    from pyspark.sql.types import *

RDD way:

    orders = sc.textFile("retail_db/orders")
    order_items = sc.textFile('retail_db/order_items')
    orders_comp = orders.filter(lambda line: ((line.split(',')[-1] == 'CLOSED') or (line.split(',')
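
The question is cut off, but a common source of this kind of mismatch is the two pipelines ending up with different column types (string, float, double) and then being compared for exact equality. A hedged sketch: pin an explicit schema on the CSV read and round before comparing. The order_items column names below are the usual retail_db layout, assumed rather than copied from the truncated code.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

    spark = SparkSession.builder.appName("precision-check").getOrCreate()

    # Explicit schema so the column types match whatever the RDD pipeline casts to.
    order_items_schema = StructType([
        StructField("order_item_id", IntegerType()),
        StructField("order_item_order_id", IntegerType()),
        StructField("order_item_product_id", IntegerType()),
        StructField("order_item_quantity", IntegerType()),
        StructField("order_item_subtotal", DoubleType()),
        StructField("order_item_product_price", DoubleType()),
    ])

    order_items_df = spark.read.csv("retail_db/order_items", schema=order_items_schema)

    # Round (or compare with a tolerance) when checking the two pipelines against
    # each other, rather than expecting bit-identical floating point sums.
    (order_items_df.groupBy("order_item_order_id")
        .agg(F.round(F.sum("order_item_subtotal"), 2).alias("revenue"))
        .show())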

Spark Dataframe - Best way to cogroup dataframes

Submitted on 2019-12-10 16:53:50
Question: I currently load CSV files into DataFrames using the databricks library. I'm looking for the best generic approach to cogroup my loaded DataFrames using a specific key, since the cogroup operation is only available for PairRDDs. I found this post, which implements a cogroup feature for DataFrames, but I guess there are some different approaches: https://gist.github.com/ahoy-jon/b65754cde98cc48b9b38 Have you ever faced this situation? Thanks.

Source: https://stackoverflow.com/questions
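
Two generic ways to get cogroup-like behaviour, sketched in PySpark with made-up toy DataFrames: drop to the underlying PairRDDs, where cogroup exists natively, or stay in the DataFrame API with collect_list plus a full outer join.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("cogroup-dataframes").getOrCreate()

    # Two illustrative DataFrames sharing a join key.
    left = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["k", "l_val"])
    right = spark.createDataFrame([(1, "x"), (3, "y")], ["k", "r_val"])

    # Option 1: fall back to PairRDDs, where cogroup exists natively.
    cogrouped_rdd = (left.rdd.keyBy(lambda r: r["k"])
                     .cogroup(right.rdd.keyBy(lambda r: r["k"])))
    # Each element is (key, (iterable of left Rows, iterable of right Rows)).

    # Option 2: stay in the DataFrame API -- collect each side's rows per key
    # and stitch them together with a full outer join.
    left_g = left.groupBy("k").agg(F.collect_list("l_val").alias("left_vals"))
    right_g = right.groupBy("k").agg(F.collect_list("r_val").alias("right_vals"))
    cogrouped_df = left_g.join(right_g, on="k", how="full_outer")
    cogrouped_df.show()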