spark-dataframe

Spark: Difference between numPartitions in read.jdbc(..numPartitions..) and repartition(..numPartitions..)

Submitted by 十年热恋 on 2019-12-11 11:48:26
Question: I'm puzzled by the behaviour of the numPartitions parameter in the following two methods: DataFrameReader.jdbc and Dataset.repartition.

The official docs of DataFrameReader.jdbc say the following about numPartitions:

numPartitions: the number of partitions. This, along with lowerBound (inclusive), upperBound (exclusive), form partition strides for generated WHERE clause expressions used to split the column columnName evenly.

And the official docs of Dataset.repartition say:

Returns a new
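The distinction the two docs are pointing at can be sketched in Scala as follows (connection details, table and column names are hypothetical, and an existing SparkSession named spark is assumed): numPartitions in read.jdbc controls how many parallel JDBC queries are issued while reading, whereas repartition only reshuffles a Dataset that has already been loaded.

import java.util.Properties

val props = new Properties()
props.setProperty("user", "<username>")        // hypothetical credentials
props.setProperty("password", "<password>")

// Read-side parallelism: 10 JDBC queries, each covering one stride of the `id`
// column between lowerBound and upperBound.
val fromJdbc = spark.read.jdbc(
  "jdbc:mysql://<host>:3306/<db>",             // hypothetical URL
  "some_table",
  "id",            // columnName used to compute the WHERE-clause strides
  1L,              // lowerBound (inclusive)
  1000000L,        // upperBound (exclusive)
  10,              // numPartitions
  props)

// Shuffle-side parallelism: redistributes the already-loaded rows into 10 partitions;
// it does not change how the data was fetched from the database.
val reshuffled = fromJdbc.repartition(10)

println(s"after read: ${fromJdbc.rdd.getNumPartitions}, after repartition: ${reshuffled.rdd.getNumPartitions}")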

efficiently calculating connected components in pyspark

Submitted by 只谈情不闲聊 on 2019-12-11 11:02:36
Question: I'm trying to find the connected components of friends within each city. My data is a list of edges with a city attribute:

City    | SRC     | DEST
Houston | Kyle    | Benny
Houston | Benny   | Charles
Houston | Charles | Denny
Omaha   | Carol   | Brian
etc.

I know the connectedComponents function of PySpark's GraphX library will iterate over all the edges of the graph to find the connected components, and I'd like to avoid that. How would I do so?

Edit: I thought I could do something like select connected_components
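One way to keep components from ever spanning cities, sketched here in Scala with the GraphFrames package (the PySpark GraphFrames API mirrors it; column names and the checkpoint path are illustrative, and an existing SparkSession named spark is assumed), is to qualify every vertex id with its city, so a single connectedComponents run can never join people from different cities:

import org.apache.spark.sql.functions.concat_ws
import org.graphframes.GraphFrame

import spark.implicits._
spark.sparkContext.setCheckpointDir("/tmp/graphframes-ckpt")   // connectedComponents requires a checkpoint dir

val rawEdges = Seq(
  ("Houston", "Kyle", "Benny"),
  ("Houston", "Benny", "Charles"),
  ("Houston", "Charles", "Denny"),
  ("Omaha", "Carol", "Brian")
).toDF("city", "src_name", "dst_name")

// Prefix each person with their city so the graph is naturally disconnected across cities.
val edges = rawEdges
  .select(
    concat_ws("|", $"city", $"src_name").as("src"),
    concat_ws("|", $"city", $"dst_name").as("dst"),
    $"city")

val vertices = edges.select($"src".as("id")).union(edges.select($"dst".as("id"))).distinct()

val components = GraphFrame(vertices, edges).connectedComponents.run()
components.show(false)   // columns: id, component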

Pyspark Dataframe: Get previous row that meets a condition

Submitted by 只愿长相守 on 2019-12-11 09:24:27
Question: For every row in a PySpark DataFrame I am trying to get the value from the first preceding row that satisfied a certain condition. That is, if my dataframe looks like this:

X  | Flag
1  | 1
2  | 0
3  | 0
4  | 0
5  | 1
6  | 0
7  | 0
8  | 0
9  | 1
10 | 0

I want output that looks like this:

X  | Lag_X | Flag
1  | NULL  | 1
2  | 1     | 0
3  | 1     | 0
4  | 1     | 0
5  | 1     | 1
6  | 5     | 0
7  | 5     | 0
8  | 5     | 0
9  | 5     | 1
10 | 9     | 0

I thought I could do this with the lag function and a WindowSpec; unfortunately a WindowSpec doesn't
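What usually works here is not lag but a running last(..., ignoreNulls) over a window that stops one row before the current one. A Scala sketch (the PySpark functions last, when and Window behave the same way; a SparkSession named spark is assumed):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last, when}

import spark.implicits._

val df = Seq(
  (1, 1), (2, 0), (3, 0), (4, 0), (5, 1),
  (6, 0), (7, 0), (8, 0), (9, 1), (10, 0)
).toDF("X", "Flag")

// All preceding rows, excluding the current one, ordered by X.
// (No partitionBy here, so Spark will warn that all rows move to a single partition.)
val w = Window.orderBy("X").rowsBetween(Window.unboundedPreceding, -1)

// Carry forward the X of the most recent row whose Flag was 1; rows before the first flag get NULL.
val result = df.withColumn("Lag_X", last(when(col("Flag") === 1, col("X")), ignoreNulls = true).over(w))
result.orderBy("X").show()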

How to perform join on MySQL (JDBC) with Spark?

Submitted by 梦想的初衷 on 2019-12-11 08:12:07
Question: I would like to read data from MySQL through Spark. The API I have seen reads from a specific table, something like:

val prop = new java.util.Properties
prop.setProperty("user", "<username>")
prop.setProperty("password", "<password>")
sparkSession.read.jdbc("jdbc:mysql://????:3306/???", "some-table", prop)

Now I would like to run a query that joins tables. Does anyone know how to do that (on the database side, not with Spark SQL)? Thanks, Eran

Answer 1: You'll need to use the "table
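A Scala sketch of that direction: wrap the join in a parenthesized subquery with an alias and pass it where the table name would go, so MySQL performs the join and Spark only receives the result (host, database, table and column names below are hypothetical):

val prop = new java.util.Properties
prop.setProperty("user", "<username>")
prop.setProperty("password", "<password>")

// MySQL evaluates the join; Spark sees a single derived table named `t`.
val joinedQuery =
  "(SELECT o.id, o.amount, c.name FROM orders o JOIN customers c ON o.customer_id = c.id) AS t"

val df = sparkSession.read.jdbc("jdbc:mysql://<host>:3306/<db>", joinedQuery, prop)
df.show()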

scala - Resultset to spark Dataframe

Submitted by 我怕爱的太早我们不能终老 on 2019-12-11 07:59:11
Question: I am querying a MySQL table:

import java.sql.{Connection, DriverManager}

val url = "jdbc:mysql://XXX-XX-XXX-XX-XX.compute-1.amazonaws.com:3306/pg_partner"
val driver = "com.mysql.jdbc.Driver"
val username = "XXX"
val password = "XXX"
var connection: Connection = DriverManager.getConnection(url, username, password)
val statement = connection.createStatement()
val patnerName = statement.executeQuery("SELECT id,name FROM partner")

I do get my result in patnerName, but I need it converted to a DataFrame. I am able to print the data with the code below:
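One way to avoid converting the ResultSet by hand, sketched here as a different route from the raw JDBC Statement above (it reuses the url, username and password already defined in the question and assumes an existing SparkSession named spark), is to let Spark run the query itself:

val props = new java.util.Properties
props.setProperty("user", username)
props.setProperty("password", password)
props.setProperty("driver", "com.mysql.jdbc.Driver")

// The SELECT is wrapped as a derived table and read straight into a DataFrame.
val partnerDF = spark.read.jdbc(url, "(SELECT id, name FROM partner) AS p", props)
partnerDF.printSchema()
partnerDF.show()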

Spark: PartitionBy, change output file name

Submitted by 耗尽温柔 on 2019-12-11 07:26:38
Question: Currently, when I use partitionBy to write to HDFS:

DF.write.partitionBy("id")

I get an output structure like this (the default behaviour):

../id=1/
../id=2/
../id=3/

I would like a structure that looks like:

../a/
../b/
../c/

such that if id = 1 the directory is a, if id = 2 it is b, and so on. Is there a way to change the output naming? If not, what is the best way to do this?

Answer 1: You won't be able to use Spark's partitionBy to achieve this. Instead, you have to break your DataFrame into its
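A sketch along the lines of that answer (the id-to-name mapping, the output base path, and the assumption that id is an integer column are all hypothetical): filter out each id's slice and write it under the directory name you want.

import org.apache.spark.sql.functions.col

// Hypothetical mapping from id values to the desired directory names.
val dirFor = Map(1 -> "a", 2 -> "b", 3 -> "c")

// One write per distinct id, each under its mapped directory.
val ids = DF.select("id").distinct().collect().map(_.getInt(0))

ids.foreach { id =>
  val target = dirFor.getOrElse(id, id.toString)   // fall back to the raw id if unmapped
  DF.filter(col("id") === id)
    .drop("id")                                    // partitionBy would also drop the partition column
    .write
    .mode("overwrite")
    .parquet(s"hdfs:///output/$target")            // hypothetical base path
}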

Loop through dataframe and update the lookup table simultaneously: spark scala

Submitted by 我的未来我决定 on 2019-12-11 07:24:25
Question: I have a DataFrame like the following.

+---+-------------+-----+
| id|AccountNumber|scale|
+---+-------------+-----+
|  1|      1500847|    6|
|  2|      1501199|    7|
|  3|      1119024|    3|
+---+-------------+-----+

I have to populate a second DataFrame, which is initially empty, as follows.

id | AccountNumber | scale
1  | 1500847       | 6
2  | 1501199       | 6
3  | 1119024       | 3

Output explanation: the first row in the first DataFrame has a scale of 6. Check for that value minus 1 (so scale equals 5) in the result. There is none, so simply add
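Reconstructing the rule from the example (process rows in id order; if the result already holds a row whose scale is one less than the current scale, reuse that smaller value, otherwise keep the row's own scale), a small-data Scala sketch that collects the rows to the driver and keeps the lookup in a mutable set could look like this. It assumes the three columns are integers, the input DataFrame is named df, and a SparkSession named spark exists.

import scala.collection.mutable
import org.apache.spark.sql.Row

import spark.implicits._

val seen = mutable.Set[Int]()

// Small-data approach: apply the rule sequentially on the driver.
val updated = df.orderBy("id").collect().map { case Row(id: Int, account: Int, scale: Int) =>
  val newScale = if (seen.contains(scale - 1)) scale - 1 else scale
  seen += newScale
  (id, account, newScale)
}

val result = updated.toSeq.toDF("id", "AccountNumber", "scale")
result.show()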

how to get input file name of a record in spark dataframe?

Submitted by 三世轮回 on 2019-12-11 05:47:52
Question: I am creating a dataframe in Spark by loading tab-separated files from S3. I need the input file name of each record in the dataframe for further processing. I tried

dataframe.select(inputFileName())

but I am getting a null value for input_file_name. Could somebody please help me solve this issue?

Answer 1: You can create a new column on the data frame using withColumn and input_file_name():

dataframe.withColumn("input_file", input_file_name())

Source: https://stackoverflow.com
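For completeness, a minimal end-to-end sketch (the S3 path and read options are hypothetical; input_file_name lives in org.apache.spark.sql.functions, and a SparkSession named spark is assumed):

import org.apache.spark.sql.functions.input_file_name

val df = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("s3a://my-bucket/input/*.tsv")     // hypothetical tab-separated files on S3

// Attach the source file of every record as an ordinary column.
val withSource = df.withColumn("input_file", input_file_name())
withSource.select("input_file").distinct().show(false)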

How to get count of invalid data during parse

Submitted by ぐ巨炮叔叔 on 2019-12-11 05:28:53
Question: We are using Spark to parse a big CSV file, which may contain invalid data. We want to save the valid records into the data store and also return how many valid and how many invalid records we imported. I am wondering how we can do this in Spark; what is the standard approach when reading data? My current approach uses an Accumulator, but it is not accurate because of how accumulators work in Spark.

// we define case class CSVInputData: all fields are defined as string
val csvInput = spark.read.option(
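One common way to get exact counts without accumulators (not necessarily the thread's accepted answer; the path and schema below are hypothetical, and a SparkSession named spark is assumed) is to read in PERMISSIVE mode with a corrupt-record column, cache the result, and count the two groups:

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("_corrupt_record", StringType, nullable = true)   // receives the raw line when parsing fails
))

val parsed = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("s3a://my-bucket/big.csv")
  .cache()   // cache before filtering on _corrupt_record, which Spark otherwise restricts

val invalid = parsed.filter(parsed("_corrupt_record").isNotNull)
val valid   = parsed.filter(parsed("_corrupt_record").isNull).drop("_corrupt_record")

println(s"valid: ${valid.count()}, invalid: ${invalid.count()}")
// `valid` can now be written to the data store.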

Spark Data Frame write to parquet table - slow at updating partition stats

Submitted by 自古美人都是妖i on 2019-12-11 04:45:12
Question: When I write data from a dataframe into a partitioned Parquet table, the process gets stuck updating partition stats after all the tasks have completed successfully:

16/10/05 03:46:13 WARN log: Updating partition stats fast for:
16/10/05 03:46:14 WARN log: Updated size to 143452576
16/10/05 03:48:30 WARN log: Updating partition stats fast for:
16/10/05 03:48:31 WARN log: Updated size to 147382813
16/10/05 03:51:02 WARN log: Updating partition stats fast for:

df.write.format("parquet").mode(