spark-dataframe

Spark: Difference between numPartitions in read.jdbc(..numPartitions..) and repartition(..numPartitions..)

Submitted by 十年热恋 on 2019-12-11 11:48:26
Question: I'm puzzled by the behaviour of the numPartitions parameter in the following two methods: DataFrameReader.jdbc and Dataset.repartition.

The official docs of DataFrameReader.jdbc say the following about numPartitions:

numPartitions: the number of partitions. This, along with lowerBound (inclusive), upperBound (exclusive), form partition strides for generated WHERE clause expressions used to split the column columnName evenly.

And the official docs of Dataset.repartition say:

Returns a new
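The distinction the two docs are pointing at can be sketched in Scala as follows (connection details, table and column names are hypothetical, and an existing SparkSession named spark is assumed): numPartitions in read.jdbc controls how many parallel JDBC queries are issued while reading, whereas repartition only reshuffles a Dataset that has already been loaded.

import java.util.Properties

val props = new Properties()
props.setProperty("user", "<username>")        // hypothetical credentials
props.setProperty("password", "<password>")

// Read-side parallelism: 10 JDBC queries, each covering one stride of the `id`
// column between lowerBound and upperBound.
val fromJdbc = spark.read.jdbc(
  "jdbc:mysql://<host>:3306/<db>",             // hypothetical URL
  "some_table",
  "id",            // columnName used to compute the WHERE-clause strides
  1L,              // lowerBound (inclusive)
  1000000L,        // upperBound (exclusive)
  10,              // numPartitions
  props)

// Shuffle-side parallelism: redistributes the already-loaded rows into 10 partitions;
// it does not change how the data was fetched from the database.
val reshuffled = fromJdbc.repartition(10)

println(s"after read: ${fromJdbc.rdd.getNumPartitions}, after repartition: ${reshuffled.rdd.getNumPartitions}")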

efficiently calculating connected components in pyspark

Submitted by 只谈情不闲聊 on 2019-12-11 11:02:36
Question: I'm trying to find the connected components of friends within each city. My data is a list of edges with a city attribute:

City    | SRC     | DEST
Houston | Kyle    | Benny
Houston | Benny   | Charles
Houston | Charles | Denny
Omaha   | Carol   | Brian
etc.

I know the connectedComponents function of PySpark's GraphX library will iterate over all the edges of the graph to find the connected components, and I'd like to avoid that. How would I do so?

Edit: I thought I could do something like select connected_components
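One way to keep components from ever spanning cities, sketched here in Scala with the GraphFrames package (the PySpark GraphFrames API mirrors it; column names and the checkpoint path are illustrative, and an existing SparkSession named spark is assumed), is to qualify every vertex id with its city, so a single connectedComponents run can never join people from different cities:

import org.apache.spark.sql.functions.concat_ws
import org.graphframes.GraphFrame

import spark.implicits._
spark.sparkContext.setCheckpointDir("/tmp/graphframes-ckpt")   // connectedComponents requires a checkpoint dir

val rawEdges = Seq(
  ("Houston", "Kyle", "Benny"),
  ("Houston", "Benny", "Charles"),
  ("Houston", "Charles", "Denny"),
  ("Omaha", "Carol", "Brian")
).toDF("city", "src_name", "dst_name")

// Prefix each person with their city so the graph is naturally disconnected across cities.
val edges = rawEdges
  .select(
    concat_ws("|", $"city", $"src_name").as("src"),
    concat_ws("|", $"city", $"dst_name").as("dst"),
    $"city")

val vertices = edges.select($"src".as("id")).union(edges.select($"dst".as("id"))).distinct()

val components = GraphFrame(vertices, edges).connectedComponents.run()
components.show(false)   // columns: id, component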

Pyspark Dataframe: Get previous row that meets a condition

Submitted by 只愿长相守 on 2019-12-11 09:24:27
Question: For every row in a PySpark DataFrame I am trying to get the value from the first preceding row that satisfied a certain condition. That is, if my dataframe looks like this:

X  | Flag
1  | 1
2  | 0
3  | 0
4  | 0
5  | 1
6  | 0
7  | 0
8  | 0
9  | 1
10 | 0

I want output that looks like this:

X  | Lag_X | Flag
1  | NULL  | 1
2  | 1     | 0
3  | 1     | 0
4  | 1     | 0
5  | 1     | 1
6  | 5     | 0
7  | 5     | 0
8  | 5     | 0
9  | 5     | 1
10 | 9     | 0

I thought I could do this with the lag function and a WindowSpec; unfortunately a WindowSpec doesn't
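What usually works here is not lag but a running last(..., ignoreNulls) over a window that stops one row before the current one. A Scala sketch (the PySpark functions last, when and Window behave the same way; a SparkSession named spark is assumed):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last, when}

import spark.implicits._

val df = Seq(
  (1, 1), (2, 0), (3, 0), (4, 0), (5, 1),
  (6, 0), (7, 0), (8, 0), (9, 1), (10, 0)
).toDF("X", "Flag")

// All preceding rows, excluding the current one, ordered by X.
// (No partitionBy here, so Spark will warn that all rows move to a single partition.)
val w = Window.orderBy("X").rowsBetween(Window.unboundedPreceding, -1)

// Carry forward the X of the most recent row whose Flag was 1; rows before the first flag get NULL.
val result = df.withColumn("Lag_X", last(when(col("Flag") === 1, col("X")), ignoreNulls = true).over(w))
result.orderBy("X").show()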

How to perform join on MySQL (JDBC) with Spark?

Submitted by 梦想的初衷 on 2019-12-11 08:12:07
Question: I would like to read data from MySQL through Spark. The API I have seen reads from a specific table, something like:

val prop = new java.util.Properties
prop.setProperty("user", "<username>")
prop.setProperty("password", "<password>")
sparkSession.read.jdbc("jdbc:mysql://????:3306/???", "some-table", prop)

Now I would like to run a query that joins tables. Does anyone know how to do that (on the database side, not with Spark SQL)? Thanks, Eran

Answer 1: You'll need to use the "table
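A Scala sketch of that direction: wrap the join in a parenthesized subquery with an alias and pass it where the table name would go, so MySQL performs the join and Spark only receives the result (host, database, table and column names below are hypothetical):

val prop = new java.util.Properties
prop.setProperty("user", "<username>")
prop.setProperty("password", "<password>")

// MySQL evaluates the join; Spark sees a single derived table named `t`.
val joinedQuery =
  "(SELECT o.id, o.amount, c.name FROM orders o JOIN customers c ON o.customer_id = c.id) AS t"

val df = sparkSession.read.jdbc("jdbc:mysql://<host>:3306/<db>", joinedQuery, prop)
df.show()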

scala - Resultset to spark Dataframe

Submitted by 我怕爱的太早我们不能终老 on 2019-12-11 07:59:11
Question: I am querying a MySQL table:

import java.sql.{Connection, DriverManager}

val url = "jdbc:mysql://XXX-XX-XXX-XX-XX.compute-1.amazonaws.com:3306/pg_partner"
val driver = "com.mysql.jdbc.Driver"
val username = "XXX"
val password = "XXX"
var connection: Connection = DriverManager.getConnection(url, username, password)
val statement = connection.createStatement()
val patnerName = statement.executeQuery("SELECT id,name FROM partner")

I do get my result in patnerName, but I need it converted to a DataFrame. I am able to print the data with the code below:
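One way to avoid converting the ResultSet by hand, sketched here as a different route from the raw JDBC Statement above (it reuses the url, username and password already defined in the question and assumes an existing SparkSession named spark), is to let Spark run the query itself:

val props = new java.util.Properties
props.setProperty("user", username)
props.setProperty("password", password)
props.setProperty("driver", "com.mysql.jdbc.Driver")

// The SELECT is wrapped as a derived table and read straight into a DataFrame.
val partnerDF = spark.read.jdbc(url, "(SELECT id, name FROM partner) AS p", props)
partnerDF.printSchema()
partnerDF.show()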

Spark: PartitionBy, change output file name

Submitted by 耗尽温柔 on 2019-12-11 07:26:38
Question: Currently, when I use partitionBy to write to HDFS:

DF.write.partitionBy("id")

I get an output structure like this (the default behaviour):

../id=1/
../id=2/
../id=3/

I would like a structure that looks like:

../a/
../b/
../c/

such that if id = 1 the directory is a, if id = 2 it is b, and so on. Is there a way to change the output naming? If not, what is the best way to do this?

Answer 1: You won't be able to use Spark's partitionBy to achieve this. Instead, you have to break your DataFrame into its
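A sketch along the lines of that answer (the id-to-name mapping, the output base path, and the assumption that id is an integer column are all hypothetical): filter out each id's slice and write it under the directory name you want.

import org.apache.spark.sql.functions.col

// Hypothetical mapping from id values to the desired directory names.
val dirFor = Map(1 -> "a", 2 -> "b", 3 -> "c")

// One write per distinct id, each under its mapped directory.
val ids = DF.select("id").distinct().collect().map(_.getInt(0))

ids.foreach { id =>
  val target = dirFor.getOrElse(id, id.toString)   // fall back to the raw id if unmapped
  DF.filter(col("id") === id)
    .drop("id")                                    // partitionBy would also drop the partition column
    .write
    .mode("overwrite")
    .parquet(s"hdfs:///output/$target")            // hypothetical base path
}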

Loop through dataframe and update the lookup table simultaneously: spark scala

Submitted by 我的未来我决定 on 2019-12-11 07:24:25
Question: I have a DataFrame like the following.

+---+-------------+-----+
| id|AccountNumber|scale|
+---+-------------+-----+
|  1|      1500847|    6|
|  2|      1501199|    7|
|  3|      1119024|    3|
+---+-------------+-----+

I have to populate a second DataFrame, which is initially empty, as follows.

id | AccountNumber | scale
1  | 1500847       | 6
2  | 1501199       | 6
3  | 1119024       | 3

Output explanation: the first row in the first DataFrame has a scale of 6. Check for that value minus 1 (so scale equals 5) in the result. There is none, so simply add
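Reconstructing the rule from the example (process rows in id order; if the result already holds a row whose scale is one less than the current scale, reuse that smaller value, otherwise keep the row's own scale), a small-data Scala sketch that collects the rows to the driver and keeps the lookup in a mutable set could look like this. It assumes the three columns are integers, the input DataFrame is named df, and a SparkSession named spark exists.

import scala.collection.mutable
import org.apache.spark.sql.Row

import spark.implicits._

val seen = mutable.Set[Int]()

// Small-data approach: apply the rule sequentially on the driver.
val updated = df.orderBy("id").collect().map { case Row(id: Int, account: Int, scale: Int) =>
  val newScale = if (seen.contains(scale - 1)) scale - 1 else scale
  seen += newScale
  (id, account, newScale)
}

val result = updated.toSeq.toDF("id", "AccountNumber", "scale")
result.show()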

how to get input file name of a record in spark dataframe?

Submitted by 三世轮回 on 2019-12-11 05:47:52
Question: I am creating a dataframe in Spark by loading tab-separated files from S3. I need the input file name of each record in the dataframe for further processing. I tried

dataframe.select(inputFileName())

but I am getting a null value for input_file_name. Could somebody please help me solve this issue?

Answer 1: You can create a new column on the data frame using withColumn and input_file_name():

dataframe.withColumn("input_file", input_file_name())

Source: https://stackoverflow.com
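For completeness, a minimal end-to-end sketch (the S3 path and read options are hypothetical; input_file_name lives in org.apache.spark.sql.functions, and a SparkSession named spark is assumed):

import org.apache.spark.sql.functions.input_file_name

val df = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("s3a://my-bucket/input/*.tsv")     // hypothetical tab-separated files on S3

// Attach the source file of every record as an ordinary column.
val withSource = df.withColumn("input_file", input_file_name())
withSource.select("input_file").distinct().show(false)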

How to get count of invalid data during parse

Submitted by ぐ巨炮叔叔 on 2019-12-11 05:28:53
Question: We are using Spark to parse a big CSV file, which may contain invalid data. We want to save the valid records into the data store and also return how many valid and how many invalid records we imported. I am wondering how we can do this in Spark; what is the standard approach when reading data? My current approach uses an Accumulator, but it is not accurate because of how accumulators work in Spark.

// we define case class CSVInputData: all fields are defined as string
val csvInput = spark.read.option(
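One common way to get exact counts without accumulators (not necessarily the thread's accepted answer; the path and schema below are hypothetical, and a SparkSession named spark is assumed) is to read in PERMISSIVE mode with a corrupt-record column, cache the result, and count the two groups:

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("_corrupt_record", StringType, nullable = true)   // receives the raw line when parsing fails
))

val parsed = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("s3a://my-bucket/big.csv")
  .cache()   // cache before filtering on _corrupt_record, which Spark otherwise restricts

val invalid = parsed.filter(parsed("_corrupt_record").isNotNull)
val valid   = parsed.filter(parsed("_corrupt_record").isNull).drop("_corrupt_record")

println(s"valid: ${valid.count()}, invalid: ${invalid.count()}")
// `valid` can now be written to the data store.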

Spark Data Frame write to parquet table - slow at updating partition stats

Submitted by 自古美人都是妖i on 2019-12-11 04:45:12
Question: When I write data from a dataframe into a partitioned Parquet table, the process gets stuck updating partition stats after all the tasks have completed successfully:

16/10/05 03:46:13 WARN log: Updating partition stats fast for:
16/10/05 03:46:14 WARN log: Updated size to 143452576
16/10/05 03:48:30 WARN log: Updating partition stats fast for:
16/10/05 03:48:31 WARN log: Updated size to 147382813
16/10/05 03:51:02 WARN log: Updating partition stats fast for:

df.write.format("parquet").mode(