spark-dataframe

How to transform DataFrame before joining operation?

坚强是说给别人听的谎言 submitted on 2019-12-29 09:41:47
Question: The following code is used to extract ranks from the column products. The ranks are the second numbers in each pair [...]. For example, given [[222,66],[333,55]], the ranks are 66 and 55 for the products with PK 222 and 333, respectively. But in Spark 2.2 the code runs very slowly when df_products is around 800 MB:

    df_products.createOrReplaceTempView("df_products")
    val result = df.as("df2")
      .join(spark.sql("SELECT * FROM df_products")
      .select($"product_PK", explode($"products").as(
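
A minimal sketch of one way to restructure this (assuming product_PK is the join key on both sides and each element of products is a [PK, rank] pair, as described above): explode df_products once into (product_PK, rank) rows, keep only those two columns, and then join:

    import org.apache.spark.sql.functions.explode
    import spark.implicits._

    val ranks = df_products
      .select($"product_PK", explode($"products").as("pair"))
      .select($"product_PK", $"pair".getItem(1).as("rank"))  // second element of each [PK, rank] pair
    val result = df.join(ranks, Seq("product_PK"))            // assumes df also carries a product_PK column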

How to replace empty values in a column of DataFrame?

狂风中的少年 submitted on 2019-12-29 09:17:14
Question: How can I replace empty values in the column Field1 of DataFrame df?

    Field1      Field2
                AA
    12          BB

This command does not give the expected result: df.na.fill("Field1",Seq("Anonymous")) The expected result:

    Field1      Field2
    Anonymous   AA
    12          BB

Answer 1: Fill: returns a new DataFrame that replaces null or NaN values in numeric columns with value. Two things: an empty string is not null or NaN, so you'll have to use a case statement for that. Fill seems to not work well when giving a text value into a numeric
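
A minimal sketch of the case-statement idea (not the answer's exact code): na.fill only touches null/NaN values, so empty strings are handled explicitly with when/otherwise:

    import org.apache.spark.sql.functions.{col, when}

    val filled = df.withColumn("Field1",
      when(col("Field1").isNull || col("Field1") === "", "Anonymous")
        .otherwise(col("Field1")))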

OUTER JOIN on 2 DATA FRAMES : Spark Scala SqlContext

非 Y 不嫁゛ submitted on 2019-12-29 09:13:07
Question: I am getting an error while doing outer joins on two data frames. I am trying to get the percentile.

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val df = sqlContext.jsonFile("temp.txt")
    val res = df.withColumn("visited", explode($"visited"))
    val result1 = res.groupBy($"customerId", $"visited.placeName").agg(count("*").alias("total"))
    val result2 = res
      .filter($"visited.rating" < 4)
      .groupBy($"customerId", $"visited.placeName")
      .agg(count("*").alias("top"))
    result1.show()
    result2.show()
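
A minimal sketch of the outer join itself (column names follow the aggregates above; the percentage formula is an assumption about what the asker wants): join the per-place totals with the "top" counts, fill missing counts with 0, and derive a ratio:

    import org.apache.spark.sql.functions.col

    val percent = result1
      .join(result2, Seq("customerId", "placeName"), "left_outer")
      .na.fill(0, Seq("top"))
      .withColumn("percent", col("top") * 100.0 / col("total"))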

How to force Spark to evaluate DataFrame operations inline

只愿长相守 submitted on 2019-12-29 08:36:07
Question: According to the Spark RDD docs: "All transformations in Spark are lazy, in that they do not compute their results right away... This design enables Spark to run more efficiently." There are times when I need to do certain operations on my dataframes right then and there. But because dataframe ops are "lazily evaluated" (per above), when I write these operations in the code, there's very little guarantee that Spark will actually execute those operations inline with the rest of the code. For
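
A minimal sketch of the usual way to force execution (the filter condition is just a placeholder): transformations stay lazy until an action runs, so calling an action such as count() is what actually triggers the work, and cache() keeps the computed result around for later steps:

    import org.apache.spark.sql.functions.col

    val prepared = df.filter(col("amount") > 0).cache()  // "amount" is a placeholder column
    prepared.count()  // action: forces the transformations to run here and now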

Array Intersection in Spark SQL

旧巷老猫 submitted on 2019-12-29 08:03:10
Question: I have a table with an array-type column named writer, which has values like array[value1, value2], array[value2, value3], etc. I am doing a self join to get results that have common values between the arrays. I tried:

    sqlContext.sql("SELECT R2.writer FROM table R1 JOIN table R2 ON R1.id != R2.id WHERE ARRAY_INTERSECTION(R1.writer, R2.writer)[0] is not null ")

And

    sqlContext.sql("SELECT R2.writer FROM table R1 JOIN table R2 ON R1.id != R2.id WHERE ARRAY_INTERSECT(R1.writer, R2.writer)[0] is
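
A minimal sketch, assuming Spark 2.4+ where array_intersect is a built-in SQL function (older versions would need a UDF for the intersection):

    val common = spark.sql("""
      SELECT R1.id, R2.writer
      FROM table R1 JOIN table R2 ON R1.id != R2.id
      WHERE size(array_intersect(R1.writer, R2.writer)) > 0
    """)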

Converting Pandas dataframe into Spark dataframe error

蹲街弑〆低调 submitted on 2019-12-28 08:17:10
Question: I'm trying to convert a Pandas DataFrame into a Spark one. DF head:

    10000001,1,0,1,12:35,OK,10002,1,0,9,f,NA,24,24,0,3,9,0,0,1,1,0,0,4,543
    10000001,2,0,1,12:36,OK,10002,1,0,9,f,NA,24,24,0,3,9,2,1,1,3,1,3,2,611
    10000002,1,0,4,12:19,PA,10003,1,1,7,f,NA,74,74,0,2,15,2,0,2,3,1,2,2,691

Code:

    dataset = pd.read_csv("data/AS/test_v2.csv")
    sc = SparkContext(conf=conf)
    sqlCtx = SQLContext(sc)
    sdf = sqlCtx.createDataFrame(dataset)

And I got an error: TypeError: Can not merge type <class 'pyspark.sql.types

How to improve performance for slow Spark jobs using DataFrame and JDBC connection?

耗尽温柔 submitted on 2019-12-27 11:45:26
Question: I am trying to access a mid-size Teradata table (~100 million rows) via JDBC in standalone mode on a single node (local[*]). I am using Spark 1.4.1, set up on a very powerful machine (2 CPUs, 24 cores, 126 GB RAM). I have tried several memory setups and tuning options to make it work faster, but none of them made a huge impact. I am sure there is something I am missing, and below is my final try, which took about 11 minutes to get this simple count, versus only 40 seconds using a JDBC
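
A minimal sketch of a partitioned JDBC read (URL, table name, and bound values are placeholders): giving Spark a numeric partitionColumn plus bounds lets it issue several parallel queries instead of pulling the whole table through a single connection:

    val jdbcDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:teradata://host/DATABASE=mydb")  // placeholder URL
      .option("dbtable", "my_table")
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")   // a numeric column to split the reads on
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "24")
      .load()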

Unable to read, and later query text file in Apache Spark

柔情痞子 submitted on 2019-12-25 17:17:20
Question: So I am trying to implement the example from the Spark programming guide using a dataset we have. It is a file whose fields are separated by |. However, it throws the following error even after following the instructions as given. I can see it is unable to "cast" an object of one class into another; any advice as to how to handle this scenario?

    Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD
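
This does not address the ClassCastException itself, but as a minimal sketch of the file-reading part (assuming Spark 2.x; the path is a placeholder), a pipe-separated file can be read directly with the DataFrame CSV reader instead of splitting lines by hand:

    val df = spark.read
      .option("sep", "|")
      .option("header", "false")
      .csv("people.txt")  // placeholder path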

how to change header of a data frame with another data frame header?

ぐ巨炮叔叔 submitted on 2019-12-25 11:35:39
Question: I have a data set which looks like this:

    LineItem.organizationId|^|LineItem.lineItemId|^|StatementTypeCode|^|LineItemName|^|LocalLanguageLabel|^|FinancialConceptLocal|^|FinancialConceptGlobal|^|IsDimensional|^|InstrumentId|^|LineItemSequence|^|PhysicalMeasureId|^|FinancialConceptCodeGlobalSecondary|^|IsRangeAllowed|^|IsSegmentedByOrigin|^|SegmentGroupDescription|^|SegmentChildDescription|^|SegmentChildLocalLanguageLabel|^|LocalLanguageLabel.languageId|^|LineItemName.languageId|^
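
A minimal sketch of replacing one DataFrame's header with another's (dataDF and headerDF are placeholder names; both frames are assumed to have the same number of columns): take the column names of the frame that carries the header and apply them to the data frame:

    val renamed = dataDF.toDF(headerDF.columns: _*)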

Another instance of Derby may have already booted the database /home/cloudera/metastore_db

时光怂恿深爱的人放手 submitted on 2019-12-25 11:14:24
Question: I am trying to load a normal text file into a Hive table using Spark. I am using Spark version 2.0.2. I did this successfully in Spark version 1.6.0, and I am trying to do the same in version 2.x. I executed the steps below:

    import org.apache.spark.sql.SparkSession
    val spark = SparkSession.builder().appName("SparkHiveLoad").master("local").enableHiveSupport().getOrCreate()
    import spark.implicits._

There is no problem until now. But when I try to load the file into Spark:

    val partfile =
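
The Derby error in the title usually means another process (for example an earlier spark-shell or Hive session) still holds the lock on the embedded metastore directory, so closing that session first is the simplest cure. As a minimal sketch of an alternative (the config key and path here are assumptions, not the asker's setup), a session can be pointed at its own Derby database so it does not compete for /home/cloudera/metastore_db:

    val spark = SparkSession.builder()
      .appName("SparkHiveLoad")
      .master("local")
      .config("spark.hadoop.javax.jdo.option.ConnectionURL",              // assumed way to pass the Hive metastore property
              "jdbc:derby:;databaseName=/tmp/spark_metastore_db;create=true")  // placeholder path
      .enableHiveSupport()
      .getOrCreate()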